[Email-SIG] fixing the current email module

Fri Oct 9 17:10:18 CEST 2009

Glenn Linderman writes:

 > Emacs is different than email.  Either you can read a file to edit it, 
 > or you can't.

*sigh* Emacs is as powerful a programming environment as Python, and
applications regularly deal with network streams (HTTP, NNTP, and SMTP
most commonly, but also raw X protocol and any kind of socket
supported by the platform).  So, yes, it's different from email,
because it's *far* more general.  That's precisely why I appreciate
Bill's concerns about non-email usage.

 > The Postel principle for email says to try to do the best you can,
 > for as much as you can.

Actually, it doesn't.  It says be lenient in what you accept, strict
in what you emit.  You accept it ... but you don't have to do
anything with it except preserve it verbatim for whoever wants it.

 > >  > produce a defect report, but then simply converted to Unicode as if it 
 > >  > were Latin-1 (since there is no other knowledge available that could 
 > >  > produce a better conversion).
 > >
 > > No, that is already corruption.  Most clients will assume that string
 > > is valid as a header, because it's valid as a string.
 > 
 > Sure it is corruption.  That's why there is a defect report.  But
 > the conversion technique is appropriate, per the Postel principle.

Actually, I would say you are emitting leniently, in violation of the
Postel principle.  You don't know what the client will do; it may
swallow the string in a single gulp without looking at it.  Thus you
should avoid converting anything you can't identify (unless
specifically asked to do your best).

 > Again, I mentioned producing a defect report.  That is not passing
 > an error silently.

But if I access that Unicode object without looking at the defect
report, you *will* pass the error silently.  OTOH, if I look at the
defect report, I won't access the Unicode object.
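
To make that concrete, here is a minimal sketch of the behaviour I'm
arguing for.  The names RawHeader, HeaderDefect and best_effort are
hypothetical, invented for illustration; none of this is the email
module's API.  The raw bytes are always preserved, and the Latin-1
guess is handed out only on explicit request.

```python
class HeaderDefect(Exception):
    """Hypothetical exception: the raw header bytes could not be decoded."""


class RawHeader:
    """Hypothetical container, not part of the email module's API."""

    def __init__(self, raw: bytes):
        self.raw = raw          # always preserved verbatim
        self.defects = []       # filled in when decoding fails
        try:
            self._text = raw.decode("ascii")
        except UnicodeDecodeError as exc:
            self._text = None
            self.defects.append(str(exc))

    def text(self, best_effort: bool = False) -> str:
        if self._text is not None:
            return self._text
        if best_effort:
            # Only on explicit request: Latin-1 maps every byte to a
            # code point, so this never fails, but it is a guess.
            return self.raw.decode("latin-1")
        # Otherwise the defect cannot be ignored by accident.
        raise HeaderDefect(self.defects[0])
```

A client that never looks at the defect list still can't get a
corrupt string without asking for one.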

 > It is still raw user input, and should still be checked for proper 
 > syntax by the client,

Nonsense.  The email module had better know a lot more about syntax
than the client.  If it doesn't, whack it with a 2x4 until it learns!

 > produces no defect report.  If you don't want to check proper syntax in 
 > your program inputs, I don't want to use your programs, they will be 
 > insecure.

So you're saying that every program that uses the email module should
reproduce 100% of the functionality of the email module's parser, or
it's insecure.  And you imply that's an excuse for passing corrupt
data to any client that asks for it.

I disagree.

 > So there seem to be two techniques:

Whatever gave you that idea?

 > 2) Store the data, and convert only if the data is accessed.

 > With technique 2, little effort is required to store the data,
 > create a state variable to indicate whether it has been converted

Why do that?  It's always "False" in technique 2.

 > and parsed, or not, and then IF (and only IF) the data is accessed,
 > the conversion and parsing must be done on the first access, and
 > instead of creating and storing metainformation about the errors,
 > they could just be raised.

No, they cannot just be raised.  If you just raise the error, then the
next time you try to access unparsed data, you'll hit the error
again.  If you use the same handler you did before, you're in an
infloop.  So you need either a second handler that does things
differently this time, or a flag ... but it's unclear to me that a
boolean flag would suffice.  So you may as well store the defect list
and information about where to restart.
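
A rough sketch of what I mean, assuming a hypothetical LazyField
class (the class, the State enum and the attempts counter are all
illustrative, not anything in the email module).  Note that the state
has three values, not two, and that the defect records where the
failure happened.

```python
import enum


class State(enum.Enum):
    UNPARSED = "unparsed"
    PARSED = "parsed"
    DEFECTIVE = "defective"


class LazyField:
    """Hypothetical lazily decoded field; names are illustrative only."""

    def __init__(self, raw: bytes):
        self.raw = raw
        self.state = State.UNPARSED
        self.value = None
        self.defects = []      # (byte offset, description) pairs
        self.attempts = 0      # how many times decoding was actually tried

    def get(self):
        if self.state is State.UNPARSED:
            self.attempts += 1
            try:
                self.value = self.raw.decode("ascii")
                self.state = State.PARSED
            except UnicodeDecodeError as exc:
                # Remember what failed and where, instead of hitting
                # the same error again on every later access.
                self.defects.append((exc.start, exc.reason))
                self.state = State.DEFECTIVE
        return self.value
```

A second call to get() on a defective field does not retry the
decode; the caller inspects state and defects instead.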

 > So the Pythonic way, AFAIU, is that errors are returned out-of-band
 > via raised exceptions.

Sure.  But what you're missing is that "Neither rain, nor snow, nor
dark of night may stop the Parser on her appointed rounds."  It is not
easy to write parsers, but I'll tell you one thing: it's orders of
magnitude harder to write a parser that starts in the middle and works
outward, than one that starts at the beginning and works forward to
the end.

So it's OK to write a lazy parser, but it must retain enough state so
that it can work forward until the end.  Because you don't know that
the client will not request the last character of the message, you
need to be able to try to get it, no matter what happened to the first
10GB of the message.  And if an exception occurs, it must be handled
by the parser itself; if not, you put the poor thing in the position
of starting over at the beginning (that way lies the madness of
infloops), or trying to start a parse in the middle and work out.
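
As an illustration only, a forward-working lazy parser might look
something like the sketch below.  It assumes a toy "Name: value"
line format, and ForwardParser with its attributes is a hypothetical
name, not a proposal for the real parser.  The point is that pos
only ever moves forward and that exceptions are caught inside the
parser and turned into defects.

```python
class ForwardParser:
    """Hypothetical lazy parser over toy "Name: value" lines."""

    def __init__(self, data: bytes):
        self.lines = data.split(b"\n")
        self.pos = 0           # index of the next unparsed line
        self.fields = []       # (name, value), or None for a defective line
        self.defects = []      # (line index, description) pairs

    def _parse_one(self):
        raw = self.lines[self.pos]
        try:
            name, value = raw.decode("ascii").split(":", 1)
            self.fields.append((name.strip(), value.strip()))
        except (UnicodeDecodeError, ValueError) as exc:
            # Handled inside the parser: record the defect, keep going.
            self.fields.append(None)
            self.defects.append((self.pos, str(exc)))
        self.pos += 1

    def field(self, index: int):
        # Parse forward only as far as needed; never restart from zero.
        while self.pos <= index and self.pos < len(self.lines):
            self._parse_one()
        return self.fields[index]
```

Asking for the last field works even when an earlier line is
garbage, and asking for an earlier field afterwards does no new
parsing.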