[Email-SIG] fixing the current email module

Sat Oct 10 22:01:46 CEST 2009

On approximately 10/10/2009 6:59 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> I'm running out of time to work on this (yeah, I know it's the
> weekend, but my life is like that lately).  I think we're converging,
> though, so I'd like to try and tie some of those ends together.
>   

I think we are converging too... mostly, terminology issues and 
assumptions were causing a bit of misunderstanding.

> Glenn Linderman writes:
>  > On approximately 10/9/2009 8:10 AM, came the following characters from 
>  > the keyboard of Stephen J. Turnbull:
>
>  > > Actually, I would say you are emitting leniently, in violation of the
>  > > Postel principle.  
>  > 
>  > You can say that, but I don't have to believe it.  I'm talking about 
>  > accepting; the message has arrived, it is here, the client is trying to 
>  > look at it, and I'm talking about ways the client can look at 
>  > not-quite-perfect data, knowing that it is not quite perfect, but still 
>  > being able to see it.  I'm not at all talking about emitting data.
>
> It would be indeed, if the corrupt data is stored in the place where
> correctly decoded data normally is stored, and is accessible in the
> same way.  But I gather that's not what you were talking about, my
> mistake.
>   

Well, the client tells us where to store it, and we can't prevent it 
from being the same place.  But accessible in the same way?  Not.  Some 
extra parameter or a different API would surely be required to get 
not-quite-perfect data.

>  > You seem to be calling the email package helping the client to
>  > accept not-quite-perfect data, as a form of emitting data.  It is
>  > not.
>
> No, I was confused by the way you wrote.  Saving the data *somewhere*
> is absolutely necessary; not losing data is the #1 commandment of
> low-level mail processing.  Surely the email module is subject to that
> commandment.  *Nobody* is talking about losing any data yet, except
> Barry indirectly when he says that some people would consider giving
> up on invertibility (often called "idempotency"), and even he is quite
> adamant that he's not going to give up on that.
>
> So when you wrote about saving and converting to text form, without
> mentioning the specific APIs, I assumed you meant the "mainline"
> APIs for parsing and accessing parts of a correctly formatted message.
>   

Mostly, I hadn't bothered about APIs yet; I'm not yet very familiar with 
the existing ones, because neither nPOPuk nor SeaMonkey nor Thunderbird, 
the only email programs that I have looked at source code for, use the 
Python email package!  So while I'm reasonably familiar with the RFCs, 
and quite familiar with nPOPuk source, and have looked at a small 
fraction of the SeaMonkey/Thunderbird source code (and been amazed at 
how big it is), and have examined email from a large variety of sources 
comparing it to the RFCs to see where it goes wrong and why it doesn't 
display in SeaMonkey/Thunderbird the same way as in Outlook/Outlook 
Express (or other programs), and have found Outlook 2000 and Apple Mail 
to be quite creative in interpreting the RFCs, I'm new to the Python 
email package.

>  > The email package cannot police the client... if it chooses to "eat it 
>  > in a single gulp without looking at it" then it may get indigestion.  I 
>  > never suggested that "converting to Unicode as if it were Latin-1" 
>  > should be done without informing the client, or being requested by the 
>  > client to do that via a special API call...
>
> Well, maybe I misread it, but it certainly looked like that to me.  I
> would not object to that special API call defaulting to ISO 8859/1.
>
>  > If you ignore defect reports, you are ignorant (blunt, but not intended 
>  > to be offensive).
>
> What I worried about is that if defect reports are present, *but
> displayable data is also present*, programmers *will* simply display
> it, for example in producing a prototype program.  It will be
> impossible to determine without very close analysis of that program
> that an early version became a production version without adding
> appropriate checks.  In practice, this bug will be discovered when
> some end user's installation breaks.
>
> It seems that you agree with this, and because the special API call is
> necessary, it will be easy to identify whether proper care is being
> taken or not.  Right?
>   

Well, yes and no. 

I think that the email package should require the client to take some 
special action to request not-quite-perfect data: either a special 
parameter value, or a different API, etc.  But nothing prevents a 
client from passing that all the time and ignoring the defect reports.  
Whether that is easy to identify, and whether the email package wants 
to require that the normal APIs be tried before the not-quite-perfect 
APIs, are issues for discussion.

Ultimately, the email package cannot enforce that proper care is taken 
by the client; only code reviews of the client can encourage that.
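
As a minimal sketch of the "special parameter" idea (the function name 
decode_text and the lenient flag are made up for illustration; they are 
not part of the email package), the strict path raises, and only an 
explicit request yields Latin-1 text together with a defect report:

```python
def decode_text(raw: bytes, charset: str, lenient: bool = False):
    """Hypothetical accessor: returns (text, defects).

    Strict by default: undecodable data raises.  Only when the client
    explicitly passes lenient=True is the data decoded as Latin-1.
    """
    try:
        return raw.decode(charset), []
    except (UnicodeDecodeError, LookupError) as exc:
        if not lenient:
            raise
        # Latin-1 maps every byte to a code point, so this cannot fail
        # and the original bytes can be recovered by re-encoding.
        defect = "could not decode as %s: %s" % (charset, exc)
        return raw.decode("latin-1"), [defect]


text, defects = decode_text(b"caf\xe9", "utf-8", lenient=True)
```

A client can still pass lenient=True everywhere and throw the defects 
away, but at least that choice is visible at every call site.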

>  > >  > It is still raw user input, and should still be checked for proper 
>  > >  > syntax by the client,
>  > >
>  > > Nonsense.  The email module had better know a lot more about syntax
>  > > than the client.  If it doesn't, whack it with a 2x4 until it learns!
>  > 
>  > I think we are talking at cross purposes here.  I find it quite 
>  > difficult to follow where you cross the boundary between talking about 
>  > one sort of email package client, and then switch to another type, or 
>  > switch to the responsibilities of the email package.
>
> Excuse me?  The "raw user input" you referred to above is material
> that the client software receives from the email package.  The email
> package should give it to the client in the "normal" (convenient) way
> only if it can certify that it conforms to the appropriate standard.
>   

Yes, agreed.  And there should be a special way (or ways) to invoke 
various algorithms for attempting to interpret not-quite-perfect data, 
when the client thinks that might be useful.  Then the client has 
"tweaked" user input.

> That standard should be specified in the API documentation.  Any more
> detailed structure, of course, is the responsibility of the client.
>   

Right.  And it is the more detailed structure that I was referring to... 
Even if the structure of the email is incorrect, if the client can find 
its input among the various attempts to obtain data from the 
not-quite-perfect email message, and can validate and check its input, 
it may choose to process it even if the email message is imperfect... it 
should probably note somewhere that the email message from which the 
data was obtained was not perfect, but really, that is up to the client 
to figure out, based on its requirements.

>  > An application which is using email as a transport, has specific goals, 
>  > which require specific content.  You were mentioning clients.
>
> I've already said that when I speak of an MUA, I write "MUA".  In
> speaking of the calling program, which might even be a user running
> the module via the Python interpreter, I write "client".  It's a very
> convenient way to describe the user of an API, in contrast to the
> provider of the API (the implementation).
>   

Yep, so I think my "application" and your "client" are the same thing.  
I'm trying to use your term as I continue responding in these threads; 
it is reasonable.

>  > If such a client doesn't validate the syntax of that content, it
>  > isn't much of an application.
>
> If that MUA or email application uses RFC 822 addresses, it should be
> able to rely on the email module to parse those addresses correctly,
> or provide a defect report.  One might even go so far as to suggest
> that it be able to parse the (non-RFC, but very common) "+" notation
> for separating the "mailbox" from "additional data" used for VERP and
> challenge-response applications.  That would have to be documented,
> but if so documented client applications like the MUA should be able
> to rely on it (and you can bet many will).
>   

Hmm.  This is an interesting digression...

"+", according to the RFCs, is just another of the legal characters that 
can be found before the @ in an unquoted email address... the list is 
!#$%&'*+-/=?^_`{}|~ in addition to the alphanumerics.
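
For illustration, here is a small check built from that character list 
(the "atext" set of RFC 5322, which also allows single dots between 
runs of such characters).  The helper name is invented for this sketch, 
and quoted local parts are deliberately not handled:

```python
import re

# atext: alphanumerics plus the punctuation listed above.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
# Unquoted local part: runs of atext separated by single dots.
LOCAL_PART = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_unquoted_local_part(local: str) -> bool:
    """Return True if `local` is legal before the @ without quoting."""
    return LOCAL_PART.fullmatch(local) is not None


plus_is_legal = is_unquoted_local_part("glenn+python")
```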

How a particular email server interprets the "stuff before the @" is 
pretty much up to it... so as long as it does something appropriate, it 
can interpret all or a fraction of it as a mailbox name, or it could 
intuit a mailbox name from the body content if it wants, or even from a 
special header.  So yeah, particular interpretations of the address are 
non-RFC stuff.
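
One such interpretation, the "+" convention Stephen mentions, could be 
sketched like this.  The function name and the choice of separator are 
assumptions; nothing in the RFCs requires a server to behave this way:

```python
def split_subaddress(local: str, sep: str = "+"):
    """Split a local part into (mailbox, detail) at the first `sep`.

    This is a server-side convention only; `detail` is None when the
    separator is absent or nothing follows it.
    """
    mailbox, _, detail = local.partition(sep)
    return mailbox, (detail or None)


mailbox, detail = split_subaddress("glenn+python")
```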

> Application domain syntax of course is not the email module's problem
> whether it arrives by email or Pony Express, and I'm really confused
> why you're going so far afield.
>   

Just to point out that good data can be obtained from bad email 
messages, I think, and that that is a use case.

>  > > No, they cannot just be raised.  If you just raise the error, then the
>  > > next time you try to access unparsed data, you'll hit the error
>  > > again.  If you use the same handler you did before, you're in an
>  > > infloop.  So you need a second handler to do things differently this
>  > > time or a flag ... but it's unclear to me that that flag can be a
>  > > boolean.  So you may as well store the defect list and information
>  > > about where to restart.
>  > 
>  >  From the point of view of the email package, the errors can just be 
>  > raised.  Then the client can make choices, and use other APIs or other 
>  > parameters to the API to direct the email package to attempt a different 
>  > technique to access the data.
>
> The problem is that by this point some of the state of the parse may
> be lost.  We can't say "just raise", we need to say "interrupt the
> parse, preserve state, and then raise".   Python does absolutely
> nothing to help with the problem of preserving the state.  We also
> need to determine just what state to preserve.
>
>  > Yes, I have learned that in my 34 years of programming.  I agree.
>  > 
>  > > So it's OK to write a lazy parser, but it must retain enough state so
>  > > that it can work forward until the end. [...]
>  > 
>  > Are you speaking about parsing the message into MIME parts, or parsing a 
>  > particular MIME part contained within the message, or both?
>
> Both.  I *believe* (but it needs to be checked) that in a correctly
> formed multipart MIME object (message or part), any internal structure
> is context-free within the MIME boundaries.  If that is so, then
> individual parts of the object can be stored in raw form and parsed
> lazily.
>
> Similarly, for any MIME or RFC 822 object, the object can be parsed
> into header section and body section, and each can be stored and
> parsed lazily, subject to the condition that the header section must
> be sufficiently parsed to identify all headers that might affect
> parsing the body part before the body part is parsed.  That
> "condition" is the context.
>   

Neither of these context conditions apply to correctly formed MIME 
trees, but are the only context I'm aware of that can affect parsing of 
MIME parts, AFAIK (and I just reread most of the MIME RFCs in the last 
few days).
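
The header-before-body ordering Stephen describes could be sketched as 
below.  LazyPart is a hypothetical class, not anything in the existing 
email package: it keeps both sections as raw bytes and only parses the 
header section on first access.

```python
from functools import cached_property


class LazyPart:
    """Hypothetical container: stores raw bytes, parses on demand."""

    def __init__(self, raw: bytes):
        # Split once at the first blank line; nothing is parsed yet,
        # so the original bytes of both sections are preserved.
        self.raw_header, _, self.raw_body = raw.partition(b"\r\n\r\n")

    @cached_property
    def headers(self):
        """List of (lowercased name, raw value) pairs, parsed once."""
        unfolded = []
        for line in self.raw_header.split(b"\r\n"):
            if line[:1] in (b" ", b"\t") and unfolded:
                unfolded[-1] += line  # folded continuation line
            elif line:
                unfolded.append(line)
        pairs = []
        for line in unfolded:
            name, _, value = line.partition(b":")
            pairs.append((name.decode("ascii", "replace").strip().lower(),
                          value.strip()))
        return pairs


part = LazyPart(b"Content-Type: multipart/mixed;\r\n boundary=outer\r\n"
                b"Subject: hi\r\n\r\nbody")
```

A body parser built on top of this would read part.headers (for 
Content-Type and Content-Transfer-Encoding, say) before touching 
part.raw_body.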

The only context for parsing MIME parts that I'm aware of is that, when 
determining the end of a nested MIME part, the search for the ending 
delimiter must include searching for any higher-level delimiter as 
well... to handle the case where the inner delimiter got lost.  So one 
should search for CR LF --, and then examine the stuff after the -- to 
match first the innermost delimiter, and then the next outermost, etc., 
and if finding a match, considering that it is the end of all the parts 
nested within the delimiter found, the inner ones being considered 
truncated, since their own delimiter was not found.

Unexpected end-of-data should also mark all unterminated nested MIME 
parts as incomplete, of course.
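
A rough sketch of that search, under some simplifying assumptions: the 
data has already been split into lines (so "CR LF --" becomes "line 
starts with --"), boundaries are compared exactly after stripping 
trailing whitespace, and the function name is invented here:

```python
def find_part_end(lines, open_boundaries):
    """Find where the innermost open MIME part ends.

    `open_boundaries` lists the boundary strings of all enclosing
    multiparts, outermost first.  Returns (line_index, truncated),
    where `truncated` holds the boundaries of the nested parts whose
    own delimiter was never seen.
    """
    for index, line in enumerate(lines):
        if not line.startswith("--"):
            continue
        candidate = line[2:].rstrip()
        # Try the innermost delimiter first, then work outward.
        for depth in reversed(range(len(open_boundaries))):
            boundary = open_boundaries[depth]
            if candidate in (boundary, boundary + "--"):
                return index, open_boundaries[depth + 1:]
    # Unexpected end of data: every open part is incomplete.
    return len(lines), list(open_boundaries)


end, truncated = find_part_end(
    ["inner text", "--outer", "next part"], ["outer", "inner"])
```

Here the inner delimiter was lost, so the match on "outer" ends the 
inner part as well and reports it as truncated.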

The only other cross-part context that I am aware of is Content-ID 
references.  That doesn't affect parsing, but rather semantic 
interpretation, after parsing, validation, and decoding is complete.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking