[Email-SIG] fixing the current email module

Sun Oct 11 07:15:49 CEST 2009

On approximately 10/10/2009 8:23 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>  > On approximately 10/10/2009 6:59 AM, came the following characters from 
>  > the keyboard of Stephen J. Turnbull:
>  > > Glenn Linderman writes:
>  > >  > On approximately 10/9/2009 8:10 AM, came the following characters from 
>  > >  > the keyboard of Stephen J. Turnbull:
>
>  > > correctly decoded data normally is stored, and is accessible in the
>  > > same way.  But I gather that's not what you were talking about, my
>  > > mistake.
>  > 
>  > Well, the client tells us where to store it, and we can't prevent it 
>  > from being the same place.
>
> Huh?  No way!  We decide where our data is stored.  This isn't C where
> you pass around arbitrary pointers for efficiency.  In particular,
> strings (whether Unicode or bytes) are not mutable.  So the client can
> keep a copy if it likes, but once it hands us raw message text as
> bytes, after that we decide where we put parsed pieces and/or slices
> of the unparsed original.
>   

Yes, email package can figure out where to store its copy, client 
figures out where to store its copy.

We're getting better at communicating, but not 100% there yet :)  I was 
thinking of the case where the client asks the email package for data, 
and stores it in its variable; you seem to be thinking of the case where 
the client gives the email package data.

>  > > So when you wrote about saving and converting to text form, without
>  > > mentioning the specific APIs, I assumed you meant the "mainline"
>  > > APIs for parsing and accessing parts of a correctly formatted message.
>  > 
>  > Mostly, I hadn't bothered about APIs yet;
>
> You may not bother about APIs, but it sure looks like you do to me.
> You can't talk about where to store stuff without touching the API.
>   

Well, I'm sure there will be APIs; the names and parameters are what I 
haven't bothered about much yet, except where the discussion seemed to 
require it.

>  > I think that the email package should require that some special action 
>  > needs to be taken by the client to request not-quite-perfect data, 
>  > either a special parameter value, or different API, etc.
>
> That's all I need to hear, until we're ready to write specs for that
> API.  (Note that a special parameter value is part of the API in a
> sense, if we specify and document what it means, so I tend to use API
> for that, too, not just for whole functions.)
>   

Yes, I was just trying to be clear that it could be either case.
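
To make the "special action" idea concrete, here is a minimal sketch of 
the separate-API variant.  Everything in it is hypothetical: the names 
Message, HeaderDefect, get_header and get_raw_header are invented for 
illustration, and ASCII decoding stands in for real header parsing.

```python
class HeaderDefect(Exception):
    """Hypothetical: a header value could not be decoded cleanly."""


class Message:
    """Hypothetical container, not the real email.message.Message."""

    def __init__(self, raw_headers):
        # raw_headers maps a header name to its undecoded bytes value.
        self._raw = dict(raw_headers)
        # Defects are recorded as they are discovered (lazy parsing).
        self.defects = []

    def get_header(self, name):
        """Mainline API: return decoded text, or raise HeaderDefect."""
        raw = self._raw[name]
        try:
            # Assumption: plain ASCII stands in for full header decoding.
            return raw.decode("ascii")
        except UnicodeDecodeError as exc:
            defect = HeaderDefect(f"{name}: {exc.reason}")
            self.defects.append(defect)
            raise defect

    def get_raw_header(self, name):
        """Explicit opt-in API: return the unparsed bytes as received."""
        return self._raw[name]
```

A client that wants not-quite-perfect data has to call get_raw_header by 
name, so a reviewer can at least find every place where that happens.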

>  > But there is nothing that says that some client might not pass that
>  > all the time, and ignore the defect reports.  Whether that is easy
>  > to identify or not, and whether the email package wants to require
>  > that the normal APIs be tried before the not-quite-perfect APIs are
>  > issues for discussion.
>
> The answers are obvious to me: yes and no.  You can identify whether a
> particular API has been used with standard text search tools like M-x
> occur.  (For non-Emacsers, that is an Emacs command that finds all
> occurrences of a particular string in the buffer.)  If a program wants
> to call the quick & dirty APIs first, that's none of our business,
> except that if parsing is being done lazily we should be careful to
> update the defect list, so that the program can check them when it
> wants to.
>
>  > Ultimately, the email package cannot enforce that proper care is taken 
>  > by the client; only code reviews of the client can encourage that.
>
> My point is not to enforce anything, not even code reviews.  But by
> having separate APIs for parsed and unparsed data, code review can be
> made easier and more accurate.
>   

You have to analyze the control flow as well, not just search for 
existence of the API.  In normal code, that should be straightforward, 
but there is no guarantee that the client doesn't use spaghetti code, or 
even obfuscated code, where the analysis would be hard.  The API call 
could exist, but never be invoked; the API call could take parameters 
that never have particular values of interest at run-time.  Hence, it 
may or may not be easy to search the client code and figure it out.  But 
I agree with your stated point: we can't enforce anything about the 
client code, unless we write it ourselves, or have some sort of authority 
over it.  I intend to write a client, so I'll have control over that 
one, and don't plan to obfuscate it.

>  > Yes, agreed.  And a special way or ways to get various algorithms for 
>  > attempting to interpret not-quite-perfect data, when the client thinks 
>  > that might be useful.
>
> I don't think we should be talking about special ways (plural) or
> "not-quite-perfect" data.  At this point in the design process, we
> have *parsed* and *unparsed* data.  Heuristic algorithms for
> recovering from unparsable input can be layered on top of these two
> sets of APIs, when we have *real* use cases for them.  For example, I
> don't think your use case of prepending a mailing list's topic or
> serial number to an unparseable subject is realistic; in all lists I
> know of such a message would be held for moderation, or even discarded
> outright as spam.
>   

So if the subject is unparseable, what is the moderator to do?  He can't 
read the subject if it is unparseable.  Perhaps he can read the body, but 
it might be in the same unparseable charset.  Let's say he can read the 
body, the message seems to be valid for the list, and he marks it to 
be forwarded to list members.  Now what is the mailing list to do?  It 
still can't parse the subject.

And if there is no moderator, it still may not be spam, just a mailing 
list manager that doesn't understand a valid charset, likely because it 
predates the definition of the charset.
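
As a rough sketch of what a list manager could do in that situation, the 
tag can be prepended to the raw bytes without decoding anything.  The 
helper name tag_raw_subject is made up for this example, and it ignores 
header folding and line-length limits.

```python
def tag_raw_subject(raw_value: bytes, tag: str) -> bytes:
    """Prepend an ASCII list tag to an undecoded Subject value.

    Hypothetical helper: the original bytes pass through untouched,
    so no knowledge of the sender's charset is required.
    """
    prefix = f"[{tag}] ".encode("ascii")
    if raw_value.startswith(prefix):
        # Already tagged (e.g. a reply to a list message); leave it alone.
        return raw_value
    return prefix + raw_value
```

For example, tag_raw_subject(b"=?x-unknown?B?zfjT0Q==?=", "Email-SIG") 
leaves the encoded word intact behind the "[Email-SIG] " prefix, so a 
recipient whose software does know the charset can still read it.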

> And again:
>
>  > Right.  And it is the more detailed structure that I was referring to... 
>
> But why?  There is no need to discuss it at this point, and bringing
> it up is confusing as all get-out.
>   

The more we understand and discuss how different clients can function, 
the better we can design the email package.  We still won't likely 
cover all the possibilities, but we don't want to have tunnel vision and 
declare that because Mailman works this way, all mailing list 
managers work this way, or that because we haven't discussed some 
client doing something a particular way, none will.  So I have no problem 
bringing clients into the discussion, to make sure that we don't 
preclude their reasonable behaviors as use cases.

>  > How a particular email server interprets the "stuff before the @" is 
>  > pretty much up to it... so as long as it does something appropriate, it 
>  > can interpret all or a fraction of it as a mailbox name, or it could 
>  > intuit a mailbox name from the body content if it wants, or even from a 
>  > special header.  So yeah, particular interpretations of the address are 
>  > non-RFC stuff.
>
> Right.  To riff on the RFC vs. not theme ["Barry, pick up the bass
> line, need more bottom here!"], I think we should pick a list of RFCs
> we "promise" to implement as "defining" email; if we reserve any
> structures as "too obscure for us to parse," we should say so (and
> reference chapter and verse of the Holy RFC).  On the other hand, of
> course as we discover common use cases for which precise
> specifications can be given, we should be flexible and implement them.
> But there should be no rush.
>
> Which RFCs?
>
> First of all, the STD 11 series (RFCs 733, 822, 2822, 5322).  Here we
> have to worry about the standard's recommended format vs. the obsolete
> format because of the Postel principle.  AFAIK, there is no reason not
> to insist on *producing* strictly RFC 5322 conformant messages, but I
> think we should implement both strict and lax parsers.  The lax parser
> is for "daily use", the strict parser for validation.
>
> Second, the basic MIME structure RFCs: 2045-2049, 2231.  (Some of
> these have been at least partially superseded by now, I think.)
>
> The mailing list header RFCs: 2369 and 2919.
>
> Not RFCs, per se, but an auxiliary module should provide the
> registered IANA data for the above RFCs.
>
> Strictly speaking outside of the email module, but we make use of URLs
> (RFC 3986 -- superseded?) and mimetypes data (this overlaps
> substantially with the "registered IANA data").  We need to coordinate
> with the responsible maintainers for those.
>
> Ditto coordinating with modules that we share a lot of structure with,
> the "not email but very similar" like HTTP (RFC 2616), and netnews
> (NNTP = RFC 3977 and RFC 1036).
>
> Which extensions?
>
> Er, don't you think the above is enough for now?<wink>
>   

It's a good list, yes.
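
For readers who want to see the strict vs. lax parser distinction in 
running code, here is a small sketch.  It relies on the email.policy 
module, which was added to Python well after this thread, so take it as 
one possible shape of the idea and not as what is being specified here.  
The sample message is invented; it claims to be multipart but gives no 
boundary parameter.

```python
import email
from email import errors, policy

# Invented sample: multipart content type with no boundary parameter.
RAW = (
    b"From: a@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: multipart/mixed\r\n"
    b"\r\n"
    b"body text\r\n"
)

# Lax ("daily use"): parsing succeeds and problems are recorded
# on the message object for the client to inspect if it cares.
lax = email.message_from_bytes(RAW, policy=policy.default)
lax_defects = [type(d).__name__ for d in lax.defects]

# Strict (validation): the first defect is raised as an exception.
try:
    email.message_from_bytes(RAW, policy=policy.strict)
    strict_error = None
except errors.MessageDefect as exc:
    strict_error = type(exc).__name__

print("lax defects:", lax_defects)
print("strict error:", strict_error)
```

The same input and the same parser code serve both roles; only the 
policy decides whether a defect is noted or fatal.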

>  > Just to point out that good data can be obtained from bad email 
>  > messages, I think, and that that is a use case.
>
> But we already know that, and the basic idea of how to treat bad data
> (send it to a locked room without any supper).  No need to rehash
> that, AFAICS from your use case.
>   

Locked room is the first pass; unlocking it belongs to the heuristics, 
for determined clients.

The use case wasn't at http://wiki.python.org/moin/Email%20SIG/UseCases 
so I've added it there, as "Handling pathological data #2".

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking