[Email-SIG] fixing the current email module

Sun Oct 11 05:23:36 CEST 2009

Glenn Linderman writes:
 > On approximately 10/10/2009 6:59 AM, came the following characters from 
 > the keyboard of Stephen J. Turnbull:
 > > Glenn Linderman writes:
 > >  > On approximately 10/9/2009 8:10 AM, came the following characters from 
 > >  > the keyboard of Stephen J. Turnbull:

 > > correctly decoded data normally is stored, and is accessible in the
 > > same way.  But I gather that's not what you were talking about, my
 > > mistake.
 > 
 > Well, the client tells us where to store it, and we can't prevent it 
 > from being the same place.

Huh?  No way!  We decide where our data is stored.  This isn't C where
you pass around arbitrary pointers for efficiency.  In particular,
strings (whether Unicode or bytes) are not mutable.  So the client can
keep a copy if it likes, but once it hands us raw message text as
bytes, after that we decide where we put parsed pieces and/or slices
of the unparsed original.
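
A minimal sketch of that point, assuming a hypothetical RawMessage
class (the name and methods are illustrative only, not an existing or
proposed email package API).  Whatever the client hands in is frozen
into an immutable bytes object, and the package alone decides where
the pieces live:

```python
class RawMessage:
    """Hypothetical holder for raw message text (illustration only)."""

    def __init__(self, raw):
        # bytes() copies a mutable bytearray into a new immutable object;
        # for an existing bytes object no copy is needed, since it cannot
        # change underneath us anyway.
        self._raw = bytes(raw)
        # Split at the first blank line.  Where these pieces are kept is
        # the package's decision, not the client's.
        head, _sep, body = self._raw.partition(b"\r\n\r\n")
        self._header_block = head
        self._body = body

    def header_block(self):
        return self._header_block

    def body(self):
        return self._body


client_buffer = bytearray(b"Subject: test\r\n\r\nhello")
msg = RawMessage(client_buffer)
client_buffer[0:7] = b"SUBJECT"  # the client scribbles on its own copy only
```

The client's later mutation of its buffer does not reach the stored
header block or body.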

 > > So when you wrote about saving and converting to text form, without
 > > mentioning the specific APIs, I assumed you meant the "mainline"
 > > APIs for parsing and accessing parts of a correctly formatted message.
 > 
 > Mostly, I hadn't bothered about APIs yet;

You may not bother about APIs, but it sure looks like you do to me.
You can't talk about where to store stuff without touching the API.

 > I think that the email package should require that some special action 
 > needs to be taken by the client to request not-quite-perfect data, 
 > either a special parameter value, or different API, etc.

That's all I need to hear, until we're ready to write specs for that
API.  (Note that a special parameter value is part of the API in a
sense, if we specify and document what it means, so I tend to use API
for that, too, not just for whole functions.)

 > But there is nothing that says that some client might not pass that
 > all the time, and ignore the defect reports.  Whether that is easy
 > to identify or not, and whether the email package wants to require
 > that the normal APIs be tried before the not-quite-perfect APIs are
 > issues for discussion.

The answers are obvious to me: yes and no.  You can identify whether a
particular API has been used with standard text search tools like M-x
occur.  (For non-Emacsers, that is an Emacs command that finds all
occurrences of a particular string in the buffer.)  If a program wants
to call the quick & dirty APIs first, that's none of our business,
except that if parsing is being done lazily we should be careful to
update the defect list, so that the program can check them when it
wants to.

 > Ultimately, the email package cannot enforce that proper care is taken 
 > by the client; only code reviews of the client can encourage that.

My point is not to enforce anything, not even code reviews.  But by
having separate APIs for parsed and unparsed data, code review can be
made easier and more accurate.
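
A rough sketch of what "separate APIs plus a lazily updated defect
list" could look like.  Everything here is hypothetical: LazyHeader,
raw(), parsed() and the plain-string defect entries are invented for
illustration, and "parsing" is reduced to an ASCII decode to keep the
example short.

```python
class LazyHeader:
    """Hypothetical header wrapper; all names are illustrative only."""

    def __init__(self, raw_value, defects):
        self._raw = raw_value      # bytes exactly as received
        self._parsed = None
        self._checked = False
        self._defects = defects    # shared, message-level defect list

    def raw(self):
        """'Unparsed' API: always succeeds, promises nothing."""
        return self._raw

    def parsed(self):
        """'Parsed' API: parse on first use, recording any defect once."""
        if not self._checked:
            self._checked = True
            try:
                self._parsed = self._raw.decode("ascii")
            except UnicodeDecodeError:
                self._defects.append("undecodable bytes in header")
        if self._parsed is None:
            raise ValueError("header could not be parsed; see raw()")
        return self._parsed


defects = []
good = LazyHeader(b"hello", defects)
bad = LazyHeader(b"caf\xe9", defects)
```

Because the two access paths have different names, a reviewer can find
every use of the quick & dirty one with a plain text search, and the
defect list is filled in whenever the lazy parse actually happens.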

 > Yes, agreed.  And a special way or ways to get various algorithms for 
 > attempting to interpret not-quite-perfect data, when the client thinks 
 > that might be useful.

I don't think we should be talking about special ways (plural) or
"not-quite-perfect" data.  At this point in the design process, we
have *parsed* and *unparsed* data.  Heuristic algorithms for
recovering from unparsable input can be layered on top of these two
sets of APIs, when we have *real* use cases for them.  For example, I
don't think your use case of prepending a mailing list's topic or
serial number to an unparsable subject is realistic; in all lists I
know of such a message would be held for moderation, or even discarded
outright as spam.

And again:

 > Right.  And it is the more detailed structure that I was referring to... 

But why?  There is no need to discuss it at this point, and bringing
it up is confusing as all get-out.

 > How a particular email server interprets the "stuff before the @" is 
 > pretty much up to it... so as long as it does something appropriate, it 
 > can interpret all or a fraction of it as a mailbox name, or could it 
 > intuit a mailbox name from the body content if it wants, or even from a 
 > special header.  So yeah, particular interpretations of the address is 
 > non-RFC stuff.

Right.  To riff on the RFC vs. not theme ["Barry, pick up the bass
line, need more bottom here!"], I think we should pick a list of RFCs
we "promise" to implement as "defining" email; if we reserve any
structures as "too obscure for us to parse," we should say so (and
reference chapter and verse of the Holy RFC).  On the other hand, of
course as we discover common use cases for which precise
specifications can be given, we should be flexible and implement them.
But there should be no rush.

Which RFCs?

First of all, the STD 11 series (RFCs 733, 822, 2822, 5322).  Here we
have to worry about the standard's recommended format vs. the obsolete
format because of the Postel principle.  AFAIK, there is no reason not
to insist on *producing* strictly RFC 5322 conformant messages, but I
think we should implement both strict and lax parsers.  The lax parser
is for "daily use", the strict parser for validation.
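
For illustration only: later versions of the standard library's email
package expose a distinction of roughly this kind through policy
objects.  The sketch below uses email.policy.default (defects are
recorded on the message) and email.policy.strict (defects are raised).
It is an assumption that this corresponds to the lax/strict split
meant above; in particular it concerns defect handling, not RFC 5322
obsolete-syntax checking.

```python
import email
from email import errors, policy

RAW = (
    b"From: a@example.org\r\n"
    b"Content-Type: multipart/mixed\r\n"  # claims multipart, no boundary given
    b"\r\n"
    b"just some text\r\n"
)

# Lax: parsing succeeds and the problems are recorded on the message.
lax_msg = email.message_from_bytes(RAW, policy=policy.default)

# Strict: the first defect is raised as an exception instead.
try:
    email.message_from_bytes(RAW, policy=policy.strict)
    strict_ok = True
except errors.MessageDefect:
    strict_ok = False
```

After this runs, lax_msg.defects is non-empty and strict_ok is False,
so the same input serves both "daily use" and validation.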

Second, the basic MIME structure RFCs: 2045-2049, 2231.  (Some of
these have been at least partially superseded by now, I think.)

The mailing list header RFCs: 2369 and 2919.

Not RFCs, per se, but an auxiliary module should provide the
registered IANA data for the above RFCs.

Strictly speaking outside of the email module, but we make use of URLs
(RFC 3986 -- superseded?) and mimetypes data (this overlaps
substantially with the "registered IANA data").  We need to coordinate
with the responsible maintainers for those.

Ditto coordinating with modules that we share a lot of structure with,
the "not email but very similar" like HTTP (RFC 2616), and netnews
(NNTP = RFC 3977, and RFC 1036).

Which extensions?

Er, don't you think the above is enough for now?<wink>

 > Just to point out that good data can be obtained from bad email 
 > messages, I think, and that that is a use case.

But we already know that, and the basic idea of how to treat bad data
(send it to a locked room without any supper).  No need to rehash
that, AFAICS from your use case.

 > The only context for parsing MIME parts that I'm aware of is that when 
 > determining the end of a nested MIME part,

Indeed, but this is Postel principle stuff, not about parsing correct
syntax.  First we need to decide what to do with correct syntax, then
come up with belt and suspenders algorithms for broken mail.

 > The only other cross-part context that I am aware of is Content-ID 
 > references.  That doesn't affect parsing, but rather semantic 
 > interpretation, after parsing, validation, and decoding is complete.

I wasn't thinking of those, but that's a good point.  Those will need
to be kept in a mapping at a higher level of the representation,
probably top-level, I guess.
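
A small sketch of such a top-level mapping, written against the
standard library parser.  The helper name build_cid_index and the
sample message are made up for illustration; only
email.message_from_bytes, Message.walk, Message.get and
Message.get_content_type are real library calls.

```python
import email

RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b'<img src="cid:logo@example.org">\r\n'
    b"--XX\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-ID: <logo@example.org>\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"AAAA\r\n"
    b"--XX--\r\n"
)


def build_cid_index(msg):
    """Map each Content-ID (angle brackets stripped) to its MIME part."""
    index = {}
    for part in msg.walk():  # visits the root and every nested part
        cid = part.get("Content-ID")
        if cid is not None:
            index[cid.strip().strip("<>")] = part
    return index


msg = email.message_from_bytes(RAW)
cid_index = build_cid_index(msg)
```

The index is built once, after parsing, and lives beside the root
message, so resolving a cid: reference from any part is a single
dictionary lookup.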