[Email-SIG] Thoughts on the general API, and the Header API.

Glenn Linderman v+python at g.nevcal.com
Fri Jan 29 03:20:24 CET 2010


On approximately 1/25/2010 8:10 PM, came the following characters from 
the keyboard of Glenn Linderman:
>> That's true.  The Bytes and String versions of binary MIME parts,
>> which are likely to be the large ones, will probably have a common
>> representation for the payload, and could potentially point to the same
>> object.  That breaking of the expectation that 'encode' and 'decode'
>> return new objects (in analogy to how encode and decode of strings/bytes
>> works) might not be a good thing, though.
>
> Well, one generator could provide the expectation that everything is 
> new; another could provide different expectations.  The differences 
> between them, and the tradeoffs would be documented, of course, were 
> both provided.  I'm not convinced that treating headers and data 
> exactly the same at all times is a good thing... a convenient option 
> at times, perhaps, but I can see it as a serious inefficiency in many 
> use cases involving large data.
>
> This deserves a bit more thought/analysis/discussion, perhaps.  More 
> than I have time for tonight, but I may reply again, perhaps after 
> others have responded, if they do. 

I guess no one else is responding here at the moment.  Read the ideas 
below, and then afterward, consider building the APIs you've suggested 
on top of them.  And then, with the full knowledge that the messages may 
be in either fast or slow storage, I think you'll agree that 
converting the whole tree in one swoop isn't always appropriate... all 
the headers probably could be converted, but the data, because of its 
size, should probably be converted on demand.


In earlier discussions about the registry, there was the idea of having 
a registry for transport encoding handling, and a registry for MIME 
encoding handling.  There were also vague comments about doing an 
external storage protocol "somehow", but the concept was left to be 
defined later, or at least I don't recall any definitions.
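
To make the registry idea concrete, here is a minimal sketch of what 
the two registries might look like; all of the names here are my own 
invention, not an existing API, and the real design would presumably 
hang handlers off the message or policy objects:

```python
import base64
import quopri

# Hypothetical registries: one maps Content-Transfer-Encoding names to
# (decoder, encoder) callables, the other maps MIME content types to
# handler objects.  These names are illustrative assumptions.
transfer_encoding_registry = {}
mime_type_registry = {}

def register_transfer_encoding(name, decoder, encoder):
    """Associate decode/encode callables with a CTE name like 'base64'."""
    transfer_encoding_registry[name.lower()] = (decoder, encoder)

def register_mime_type(content_type, handler):
    """Associate a handler with a content type like 'text/plain'."""
    mime_type_registry[content_type.lower()] = handler

# The stdlib already supplies codecs for the two common encodings:
register_transfer_encoding('base64', base64.b64decode, base64.b64encode)
register_transfer_encoding('quoted-printable',
                           quopri.decodestring, quopri.encodestring)
```

A new encoding (or an external-storage-aware variant of an existing 
one) could then be plugged in without touching the parser itself.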

Given a raw bytes representation of an incoming email, mail servers need 
to choose how to handle it... this may need to be a dynamic choice based 
on current server load, as well as on static server resources and 
configured limits.

Unfortunately, the SMTP protocol does not require predeclaration of the 
size of the incoming DATA part, so servers cannot enforce size limits 
until they are exceeded.  So as the data streams in, a dynamic 
adjustment to the handling strategy might be appropriate.  Gateways may 
choose to route messages, and stall the input until the output channel 
is ready to receive it, and basically "pass through" the data, with 
limited need to buffer messages on disk... unless the output channel 
doesn't respond... then they might reject the message.  An SMTP server 
should be willing to act as a store-and-forward server, and also must do 
individual delivery of messages to each RCPT (or at least one per 
destination domain), so must have a way of dealing with large messages, 
probably via disk buffering.  The case of disk buffering and retrying 
generally means that the whole message, not just the large data parts, 
must be stored on disk, so the external storage protocol should be able 
to deal with that case.

The minimal external storage format capability is to store the received 
bytestream to disk, associate it with the envelope information, and be 
able to retrieve it in whole later.  This would require having the whole 
thing in RAM at those two points in time, however, and doesn't solve the 
real problem.  Incremental writing and reading to the external storage 
would be much more useful.  Even more useful, would be "partially 
parsed" seek points.
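
The minimal capability plus incremental access might look something 
like the following sketch; the class and method names are assumptions, 
and `SpooledTemporaryFile` is used only because it conveniently keeps 
small messages in RAM and spills large ones to disk:

```python
import tempfile

class ExternalStore:
    """Sketch of an external storage object for one message."""

    def __init__(self, spool_limit=1 << 20):
        # Stays in memory until spool_limit bytes, then moves to disk.
        self._buf = tempfile.SpooledTemporaryFile(max_size=spool_limit)
        self.envelope = None   # associated envelope information, if any

    def write_chunk(self, data):
        """Incrementally append bytes as they stream in from SMTP DATA."""
        self._buf.write(data)

    def read_chunks(self, size=8192):
        """Incrementally read the stored bytestream back out."""
        self._buf.seek(0)
        while True:
            chunk = self._buf.read(size)
            if not chunk:
                break
            yield chunk
```

Note that neither writing nor reading ever requires the whole message 
in RAM at once, which is the point of the incremental interface.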

An external storage system that provides "partially parsed" information 
could include:

1) envelope information.  This section is useful to SMTP servers, but 
not to other email tools, so it should be optional.  This could be a 
copy of the received RCPT command texts, complete with CRLF endings.

2) header information.  This would be everything between DATA and the 
first CRLF CRLF sequence.

3) data.  Pre-MIME this would simply be the rest of the message, but 
post-MIME it would be usefully more complex.  If MIME headers can be 
observed and parsed as the data passes through, then additional metadata 
could be saved that could enhance performance of the later processing 
steps.  Such additional metadata could include the beginning of each 
MIME part, the end of the headers for that part, and the end of the data 
for that part.

The result of saving that information would mean that minimal data (just 
the headers) would need to be read in to create a tree representing the 
email; the rest could be left in external storage until it is accessed... 
and then obtained directly from there when needed, and converted to the 
form required by the request... either the whole part, or some piece of 
it in a buffer.
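
One way the "partially parsed" seek points might be recorded is as byte 
offsets into the stored message, one set per MIME part, so that a tree 
can be built from the headers alone while the payloads stay in external 
storage.  The names below are illustrative assumptions, not a proposal 
for the actual API:

```python
from dataclasses import dataclass, field

@dataclass
class PartOffsets:
    part_start: int    # offset of the part's first header byte
    headers_end: int   # offset just past the blank line ending the headers
    data_end: int      # offset just past the part's payload

@dataclass
class MessageIndex:
    envelope_end: int = 0   # end of the optional envelope section
    headers_end: int = 0    # end of the top-level headers
    parts: list = field(default_factory=list)  # one PartOffsets per part

    def header_bytes(self, raw, part):
        """Read only a part's headers; its payload is never touched."""
        return raw[part.part_start:part.headers_end]
```

Recording these offsets as the data streams through the parser costs 
almost nothing, but later lets a consumer seek straight to one part's 
payload instead of re-reading the whole message.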

So there could be a variety of external storage systems... one that 
stores in memory, one that stores on disk per the ideas above, and a 
variety that retain some amount of cached information about the email, 
even though they store it all on disk.  Sounds like this could be a 
plug-in, or an attribute of a message object creation.

But to me, it sounds like the foundation upon which the whole email lib 
should be built, not something that is shoveled in later.

A further note about access to data parts... clearly "data for the whole 
MIME part" could be provided, but even for a single part that could be 
large.  So access to smaller chunks might be desired.

The data access/conversion functions, therefore, should support a 
buffer-at-a-time access interface.  Base64 supports random access 
easily, unless it contains characters outside the 64-character 
alphabet that are to be ignored, which could throw off the size 
calculations.  So maybe providing sequential buffer-at-a-time access 
with rewind is the best that can be done -- quoted-printable doesn't 
support random access very well, and neither would some sort of 
compression or encryption technique, since those usually like to start 
from the beginning -- and those are the sorts of things I would 
consider likely to be standardized in the future, to reduce the size 
of the payload and to increase its security.
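
A sequential buffer-at-a-time reader with rewind might look like the 
sketch below, for the base64 case; the interface names are my own 
assumptions.  Whitespace is stripped up front, and chunks are taken in 
multiples of 4 encoded characters so that each buffer decodes 
independently:

```python
import base64

class Base64Reader:
    """Sketch: sequential buffer-at-a-time base64 decoding with rewind."""

    def __init__(self, encoded, chunk_chars=4096):
        assert chunk_chars % 4 == 0   # 4 encoded chars -> 3 decoded bytes
        # Strip CRLFs and other whitespace so chunking stays aligned.
        self._data = b''.join(encoded.split())
        self._chunk = chunk_chars
        self._pos = 0

    def rewind(self):
        """Restart from the beginning, as plain sequential access allows."""
        self._pos = 0

    def read_buffer(self):
        """Decode and return the next buffer, or b'' at end of data."""
        raw = self._data[self._pos:self._pos + self._chunk]
        self._pos += self._chunk
        return base64.b64decode(raw) if raw else b''
```

Quoted-printable, compression, or encryption would fit the same 
read_buffer/rewind interface, just without any hope of seeking.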

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


