[Email-SIG] Thoughts on the general API, and the Header API.
Glenn Linderman
v+python at g.nevcal.com
Fri Jan 29 03:20:24 CET 2010
On approximately 1/25/2010 8:10 PM, came the following characters from
the keyboard of Glenn Linderman:
>> That's true. The Bytes and String versions of binary MIME parts,
>> which are likely to be the large ones, will probably have a common
>> representation for the payload, and could potentially point to the same
>> object. That breaking of the expectation that 'encode' and 'decode'
>> return new objects (in analogy to how encode and decode of strings/bytes
>> works) might not be a good thing, though.
>
> Well, one generator could provide the expectation that everything is
> new; another could provide different expectations. The differences
> between them, and the tradeoffs would be documented, of course, were
> both provided. I'm not convinced that treating headers and data
> exactly the same at all times is a good thing... a convenient option
> at times, perhaps, but I can see it as a serious inefficiency in many
> use cases involving large data.
>
> This deserves a bit more thought/analysis/discussion, perhaps. More
> than I have time for tonight, but I may reply again, perhaps after
> others have responded, if they do.
I guess no one else is responding here at the moment. Read the ideas
below, and then afterward, consider building the APIs you've suggested
on top of them. And then, with the full knowledge that the messages may
be in either fast or slow storage, I think you'll agree that
converting the whole tree in one swoop isn't always appropriate... the
headers probably could all be converted; the data, because of its size,
should probably be converted on demand.
In earlier discussions about the registry, there was the idea of having
a registry for transport encoding handling, and a registry for MIME
encoding handling. There were also vague comments about doing an
external storage protocol "somehow", but it was a vague concept to be
defined later, or at least I don't recall any definitions.
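Such a registry might be sketched as follows (the names `transfer_decoders` and `decode_payload` are my own inventions for illustration, not anything actually proposed): a mapping from Content-Transfer-Encoding names to decode callables, with a parallel registry conceivable for MIME-type handlers.

```python
import base64
import quopri

# Hypothetical registry: Content-Transfer-Encoding name -> decoder.
# A parallel registry could map MIME types to content handlers.
transfer_decoders = {
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
    "7bit": lambda payload: payload,   # identity encodings
    "8bit": lambda payload: payload,
}

def decode_payload(cte, payload):
    """Look up the handler registered for a transfer encoding and apply it."""
    try:
        decoder = transfer_decoders[cte.lower()]
    except KeyError:
        raise LookupError("no handler registered for %r" % cte)
    return decoder(payload)
```

The point of the registry is that new encodings can be plugged in without touching the parser itself.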
Given a raw bytes representation of an incoming email, mail servers need
to choose how to handle it... this may need to be a dynamic choice based
on current server load, as well as static server resources and
configured limits.
Unfortunately, the SMTP protocol does not require predeclaration of the
size of the incoming DATA part, so servers cannot enforce size limits
until they are exceeded. So as the data streams in, a dynamic
adjustment to the handling strategy might be appropriate. Gateways may
choose to route messages, stalling the input until the output channel
is ready to receive them and basically "passing through" the data, with
limited need to buffer messages on disk... unless the output channel
doesn't respond, in which case they might reject the message. An SMTP
server should be willing to act as a store-and-forward server, and must
also deliver each message individually to each RCPT (or at least once
per destination domain), so it must have a way of dealing with large
messages, probably via disk buffering. Disk buffering and retrying
generally mean that the whole message, not just the large data parts,
must be stored on disk, so the external storage protocol should be able
to deal with that case.
The minimal external storage capability is to store the received
bytestream to disk, associate it with the envelope information, and be
able to retrieve it whole later. This would require having the whole
thing in RAM at those two points in time, however, and doesn't solve the
real problem. Incremental writing and reading to the external storage
would be much more useful. Even more useful would be "partially
parsed" seek points.
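As a rough sketch of what incremental writing and reading might look like (all names here are hypothetical, and an in-memory buffer stands in for the disk backing):

```python
import io

class ExternalStore:
    """Hypothetical incremental message store: envelope information plus
    a bytestream that can be written and read back a chunk at a time."""

    def __init__(self):
        self._buf = io.BytesIO()   # stand-in for a disk-backed file
        self.envelope = []         # e.g. raw RCPT command texts

    def add_envelope_line(self, line):
        self.envelope.append(line)

    def write_chunk(self, data):
        # called as DATA streams in; the whole message never needs
        # to be held in RAM at once
        self._buf.write(data)

    def read_chunks(self, size=8192):
        # incremental retrieval for a later delivery attempt
        self._buf.seek(0)
        while True:
            chunk = self._buf.read(size)
            if not chunk:
                break
            yield chunk
```

The same interface works whether the backing is RAM or disk, which is what lets the handling strategy be chosen dynamically.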
An external storage system that provides "partially parsed" information
could include:
1) envelope information. This section is useful to SMTP servers, but
not to other email tools, so it should be optional. This could be a copy
of the received RCPT command texts, complete with CRLF endings.
2) header information. This would be everything between DATA and the
first CRLF CRLF sequence.
3) data. Pre-MIME this would simply be the rest of the message, but
post-MIME it would be usefully more complex. If MIME headers can be
observed and parsed as the data passes through, then additional metadata
could be saved that could enhance performance of the later processing
steps. Such additional metadata could include the beginning of each
MIME part, the end of the headers for that part, and the end of the data
for that part.
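One way to capture those seek points as the data passes through is to record byte offsets. A simplified sketch (the name `scan_offsets` is invented; it assumes CRLF line endings and a single-level multipart):

```python
def scan_offsets(raw, boundary):
    """Record byte offsets of interesting seek points in a raw MIME
    message: end of the top-level headers, and for each part delimited
    by `boundary`, a (start, header_end, end) triple."""
    marks = {"header_end": raw.index(b"\r\n\r\n") + 4, "parts": []}
    delim = b"--" + boundary
    pos = raw.index(delim)                        # first boundary line
    while True:
        start = raw.index(b"\r\n", pos) + 2       # part begins after boundary line
        hdr_end = raw.index(b"\r\n\r\n", start) + 4
        nxt = raw.index(delim, hdr_end)           # next (or closing) boundary
        marks["parts"].append((start, hdr_end, nxt))
        if raw[nxt + len(delim):nxt + len(delim) + 2] == b"--":
            break                                 # closing boundary: --boundary--
        pos = nxt
    return marks
```

With such offsets saved alongside the bytestream, a later reader can seek straight to a part's headers or data without re-parsing the message.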
The result of saving that information would be that minimal data (just
the headers) would need to be read in to create a tree representing the
email; the rest could be left in external storage until it is
accessed... and then obtained directly from there when needed, and
converted to the form required by the request... either the whole part,
or some piece of it in a buffer.
So there could be a variety of external storage systems... one that
stores in memory, one that stores on disk per the ideas above, and a
variety that retain some amount of cached information about the email,
even though they store it all on disk. Sounds like this could be a
plug-in, or an attribute of a message object creation.
But to me, it sounds like the foundation upon which the whole email lib
should be built, not something that is shoveled in later.
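A plug-in of that sort might be expressed as an abstract storage interface that message objects are created against. A minimal sketch, with invented names, showing an in-memory and a disk-backed variant behind the same interface:

```python
import tempfile
from abc import ABC, abstractmethod

class MessageStore(ABC):
    """Hypothetical plug-in interface; the email lib would code against
    this and never assume which concrete store backs a message."""

    @abstractmethod
    def write(self, data):
        """Append raw bytes as they stream in."""

    @abstractmethod
    def read(self, offset, size):
        """Random-access read of previously written bytes."""

class MemoryStore(MessageStore):
    def __init__(self):
        self._data = bytearray()

    def write(self, data):
        self._data.extend(data)

    def read(self, offset, size):
        return bytes(self._data[offset:offset + size])

class DiskStore(MessageStore):
    def __init__(self):
        self._f = tempfile.TemporaryFile()

    def write(self, data):
        self._f.seek(0, 2)          # append at end of file
        self._f.write(data)

    def read(self, offset, size):
        self._f.seek(offset)
        return self._f.read(size)
```

A cached-on-disk variant would just be a third subclass; the message object creation would take the store as an attribute, as suggested above.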
A further note about access to data parts... clearly "data for the whole
MIME part" could be provided, but even for a single part that could be
large. So access to smaller chunks might be desired.
The data access/conversion functions, therefore, should support a
buffer-at-a-time access interface. Base64 supports random access
easily, unless the payload contains characters outside the
64-character alphabet that must be ignored, which throws off the size
calculations. So perhaps sequential buffer-at-a-time access with rewind
is the best that can be done -- quoted-printable doesn't support random
access very well, and neither would compression or encryption
techniques, which usually have to start from the beginning -- and those
are the sorts of things I would consider likely to be standardized in
the future, to reduce the size of the payload and to increase its
security.
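Sequential access with rewind could be as small as the following sketch (`Base64Reader` is a hypothetical name). Note that stripping the ignorable whitespace up front is exactly the step that makes offset arithmetic on the raw encoded text unreliable:

```python
import base64

class Base64Reader:
    """Sequential buffer-at-a-time access with rewind over a base64
    payload: decodes a bounded number of bytes per call."""

    def __init__(self, encoded):
        # drop the to-be-ignored characters (CRLF etc.), keeping only
        # the alphabet and padding; this is what breaks random access
        self._enc = b"".join(encoded.split())
        self._pos = 0

    def rewind(self):
        self._pos = 0

    def read(self, nbytes=3072):
        # round up to a whole number of 4-char quanta (3 bytes each),
        # so every chunk is independently decodable
        quanta = (nbytes + 2) // 3
        chunk = self._enc[self._pos:self._pos + 4 * quanta]
        self._pos += len(chunk)
        return base64.b64decode(chunk)
```

A quoted-printable or encrypted payload could offer the same `read`/`rewind` interface, just without the ability to skip ahead cheaply.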
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking