[Email-SIG] Design Thoughts Summary

Mon Jan 4 03:25:56 CET 2010

On 11/15/2009 1:01 PM, Barry Warsaw wrote:
> On Nov 14, 2009, at 5:12 PM, Matthew Dixon Cowles wrote:
>
>> Thank you. I am virtually 100% in agreement that this document
>> represents what people have agreed on and that it represents what is
>> sensible to do.
>
> As am I. Fantastic work in pulling this all together David.
>
> I'm a bit slammed right now, but a quick comment...
>
>>> * The API needs to at a minimum have hooks available for an
>>> application to store data on disk rather than holding everything in
>>> memory.
>>
>> I remain unconvinced that this is worth the trouble. Yes, the Twisted
>> folks say that they can't use the email module because they may be
>> receiving hundreds of messages at once. But can anyone do anything
>> with hundreds of messages at once other than write them to disk?
>>
>> And would anything actually be improved by reading hundreds of files
>> at once, in small chunks, looking for MIME separators?
>
> Mailman has a similar problem. Even if we get just a few big messages,
> they can crush the system. You could argue that the MTA should just
> block messages with 50MB bodies if the underlying Mailman code can't
> handle it, but I still think we can do better.
>
> I think it would be fine if all the headers and MIME structure were kept in
> memory. But I do think we just want to be able to never
> store the raw body content in memory (perhaps unless needed, on demand).
> Mailman for example rarely cares about the bytes of say an image/jpeg body.

For what it's worth, I've also experienced the same "crushing blow" caused by
large messages in memory. In my case, I immediately dumped all messages to a
database (unfortunately, SQL), extracted the essential metadata I needed for my
application, and kept it in the record so I could index and search on it. I also
stored both the raw message and the processed message in the database. The
reason was that I wanted to be able to analyze the raw message if something
failed (usually a Unicode failure) and to retrieve the e-mail object from its
JSON container for quick(er) processing than I would get by parsing the raw
message again (and again).
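
As a rough illustration of the approach described above (not the code that was
actually used), the sketch below stores the raw bytes, a few extracted metadata
columns, and a JSON form of the parsed message in SQLite. It assumes today's
standard-library email and sqlite3 modules; the table layout, column names, and
the helper names open_store, store_message, and load_processed are all
hypothetical.

```python
import email
import json
import sqlite3
from email import policy

# Hypothetical schema: metadata columns for searching, plus both message forms.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY,
    message_id TEXT,
    sender     TEXT,
    subject    TEXT,
    raw        BLOB NOT NULL,  -- original bytes, kept for analysis after a failure
    processed  TEXT NOT NULL   -- JSON form, so the raw message is not re-parsed
);
CREATE INDEX IF NOT EXISTS idx_messages_sender ON messages (sender);
"""


def open_store(path=":memory:"):
    """Open (or create) the message database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_message(conn, raw_bytes):
    """Parse once, then store the raw bytes, metadata, and JSON form together."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    processed = {
        # Repeated header names collapse to the last value in this simple form.
        "headers": {name: str(value) for name, value in msg.items()},
        "body": body.get_content() if body is not None else "",
    }
    cur = conn.execute(
        "INSERT INTO messages (message_id, sender, subject, raw, processed) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            str(msg.get("Message-ID", "")),
            str(msg.get("From", "")),
            str(msg.get("Subject", "")),
            raw_bytes,
            json.dumps(processed),
        ),
    )
    conn.commit()
    return cur.lastrowid


def load_processed(conn, row_id):
    """Return the stored JSON form as a dict, or None if the row is missing."""
    row = conn.execute(
        "SELECT processed FROM messages WHERE id = ?", (row_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keeping the raw column alongside the processed one is what allows a failed
parse to be investigated later without asking the sender for the message again.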

This experience makes me a supporter of an e-mail module that has a storage 
container object that can be searched by any number of metadata fields. These
metadata fields would consist of internal (to the message) data sources and 
external data sources. I believe it would be necessary to specify what 
searchable fields you want before creating the storage container.

I hope that it would be possible to make the storage container independent of
the backend storage technology, so that people like me who will detest SQL until
the heat death of the universe can use something else to store mail messages. I
would also recommend not depending on the file system because, in my experience,
performance declined dramatically around 500 messages (ext3 and jfs). Even
though it meant using an SQL database (SQLite), the database was significantly
faster.
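
One possible shape for such a container, sketched here purely as an
illustration: the searchable fields are declared when the container is created,
and the storage backend sits behind a small interface so it can be swapped. None
of these names (Backend, MemoryBackend, MessageStore) exist in the email module;
they are hypothetical, and the in-memory backend only stands in for a real one.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical interface a storage technology would have to implement."""

    @abstractmethod
    def put(self, key, raw, fields):
        """Store raw message bytes under key, with its searchable field values."""

    @abstractmethod
    def get(self, key):
        """Return the raw message bytes stored under key."""

    @abstractmethod
    def find(self, field, value):
        """Return the keys of all messages whose field equals value."""


class MemoryBackend(Backend):
    """Stand-in backend; a database or other store could replace it."""

    def __init__(self):
        self._raw = {}
        self._index = {}  # (field, value) -> set of message keys

    def put(self, key, raw, fields):
        self._raw[key] = raw
        for field, value in fields.items():
            self._index.setdefault((field, value), set()).add(key)

    def get(self, key):
        return self._raw[key]

    def find(self, field, value):
        return sorted(self._index.get((field, value), ()))


class MessageStore:
    """Container whose searchable fields are fixed at creation time."""

    def __init__(self, backend, searchable_fields):
        self._backend = backend
        self._fields = frozenset(searchable_fields)

    def add(self, key, raw, **metadata):
        # Metadata may come from the message itself or from external sources.
        unknown = set(metadata) - self._fields
        if unknown:
            raise KeyError(f"undeclared fields: {sorted(unknown)}")
        self._backend.put(key, raw, metadata)

    def search(self, field, value):
        if field not in self._fields:
            raise KeyError(f"undeclared field: {field}")
        return self._backend.find(field, value)

    def fetch(self, key):
        return self._backend.get(key)
```

With this split, code that calls MessageStore never sees whether the messages
live in SQLite, a key-value store, or something else entirely.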

Thanks to all who are working on this project. I wish I could participate more,
but life has other plans for me.