[Email-SIG] Handling large emails: DiskMessage and DiskFeedParser

Barry Warsaw barry at python.org
Sat Oct 9 22:36:13 CEST 2004


[Following up to the list, with permission -BAW]

> > We define a new protocol whereby if the message object returned by the
> > factory has the following three methods, we use those when capturing the
> > payload of non-MIME messages.  If not, then we capture those lines in an
> > internal list object just like normal, calling set_payload() at the
> > end.  The methods are:
> > 
> > def storage_open(self)
> > def storage_write(self, data)
> > def storage_close(self)
> 
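
The optional protocol described above can be sketched roughly as follows. This is only an illustration of the dispatch, not the code from the patch: capture_body() is a hypothetical helper name, and the storage_* methods come from the proposal in this thread rather than from the released email package.

```python
from email.message import Message


def capture_body(msg, lines):
    """Hand body lines to msg, using the storage_* hooks only if all three exist.

    capture_body() is a hypothetical helper for illustration; the real parser
    would do this internally while consuming the body of a non-MIME message.
    """
    hooks = ("storage_open", "storage_write", "storage_close")
    if all(callable(getattr(msg, name, None)) for name in hooks):
        msg.storage_open()
        try:
            for line in lines:
                msg.storage_write(line)
        finally:
            # Always give the message a chance to release its resources.
            msg.storage_close()
    else:
        # Default behaviour: collect the lines and set the payload at the end.
        msg.set_payload("".join(lines))


# A plain Message has none of the hooks, so it takes the default path.
plain = Message()
capture_body(plain, ["first line\n", "second line\n"])
```

Checking for the methods at run time keeps existing Message subclasses working unchanged.
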
> This generally seems good. I like the idea of having this protocol 
> optionally invoked if the methods are there. The one thing missing for 
> me is a "read" API. The whole point of this for me was so the message 
> payloads never touch RAM and in order to do this you need to be able to 
> read the payload out in chunks.
> 
> This of course requires support in the Generator class too. It would 
> need to read message payloads out in chunks when rebuilding a message.

Dang, you're right of course.  This would mean we'd have to add a
storage_read() method, and I think we'd only need to hook this into
Generator._handle_text(), but it might get tricky when the payload needs
to be body_encode()'d because the message has a charset.
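
To make the read side a bit more concrete, here is a minimal sketch of what such a hook might look like from the generator's point of view. Everything in it is an assumption: write_body() is a hypothetical stand-in for the relevant part of Generator._handle_text(), and storage_read(chunk_size) is assumed to return an iterator of string chunks. It deliberately ignores the body_encode() / charset case mentioned above.

```python
import io
from email.message import Message


def write_body(msg, out, chunk_size=8192):
    """Copy a text body to `out`, chunk by chunk when the message allows it.

    Hypothetical helper: assumes storage_read(chunk_size) yields str chunks.
    Charset handling (body_encode) is intentionally left out of this sketch.
    """
    reader = getattr(msg, "storage_read", None)
    if callable(reader):
        for chunk in reader(chunk_size):
            out.write(chunk)
    else:
        payload = msg.get_payload()
        if not isinstance(payload, str):
            raise TypeError("string payload expected")
        out.write(payload)


class ChunkedMessage(Message):
    """Toy message whose body is served in slices (stands in for disk storage)."""

    def __init__(self, text):
        super().__init__()
        self._text = text

    def storage_read(self, chunk_size):
        for start in range(0, len(self._text), chunk_size):
            yield self._text[start:start + chunk_size]


out = io.StringIO()
write_body(ChunkedMessage("hello world\n"), out, chunk_size=4)
```
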

I'm beginning to think that we don't have enough time to adequately
prototype and test this for Python 2.4.  In the meantime, I've put a
patch on SF that you can start playing with.  It supports the write
protocol but not the read protocol, so it's incomplete, but it might be
enough to take further if you're interested.

http://sourceforge.net/tracker/index.php?func=detail&aid=1043706&group_id=5470&atid=305470

> > So you could use something like this (from the unit test):
> > 
> > class ExternalStorageMessage(Message):
> >     def storage_open(self):
> >         fd, self._path = tempfile.mkstemp()
> >         self._fp = os.fdopen(fd, 'w')
> > 
> >     def storage_write(self, data):
> >         self._fp.write(data)
> > 
> >     def storage_close(self):
> >         self._fp.close()
> >         fp = open(self._path)
> >         payload = fp.read()
> >         self.set_payload(payload)
>            ^^^^^^^^
> Looks good except why do you read the whole payload into memory 
> here? Isn't it the goal not to do this? Am I missing something?

Just that this is only an example.  In a real application you wouldn't
want to do it this way.
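
For comparison, a variant that leaves the payload on disk might look something like the sketch below. It is not part of the patch: storage_read() and storage_discard() are hypothetical additions, and storage_read() is assumed to yield string chunks.

```python
import os
import tempfile
from email.message import Message


class OnDiskMessage(Message):
    """Keeps the body in a temporary file instead of calling set_payload()."""

    def storage_open(self):
        fd, self._path = tempfile.mkstemp()
        # newline="" avoids newline translation, so data round-trips exactly.
        self._fp = os.fdopen(fd, "w", encoding="utf-8", newline="")

    def storage_write(self, data):
        self._fp.write(data)

    def storage_close(self):
        # Only close the file; the payload is never loaded into memory here.
        self._fp.close()

    def storage_read(self, chunk_size=8192):
        """Hypothetical read hook: yield the stored body in chunks."""
        with open(self._path, encoding="utf-8", newline="") as fp:
            while True:
                chunk = fp.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    def storage_discard(self):
        """Hypothetical cleanup hook: remove the temporary file."""
        os.remove(self._path)


msg = OnDiskMessage()
msg.storage_open()
msg.storage_write("line one\n")
msg.storage_write("line two\n")
msg.storage_close()
body = "".join(msg.storage_read(chunk_size=4))
msg.storage_discard()
```

Note that get_payload() stays empty with this approach, so anything that reads the payload directly would need to learn about the read hook.
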

> An alternative solution I've been thinking of... what if we abstract 
> message payloads to a "Payload" class? We could have MemoryPayload for 
> in-memory storage (the default), TmpFilePayload for temporary disk 
> storage etc etc. The read/write interface to the payload would always be 
> the same and all Message methods would only ever access the payload via 
> the API. Each Message instance would have exactly one MessagePayload 
> instance internally. I realise this would be a big change and probably 
> isn't suited for Python 2.4 but do you think this is useful?

It might be the right way to do it, much like headers can be strings or
instances of Header.  I don't think we can really do either for Python
2.4, but we can continue to pursue this for email 3.1 / Python 2.5.
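
A rough sketch of what such a payload abstraction could look like is below. The class names MemoryPayload and TmpFilePayload are the ones suggested above; the write() / read() / close() interface is purely an assumption for illustration, and nothing like this exists in the email package today.

```python
import io
import tempfile


class MemoryPayload:
    """Default storage: keeps the payload in a list of strings."""

    def __init__(self):
        self._chunks = []

    def write(self, data):
        self._chunks.append(data)

    def read(self, chunk_size=8192):
        data = "".join(self._chunks)
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]

    def close(self):
        self._chunks = []


class TmpFilePayload:
    """Disk storage: same interface, backed by an anonymous temporary file."""

    def __init__(self):
        self._fp = tempfile.TemporaryFile("w+", encoding="utf-8", newline="")

    def write(self, data):
        # Always append, even if an earlier read left the position elsewhere.
        self._fp.seek(0, io.SEEK_END)
        self._fp.write(data)

    def read(self, chunk_size=8192):
        self._fp.seek(0)
        while True:
            chunk = self._fp.read(chunk_size)
            if not chunk:
                break
            yield chunk

    def close(self):
        self._fp.close()


# Both classes can be used interchangeably by code that only sees the API.
results = {}
for cls in (MemoryPayload, TmpFilePayload):
    payload = cls()
    payload.write("abc")
    payload.write("defg")
    results[cls.__name__] = list(payload.read(chunk_size=3))
    payload.close()
```
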

-Barry
