From pytonic at i2pmail.org  Mon May 25 21:04:34 2015
From: pytonic at i2pmail.org (PyTonic)
Date: Mon, 25 May 2015 19:04:34 +0000
Subject: [Email-SIG] Proposal: feedparser.py stream large attachments
Message-ID: 

I'd like to propose a backwards-compatible change to feedparser.py that
optionally allows streaming of Message() payloads via subclassed
message.Message objects.

Currently, payload storage is implemented by creating a local list in
FeedParser(), appending incoming lines to that list, and finally joining
the list and setting the result as the payload of the Message() object
[1]. This works (and may actually be desirable) for smaller payloads,
but once you start dealing with payloads larger than, let's say, 20 MB
it becomes impractical, to say nothing of 500 MB payloads.

A 20 MB base64-encoded payload (the encoding adds about 33%) costs:

1) about (20 * 1.33) MB of memory in the local list inside FeedParser()
2) another (20 * 1.33) MB of RAM once it is set via self._cur.set_payload()

The first will be garbage collected at some point, but until then both
copies are held in memory at the same time. Once the user requests the
payload with get_payload(decode=True), it is again held twice:

3) in its encoded form, (20 * 1.33) MB, via self._cur._payload
4) in its decoded binary form, 20 MB, returned from the various wrappers
   around binascii.a2b_base64()

Thus, it would be useful to have an (optional) way to stream (and
decode/store) payloads so they are never held in memory all at once.
Since FeedParser supports a _factory keyword for using other kinds of
Message objects, 3) and 4) could be solved by overriding set_payload()
and get_payload() in a Message subclass. Sadly, this doesn't help much,
because 1) and 2) are buried deep inside the FeedParser itself. Another
drawback of overriding set_payload() and get_payload() in a subclass is
that the code may fall out of sync with the installed
email.message.Message class.

What follows is a possible solution I propose to overcome these issues.
It consists of two parts and should remain compatible with an unchanged
Message class, and thus with a default FeedParser() instance:

1) Allow optional keyword arguments to be passed to the FeedParser()
   constructor. They are saved and then passed on to new Message()
   objects.

2) A new streaming interface for Message objects, consisting of three
   additional callables:

   2.1) start_payload_chunks()
   2.2) append_payload_chunk(line)
   2.3) finalize_payload()

The patch [2] first checks for the availability of the new streaming
interface and falls back to the old code otherwise. This should allow
the FeedParser to work both with existing subclasses of message.Message
and with new subclasses that implement the streaming interface. Please
note that the diff is based on Python 2.7 as shipped with Debian Wheezy;
the current implementation uses the same problematic code for 2), though
(see [1]).

I will post another message containing a simple use case for the new
interface which streams, decodes and stores only base64-encoded payloads
on the fly and uses the old method for everything else. It additionally
uses two more callables in its Message subclass: get_payload_file() and
is_streamed().

It also contains some comments about unresolved issues, such as how
decoding errors should properly be dealt with, and who is responsible
for catching exceptions raised by the new interface so they can't break
the FeedParser itself.

This patch is mostly meant to present my idea for working around the
current "all-in-RAM" situation (in Python 2 and Python 3).
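To make the shape of the proposed interface concrete, here is a minimal
sketch of a subclass that spools raw payload lines to a temporary file
instead of a list. This is illustrative only (sticking with the
Python 2 code the diff targets): the SpooledTemporaryFile choice and the
read-back in finalize_payload() merely show the data flow, they are not
part of the proposal.

import tempfile
from email import message

class SpoolingMessage(message.Message):
    ''' Illustrative sketch of the proposed streaming interface. '''

    def start_payload_chunks(self):
        # Keep small payloads in memory, spill to disk past 1 MB.
        self._spool = tempfile.SpooledTemporaryFile(max_size=1 << 20)

    def append_payload_chunk(self, line):
        self._spool.write(line)

    def finalize_payload(self):
        self._spool.seek(0)
        # A real implementation would keep the file handle around;
        # reading everything back here defeats the purpose and only
        # demonstrates where the data ends up.
        self.set_payload(self._spool.read())
        self._spool.close()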
Comments, criticism and suggestions on how to proceed to get a feature
like this merged (and, in the best case, also backported to Python 2)
are more than welcome.

[1] https://hg.python.org/cpython/file/78986c99dd6c/Lib/email/feedparser.py#l462

[2] feedparser.diff:

--- /usr/lib/python2.7/email/feedparser.py	2014-03-13 10:54:56.000000000 +0000
+++ feedparser_stream.py	2015-05-25 03:02:09.000000000 +0000
@@ -137,9 +137,10 @@
 class FeedParser:
     """A feed-style parser of email."""
 
-    def __init__(self, _factory=message.Message):
+    def __init__(self, _factory=message.Message, **kwargs):
         """_factory is called with no arguments to create a new message obj"""
         self._factory = _factory
+        self._factory_kwargs = kwargs
         self._input = BufferedSubFile()
         self._msgstack = []
         self._parse = self._parsegen().next
@@ -175,7 +176,7 @@
         return root
 
     def _new_message(self):
-        msg = self._factory()
+        msg = self._factory(**self._factory_kwargs)
         if self._cur and self._cur.get_content_type() == 'multipart/digest':
             msg.set_default_type('message/rfc822')
         if self._msgstack:
@@ -420,6 +421,22 @@
             return
         # Otherwise, it's some non-multipart type, so the entire rest of the
         # file contents becomes the payload.
+
+        # Test for the message streaming interface.
+        if hasattr(self._cur, 'start_payload_chunks') \
+           and callable(self._cur.start_payload_chunks):
+            _cur = self._cur
+            _cur.start_payload_chunks()
+            for line in self._input:
+                if line is NeedMoreData:
+                    yield NeedMoreData
+                    continue
+                _cur.append_payload_chunk(line)
+            _cur.finalize_payload()
+            return
+
+        # Streaming interface not available:
+        # fall back to the legacy all-in-RAM code.
         lines = []
         for line in self._input:
             if line is NeedMoreData:

From pytonic at i2pmail.org  Mon May 25 23:33:21 2015
From: pytonic at i2pmail.org (PyTonic)
Date: Mon, 25 May 2015 21:33:21 +0000
Subject: [Email-SIG] use_case.py
In-Reply-To: 
References: 
Message-ID: 

On 05/25/2015 07:04 PM, PyTonic wrote:
> I will post another message containing a simple use case for the new
> interface which only streams, decodes and stores base64 encoded
> payloads on the fly and uses the old method for everything else. It
> additionally uses two more callables inside its Message subclass:
> get_payload_file() and is_streamed().
>
> It also contains some comments about unresolved issues like how
> decoding errors should be properly dealt with. And who is responsible
> for catching exceptions raised by the new interfaces so they can't
> break the FeedParser itself.
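In short, the attached use case drives the patched parser like this
(the temp-file helper below is a made-up stand-in; the attachment
defines its own):

def make_payload_tmp(msg):
    # Application-supplied choice of where decoded payloads go;
    # this path is only an example.
    return open('/tmp/payload.bin', 'w+b')

parser = FeedParser(_factory=StreamAndDecodeOnlyBase64Message,
                    tmp_file_creator=make_payload_tmp)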
Attached as use_case.txt

-------------- next part --------------
from email import message
from feedparser_stream import FeedParser


class OldMessageList(message.Message):
    ''' Same as message.Message '''

    def start_payload_chunks(self):
        self._payload_buffer = list()

    def append_payload_chunk(self, line):
        self._payload_buffer.append(line)

    def finalize_payload(self):
        self.set_payload(''.join(self._payload_buffer))
        del self._payload_buffer


from cStringIO import StringIO

class OldMessagecStringIO(message.Message):
    ''' Same as message.Message but using cStringIO instead of list() '''

    def start_payload_chunks(self):
        self._payload_buffer = StringIO()

    def append_payload_chunk(self, line):
        self._payload_buffer.write(line)

    def finalize_payload(self):
        self.set_payload(self._payload_buffer.getvalue())
        self._payload_buffer.close()
        del self._payload_buffer


from binascii import a2b_base64

class StreamAndDecodeOnlyBase64Message(message.Message):

    def __init__(self, *args, **kwargs):
        '''
        This class does almost everything like the default
        message.Message class, but will decode and store base64
        payloads on the fly without storing the full payload in RAM
        (twice). If the CTE is not set to base64 it should behave as
        usual.

        For this to (somewhat) work:

        1) one should be able to pass kwargs to FeedParser().
           FeedParser then passes those to the supplied factory class
           when creating new message objects.
        2) FeedParser has to call the message object to add new payload
           lines instead of collecting them locally and then setting a
           str.

        See the changes in feedparser.py for an experimental version.

        There are some things left out, like catching decoding
        exceptions and adding those to defects. It's also unclear how
        to proceed in such a situation. Currently message.Message
        silently delivers the encoded parts in get_payload() if
        decoding fails. This is not the right thing to do if a user
        requests decoding.

        There is also no check that start_payload_chunks() was actually
        called before appending new lines.
        '''
        self._create_tmp_file = kwargs.pop(
            'tmp_file_creator',
            lambda msg: open('/tmp/bad_fallback', 'r+b')
        )
        message.Message.__init__(self, *args, **kwargs)

    def start_payload_chunks(self):
        if self.get('content-transfer-encoding', '').lower() == 'base64':
            assert(callable(self._create_tmp_file))
            self._payload_file = self._create_tmp_file(self)
            self._payload_file_start = self._payload_file.tell()
            self._left_over = ''
            self._is_base64 = True
            self.append_payload_chunk = self._append_payload_chunk_file
        else:
            self._payload_buffer = list()
            self._is_base64 = False
            self.append_payload_chunk = self._append_payload_chunk_memory

    def _append_payload_chunk_memory(self, line):
        self._payload_buffer.append(line)

    def _append_payload_chunk_file(self, line):
        # Base64 specific: decode only complete groups of four encoded
        # characters and carry the remainder over to the next call.
        line = self._left_over + line.rstrip()
        mod = (len(line) % 4) * -1
        if mod != 0:
            self._left_over = line[mod:]
            line = line[:mod]
        self._payload_file.write(a2b_base64(line))
        if mod == 0:
            self._left_over = ''

    def finalize_payload(self):
        if not self._is_base64:
            self.set_payload(''.join(self._payload_buffer))
            del self._payload_buffer
        else:
            '''
            It is unclear to me how get_payload() could be modified to
            deliver either a filename or a file object in this case
            without breaking existing code. Regardless, it should *not*
            hold the full decoded content in memory.

            len(self._left_over) > 0 should raise a decoding exception
            which should be added to defects. Not sure where, here or
            within FeedParser(). See the last part of the __init__
            comment.
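            (Worked example of the carry logic in
            _append_payload_chunk_file, added here for illustration:
            with _left_over == '' and an incoming encoded line of 10
            characters, mod is -2, so the first 8 characters are
            decoded right away and the last 2 are carried over to the
            next call. A non-empty _left_over at this point therefore
            means the total encoded length was not a multiple of 4,
            i.e. the payload was truncated or corrupted.)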
            '''
            self._payload_file.seek(self._payload_file_start)
            self.set_payload('')

    def is_streamed(self):
        return self._is_base64

    def get_payload_file(self):
        assert(self._is_base64)
        return self._payload_file


if __name__ == '__main__':
    import sys
    from hashlib import md5

    def _create_temporary_file(msg):
        ''' Just some test dummy '''
        return open('/tmp/some.very_large_payload', 'r+b')

    def show_parts(msg, level=0):
        ''' Just some debugging dummy '''
        _fmt = "{sp:\t<{lvl}}{part}: {mime} as {charset} via {encoding}:\t{hash}"
        _fmt_kw = {
            'sp': '',
            'lvl': level,
            'mime': msg.get_content_type(),
            'charset': msg.get_content_charset(),
            'encoding': msg.get('content-transfer-encoding', 'unknown')
        }
        if msg.is_multipart():
            print _fmt.format(part='multipart', hash='', **_fmt_kw)
            for part in msg.get_payload():
                show_parts(part, level=level + 1)
            return
        if hasattr(msg, 'is_streamed') and msg.is_streamed():
            _checksum = md5()
            with msg.get_payload_file() as _payload_file:
                while True:
                    _chunk = _payload_file.read(_checksum.block_size)
                    if len(_chunk) == 0:
                        break
                    _checksum.update(_chunk)
            _checksum = _checksum.hexdigest() + ' (streamed)'
        else:
            _checksum = md5(msg.get_payload(decode=True)).hexdigest()
        print _fmt.format(part='single part', hash=_checksum, **_fmt_kw)

    # Init two parser instances
    stream_parser = FeedParser(
        _factory=StreamAndDecodeOnlyBase64Message,
        tmp_file_creator=_create_temporary_file
    )
    default_parser = FeedParser()

    # And test
    for name, parser in (('default', default_parser), ('stream', stream_parser)):
        print "\nUsing %s parser:" % name
        with open(sys.argv[1], 'rb') as some_largish_mime_message:
            for line in some_largish_mime_message:
                parser.feed(line)
        msg = parser.close()
        show_parts(msg, level=1)

From rdmurray at bitdance.com  Tue May 26 16:11:48 2015
From: rdmurray at bitdance.com (R. David Murray)
Date: Tue, 26 May 2015 10:11:48 -0400
Subject: [Email-SIG] use_case.py
In-Reply-To: 
References: 
Message-ID: <20150526141148.CBC7DB18087@webabinitio.net>

On Mon, 25 May 2015 21:33:21 -0000, PyTonic wrote:
> On 05/25/2015 07:04 PM, PyTonic wrote:
> > I will post another message containing a simple use case for the new
> > interface which only streams, decodes and stores base64 encoded
> > payloads on the fly and uses the old method for everything else. It
> > additionally uses two more callables inside its Message subclass:
> > get_payload_file() and is_streamed().
> >
> > It also contains some comments about unresolved issues like how
> > decoding errors should be properly dealt with. And who is responsible
> > for catching exceptions raised by the new interfaces so they can't
> > break the FeedParser itself.

It is great that you are interested in working on this. Providing a way
to process large emails without the current crazy memory consumption is
a goal of mine, and we'll happily work with you toward making that a
reality.

However, anything along these lines is going to be a new feature, and
therefore can only target 3.6 at this point, so any patch proposals need
to be against the default branch of the cpython repository.

In Python 3 we now have the policy framework. I'm pretty sure it makes
sense to leverage that for the new internal API. I agree that feedparser
itself will need some changes in order to make all this work correctly.
(Also, feedparser has gotten a couple of non-trivial performance
enhancements in Python 3, so some of the code is different.)

Note that the generator needs similar changes, and that problem may be
much harder to solve, since the current algorithm is recursive and holds
*everything* in memory.
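For concreteness, the kind of thing I have in mind might look roughly
like this; every name here is invented for the sketch and would need
actual design work:

from email.policy import EmailPolicy
from email.feedparser import BytesFeedParser

class StreamingPolicy(EmailPolicy):
    # Hypothetical knobs: bodies larger than stream_threshold bytes
    # would be spooled into spool_dir instead of being accumulated in
    # memory by the parser. Neither attribute exists in the email
    # package today.
    stream_threshold = 1 << 20
    spool_dir = None

parser = BytesFeedParser(policy=StreamingPolicy(spool_dir='/var/tmp'))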
In addition, it seems like this would be a natural (even necessary?)
place to introduce the 'store bodies on disk' interface that we've
wanted for quite a while.

Can you take a look at the policy framework and reformulate your
proposal in light of that? We can certainly work on this one piece at a
time; we just want to keep all of the moving parts in mind while we do
so...

--David