From pytonic at i2pmail.org  Mon May 25 21:04:34 2015
From: pytonic at i2pmail.org (PyTonic)
Date: Mon, 25 May 2015 19:04:34 +0000
Subject: [Email-SIG] Proposal: feedparser.py stream large attachments
Message-ID: 

I'd like to propose a backwards-compatible change to feedparser.py that
optionally allows streaming of Message() payloads via subclassed
message.Message objects.

Currently, payload storage is implemented by creating a local list in
FeedParser(), appending incoming lines to that list, and finally joining
the list and setting the result as the payload of the Message() object
[1]. This works (and may actually be desirable) for smaller payloads,
but once you start dealing with payloads larger than, let's say, 20 MB
it becomes impractical, to say nothing of 500 MB payloads.

A 20 MB base64-encoded payload (the encoding adds about 33%) costs:

1) about (20 * 1.33) MB of memory in the local list inside FeedParser()
2) another (20 * 1.33) MB of RAM once it is set via self._cur.set_payload()

The first will be garbage collected at some point, but until then both
copies are held in memory at the same time. Once the user requests the
payload with get_payload(decode=True), it is again held twice:

3) in its encoded form, (20 * 1.33) MB, via self._cur._payload
4) in its decoded binary form, 20 MB, returned from the various wrappers
   around binascii.a2b_base64()

Thus, it would be useful to have an (optional) way to stream (and
decode/store) payloads so they are never held in memory all at once.
Since FeedParser supports a _factory keyword for using other kinds of
Message objects, 3) and 4) could be solved by overriding set_payload()
and get_payload() in a Message subclass. Sadly, this doesn't help much,
because 1) and 2) are buried deep inside the FeedParser itself. Another
drawback of overriding set_payload() and get_payload() in a subclass is
that the code may fall out of sync with the installed
email.message.Message class.

What follows is a possible solution I propose to overcome these issues.
It consists of two parts and should remain compatible with an unchanged
Message class, and thus with a default FeedParser() instance:

1) Allow optional keyword arguments to be passed to the FeedParser()
   constructor. They are saved and then passed on to new Message()
   objects.

2) A new streaming interface for Message objects, consisting of three
   additional callables:

   2.1) start_payload_chunks()
   2.2) append_payload_chunk(line)
   2.3) finalize_payload()

The patch [2] first checks for the availability of the new streaming
interface and falls back to the old code otherwise. This should allow
the FeedParser to work both with existing subclasses of message.Message
and with new subclasses that implement the streaming interface. Please
note that the diff is based on Python 2.7 as shipped with Debian Wheezy;
the current implementation uses the same problematic code for 2), though
(see [1]).

I will post another message containing a simple use case for the new
interface which streams, decodes and stores only base64-encoded payloads
on the fly and uses the old method for everything else. It additionally
uses two more callables in its Message subclass: get_payload_file() and
is_streamed().

It also contains some comments about unresolved issues, such as how
decoding errors should properly be dealt with, and who is responsible
for catching exceptions raised by the new interface so they can't break
the FeedParser itself.

This patch is mostly meant to present my idea for working around the
current "all-in-RAM" situation (in Python 2 and Python 3).
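To make the shape of the proposed interface concrete, here is a minimal
sketch of a subclass that spools raw payload lines to a temporary file
instead of a list. This is illustrative only (sticking with the
Python 2 code the diff targets): the SpooledTemporaryFile choice and the
read-back in finalize_payload() merely show the data flow, they are not
part of the proposal.

import tempfile
from email import message

class SpoolingMessage(message.Message):
    ''' Illustrative sketch of the proposed streaming interface. '''

    def start_payload_chunks(self):
        # Keep small payloads in memory, spill to disk past 1 MB.
        self._spool = tempfile.SpooledTemporaryFile(max_size=1 << 20)

    def append_payload_chunk(self, line):
        self._spool.write(line)

    def finalize_payload(self):
        self._spool.seek(0)
        # A real implementation would keep the file handle around;
        # reading everything back here defeats the purpose and only
        # demonstrates where the data ends up.
        self.set_payload(self._spool.read())
        self._spool.close()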
Comments, criticism and suggestions on how to proceed to get a feature
like this merged (and, in the best case, also backported to Python 2)
are more than welcome.

[1] https://hg.python.org/cpython/file/78986c99dd6c/Lib/email/feedparser.py#l462

[2] feedparser.diff:

--- /usr/lib/python2.7/email/feedparser.py	2014-03-13 10:54:56.000000000 +0000
+++ feedparser_stream.py	2015-05-25 03:02:09.000000000 +0000
@@ -137,9 +137,10 @@
 class FeedParser:
     """A feed-style parser of email."""
 
-    def __init__(self, _factory=message.Message):
+    def __init__(self, _factory=message.Message, **kwargs):
         """_factory is called with no arguments to create a new message obj"""
         self._factory = _factory
+        self._factory_kwargs = kwargs
         self._input = BufferedSubFile()
         self._msgstack = []
         self._parse = self._parsegen().next
@@ -175,7 +176,7 @@
         return root
 
     def _new_message(self):
-        msg = self._factory()
+        msg = self._factory(**self._factory_kwargs)
         if self._cur and self._cur.get_content_type() == 'multipart/digest':
             msg.set_default_type('message/rfc822')
         if self._msgstack:
@@ -420,6 +421,22 @@
             return
         # Otherwise, it's some non-multipart type, so the entire rest of the
         # file contents becomes the payload.
+
+        # Test for the message streaming interface.
+        if hasattr(self._cur, 'start_payload_chunks') \
+           and callable(self._cur.start_payload_chunks):
+            _cur = self._cur
+            _cur.start_payload_chunks()
+            for line in self._input:
+                if line is NeedMoreData:
+                    yield NeedMoreData
+                    continue
+                _cur.append_payload_chunk(line)
+            _cur.finalize_payload()
+            return
+
+        # Streaming interface not available:
+        # fall back to the legacy all-in-RAM code.
         lines = []
         for line in self._input:
             if line is NeedMoreData:

From pytonic at i2pmail.org  Mon May 25 23:33:21 2015
From: pytonic at i2pmail.org (PyTonic)
Date: Mon, 25 May 2015 21:33:21 +0000
Subject: [Email-SIG] use_case.py
In-Reply-To: 
References: 
Message-ID: 

On 05/25/2015 07:04 PM, PyTonic wrote:
> I will post another message containing a simple use case for the new
> interface which only streams, decodes and stores base64 encoded
> payloads on the fly and uses the old method for everything else. It
> additionally uses two more callables inside its Message subclass:
> get_payload_file() and is_streamed().
>
> It also contains some comments about unresolved issues like how
> decoding errors should be properly dealt with. And who is responsible
> for catching exceptions raised by the new interfaces so they can't
> break the FeedParser itself.
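In short, the attached use case drives the patched parser like this
(the temp-file helper below is a made-up stand-in; the attachment
defines its own):

def make_payload_tmp(msg):
    # Application-supplied choice of where decoded payloads go;
    # this path is only an example.
    return open('/tmp/payload.bin', 'w+b')

parser = FeedParser(_factory=StreamAndDecodeOnlyBase64Message,
                    tmp_file_creator=make_payload_tmp)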
Attached as use_case.txt

-------------- next part --------------
from email import message
from feedparser_stream import FeedParser


class OldMessageList(message.Message):
    ''' Same as message.Message '''

    def start_payload_chunks(self):
        self._payload_buffer = list()

    def append_payload_chunk(self, line):
        self._payload_buffer.append(line)

    def finalize_payload(self):
        self.set_payload(''.join(self._payload_buffer))
        del self._payload_buffer


from cStringIO import StringIO

class OldMessagecStringIO(message.Message):
    ''' Same as message.Message but using cStringIO instead of list() '''

    def start_payload_chunks(self):
        self._payload_buffer = StringIO()

    def append_payload_chunk(self, line):
        self._payload_buffer.write(line)

    def finalize_payload(self):
        self.set_payload(self._payload_buffer.getvalue())
        self._payload_buffer.close()
        del self._payload_buffer


from binascii import a2b_base64

class StreamAndDecodeOnlyBase64Message(message.Message):

    def __init__(self, *args, **kwargs):
        '''
        This class does almost everything like the default
        message.Message class, but will decode and store base64
        payloads on the fly without storing the full payload in RAM
        (twice). If the CTE is not set to base64 it should behave as
        usual.

        For this to (somewhat) work:

        1) one should be able to pass kwargs to FeedParser().
           FeedParser then passes those to the supplied factory class
           when creating new message objects.
        2) FeedParser has to call the message object to add new payload
           lines instead of collecting them locally and then setting a
           str.

        See the changes in feedparser.py for an experimental version.

        There are some things left out, like catching decoding
        exceptions and adding those to defects. It's also unclear how
        to proceed in such a situation. Currently message.Message
        silently delivers the encoded parts in get_payload() if
        decoding fails. This is not the right thing to do if a user
        requests decoding.

        There is also no check that start_payload_chunks() was actually
        called before appending new lines.
        '''
        self._create_tmp_file = kwargs.pop(
            'tmp_file_creator',
            lambda msg: open('/tmp/bad_fallback', 'r+b')
        )
        message.Message.__init__(self, *args, **kwargs)

    def start_payload_chunks(self):
        if self.get('content-transfer-encoding', '').lower() == 'base64':
            assert(callable(self._create_tmp_file))
            self._payload_file = self._create_tmp_file(self)
            self._payload_file_start = self._payload_file.tell()
            self._left_over = ''
            self._is_base64 = True
            self.append_payload_chunk = self._append_payload_chunk_file
        else:
            self._payload_buffer = list()
            self._is_base64 = False
            self.append_payload_chunk = self._append_payload_chunk_memory

    def _append_payload_chunk_memory(self, line):
        self._payload_buffer.append(line)

    def _append_payload_chunk_file(self, line):
        # Base64 specific: decode only complete groups of four encoded
        # characters and carry the remainder over to the next call.
        line = self._left_over + line.rstrip()
        mod = (len(line) % 4) * -1
        if mod != 0:
            self._left_over = line[mod:]
            line = line[:mod]
        self._payload_file.write(a2b_base64(line))
        if mod == 0:
            self._left_over = ''

    def finalize_payload(self):
        if not self._is_base64:
            self.set_payload(''.join(self._payload_buffer))
            del self._payload_buffer
        else:
            '''
            It is unclear to me how get_payload() could be modified to
            deliver either a filename or a file object in this case
            without breaking existing code. Regardless, it should *not*
            hold the full decoded content in memory.

            len(self._left_over) > 0 should raise a decoding exception
            which should be added to defects. Not sure where, here or
            within FeedParser(). See the last part of the __init__
            comment.
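            (Worked example of the carry logic in
            _append_payload_chunk_file, added here for illustration:
            with _left_over == '' and an incoming encoded line of 10
            characters, mod is -2, so the first 8 characters are
            decoded right away and the last 2 are carried over to the
            next call. A non-empty _left_over at this point therefore
            means the total encoded length was not a multiple of 4,
            i.e. the payload was truncated or corrupted.)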
            '''
            self._payload_file.seek(self._payload_file_start)
            self.set_payload('')

    def is_streamed(self):
        return self._is_base64

    def get_payload_file(self):
        assert(self._is_base64)
        return self._payload_file


if __name__ == '__main__':
    import sys
    from hashlib import md5

    def _create_temporary_file(msg):
        ''' Just some test dummy '''
        return open('/tmp/some.very_large_payload', 'r+b')

    def show_parts(msg, level=0):
        ''' Just some debugging dummy '''
        _fmt = "{sp:\t<{lvl}}{part}: {mime} as {charset} via {encoding}:\t{hash}"
        _fmt_kw = {
            'sp': '',
            'lvl': level,
            'mime': msg.get_content_type(),
            'charset': msg.get_content_charset(),
            'encoding': msg.get('content-transfer-encoding', 'unknown')
        }
        if msg.is_multipart():
            print _fmt.format(part='multipart', hash='', **_fmt_kw)
            for part in msg.get_payload():
                show_parts(part, level=level + 1)
            return
        if hasattr(msg, 'is_streamed') and msg.is_streamed():
            _checksum = md5()
            with msg.get_payload_file() as _payload_file:
                while True:
                    _chunk = _payload_file.read(_checksum.block_size)
                    if len(_chunk) == 0:
                        break
                    _checksum.update(_chunk)
            _checksum = _checksum.hexdigest() + ' (streamed)'
        else:
            _checksum = md5(msg.get_payload(decode=True)).hexdigest()
        print _fmt.format(part='single part', hash=_checksum, **_fmt_kw)

    # Init two parser instances
    stream_parser = FeedParser(
        _factory=StreamAndDecodeOnlyBase64Message,
        tmp_file_creator=_create_temporary_file
    )
    default_parser = FeedParser()

    # And test
    for name, parser in (('default', default_parser), ('stream', stream_parser)):
        print "\nUsing %s parser:" % name
        with open(sys.argv[1], 'rb') as some_largish_mime_message:
            for line in some_largish_mime_message:
                parser.feed(line)
        msg = parser.close()
        show_parts(msg, level=1)

From rdmurray at bitdance.com  Tue May 26 16:11:48 2015
From: rdmurray at bitdance.com (R. David Murray)
Date: Tue, 26 May 2015 10:11:48 -0400
Subject: [Email-SIG] use_case.py
In-Reply-To: 
References: 
Message-ID: <20150526141148.CBC7DB18087@webabinitio.net>

On Mon, 25 May 2015 21:33:21 -0000, PyTonic wrote:
> On 05/25/2015 07:04 PM, PyTonic wrote:
> > I will post another message containing a simple use case for the new
> > interface which only streams, decodes and stores base64 encoded
> > payloads on the fly and uses the old method for everything else. It
> > additionally uses two more callables inside its Message subclass:
> > get_payload_file() and is_streamed().
> >
> > It also contains some comments about unresolved issues like how
> > decoding errors should be properly dealt with. And who is responsible
> > for catching exceptions raised by the new interfaces so they can't
> > break the FeedParser itself.

It is great that you are interested in working on this. Providing a way
to process large emails without the current crazy memory consumption is
a goal of mine, and we'll happily work with you toward making that a
reality.

However, anything along these lines is going to be a new feature, and
therefore can only target 3.6 at this point, so any patch proposals need
to be against the default branch of the cpython repository.

In Python 3 we now have the policy framework. I'm pretty sure it makes
sense to leverage that for the new internal API. I agree that feedparser
itself will need some changes in order to make all this work correctly.
(Also, feedparser has gotten a couple of non-trivial performance
enhancements in Python 3, so some of the code is different.)

Note that the generator needs similar changes, and that problem may be
much harder to solve, since the current algorithm is recursive and holds
*everything* in memory.
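For concreteness, the kind of thing I have in mind might look roughly
like this; every name here is invented for the sketch and would need
actual design work:

from email.policy import EmailPolicy
from email.feedparser import BytesFeedParser

class StreamingPolicy(EmailPolicy):
    # Hypothetical knobs: bodies larger than stream_threshold bytes
    # would be spooled into spool_dir instead of being accumulated in
    # memory by the parser. Neither attribute exists in the email
    # package today.
    stream_threshold = 1 << 20
    spool_dir = None

parser = BytesFeedParser(policy=StreamingPolicy(spool_dir='/var/tmp'))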
In addition, it seems like this would be a natural (even necessary?)
place to introduce the 'store bodies on disk' interface that we've
wanted for quite a while.

Can you take a look at the policy framework and reformulate your
proposal in light of that? We can certainly work on this one piece at a
time; we just want to keep all of the moving parts in mind while we do
so...

--David