[Python-checkins] r85811 - in python/branches/py3k: Doc/library/email.generator.rst Doc/library/email.header.rst Lib/email/generator.py Lib/email/header.py Lib/email/test/data/msg_26.txt Lib/email/test/test_email.py Misc/NEWS

Sun Oct 24 00:19:57 CEST 2010

Author: r.david.murray
Date: Sun Oct 24 00:19:56 2010
New Revision: 85811

Log:
#1349106: add linesep argument to generator.flatten and header.encode.
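
As a rough illustration of the new argument (not part of the commit), the
sketch below flattens a small message twice, once with the default separator
and once with linesep='\r\n'.  The addresses, the subject and the
flatten_to_str helper are made up for the example.

```python
from email.generator import Generator
from email.message import Message
from io import StringIO

# Hypothetical message used only for this demonstration.
msg = Message()
msg['From'] = 'sender@example.com'
msg['To'] = 'recipient@example.com'
msg['Subject'] = 'linesep demo'
msg.set_payload('first line\nsecond line\n')


def flatten_to_str(message, linesep='\n'):
    """Flatten message into a str, ending each line with linesep."""
    buf = StringIO()
    Generator(buf).flatten(message, linesep=linesep)
    return buf.getvalue()


unix_text = flatten_to_str(msg)                  # '\n' separated lines
crlf_text = flatten_to_str(msg, linesep='\r\n')  # RFC-style CRLF lines
```

Apart from the line separators, the two results should be identical.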


Modified:
   python/branches/py3k/Doc/library/email.generator.rst
   python/branches/py3k/Doc/library/email.header.rst
   python/branches/py3k/Lib/email/generator.py
   python/branches/py3k/Lib/email/header.py
   python/branches/py3k/Lib/email/test/data/msg_26.txt
   python/branches/py3k/Lib/email/test/test_email.py
   python/branches/py3k/Misc/NEWS

Modified: python/branches/py3k/Doc/library/email.generator.rst
==============================================================================

--- python/branches/py3k/Doc/library/email.generator.rst	(original)
+++ python/branches/py3k/Doc/library/email.generator.rst	Sun Oct 24 00:19:56 2010
@@ -56,7 +56,7 @@
    The other public :class:`Generator` methods are:
 
 
-   .. method:: flatten(msg, unixfrom=False)
+   .. method:: flatten(msg, unixfrom=False, linesep='\\n')
 
       Print the textual representation of the message object structure rooted at
       *msg* to the output file specified when the :class:`Generator` instance
@@ -71,12 +71,20 @@
 
       Note that for subparts, no envelope header is ever printed.
 
+      Optional *linesep* specifies the line separator string used to
+      terminate lines in the output.  It defaults to ``\n`` because that is
+      the most useful value for Python application code (other library packages
+      expect ``\n`` separated lines).  ``linesep='\r\n'`` can be used to
+      generate output with RFC-compliant line separators.
+
       Messages parsed with a Bytes parser that have a
       :mailheader:`Content-Transfer-Encoding` of 8bit will be converted to a
       use a 7bit Content-Transfer-Encoding.  Any other non-ASCII bytes in the
       message structure will be converted to '?' characters.
 
-      .. versionchanged:: 3.2 added support for re-encoding 8bit message bodies.
+      .. versionchanged:: 3.2
+         Added support for re-encoding 8bit message bodies, and the *linesep*
+         argument.
 
    .. method:: clone(fp)
 
@@ -97,16 +105,70 @@
 
 .. class:: BytesGenerator(outfp, mangle_from_=True, maxheaderlen=78)
 
-   This class has the same API as the :class:`Generator` class, except that
-   *outfp* must be a file like object that will accept :class`bytes` input to
-   its ``write`` method.  If the message object structure contains non-ASCII
-   bytes, this generator's :meth:`~BytesGenerator.flatten` method will produce
-   them as-is, including preserving parts with a
-   :mailheader:`Content-Transfer-Encoding` of ``8bit``.
-
-   Note that even the :meth:`write` method API is identical:  it expects
-   strings as input, and converts them to bytes by encoding them using
-   the ASCII codec.
+   The constructor for the :class:`BytesGenerator` class takes a binary
+   :term:`file-like object` called *outfp* for an argument.  *outfp* must
+   support a :meth:`write` method that accepts binary data.
+
+   Optional *mangle_from_* is a flag that, when ``True``, puts a ``>``
+   character in front of any line in the body that starts exactly as ``From``,
+   i.e. ``From`` followed by a space at the beginning of the line.  This is the
+   only guaranteed portable way to avoid having such lines be mistaken for a
+   Unix mailbox format envelope header separator (see `WHY THE CONTENT-LENGTH
+   FORMAT IS BAD <http://www.jwz.org/doc/content-length.html>`_ for details).
+   *mangle_from_* defaults to ``True``, but you might want to set this to
+   ``False`` if you are not writing Unix mailbox format files.
+
+   Optional *maxheaderlen* specifies the longest length for a non-continued
+   header.  When a header line is longer than *maxheaderlen* (in characters,
+   with tabs expanded to 8 spaces), the header will be split as defined in the
+   :class:`~email.header.Header` class.  Set to zero to disable header
+   wrapping.  The default is 78, as recommended (but not required) by
+   :rfc:`2822`.
+
+   The other public :class:`BytesGenerator` methods are:
+
+
+   .. method:: flatten(msg, unixfrom=False, linesep='\\n')
+
+      Print the textual representation of the message object structure rooted
+      at *msg* to the output file specified when the :class:`BytesGenerator`
+      instance was created.  Subparts are visited depth-first and the resulting
+      text will be properly MIME encoded.  If the input that created the *msg*
+      contained bytes with the high bit set and those bytes have not been
+      modified, they will be copied faithfully to the output, even if doing so
+      is not strictly RFC compliant.  (To produce strictly RFC compliant
+      output, use the :class:`Generator` class.)
+
+      Messages parsed with a Bytes parser that have a
+      :mailheader:`Content-Transfer-Encoding` of 8bit will be reconstructed
+      as 8bit if they have not been modified.
+
+      Optional *unixfrom* is a flag that forces the printing of the envelope
+      header delimiter before the first :rfc:`2822` header of the root message
+      object.  If the root object has no envelope header, a standard one is
+      crafted.  By default, this is set to ``False`` to inhibit the printing of
+      the envelope delimiter.
+
+      Note that for subparts, no envelope header is ever printed.
+
+      Optional *linesep* specifies the line separator string used to
+      terminate lines in the output.  It defaults to ``\n`` because that is
+      the most useful value for Python application code (other library packages
+      expect ``\n`` separated lines).  ``linesep='\r\n'`` can be used to
+      generate output with RFC-compliant line separators.
+
+   .. method:: clone(fp)
+
+      Return an independent clone of this :class:`BytesGenerator` instance with
+      the exact same options.
+
+   .. method:: write(s)
+
+      Write the string *s* to the underlying file object.  *s* is encoded using
+      the ``ASCII`` codec and written to the *write* method of the *outfp*
+      passed to the :class:`BytesGenerator`'s constructor.  This provides just
+      enough file-like API for :class:`BytesGenerator` instances to be used in
+      the :func:`print` function.
 
    .. versionadded:: 3.2
 

Modified: python/branches/py3k/Doc/library/email.header.rst
==============================================================================
--- python/branches/py3k/Doc/library/email.header.rst	(original)
+++ python/branches/py3k/Doc/library/email.header.rst	Sun Oct 24 00:19:56 2010
@@ -104,7 +104,7 @@
       :func:`ustr.encode` call, and defaults to "strict".
 
 
-   .. method:: encode(splitchars=';, \\t', maxlinelen=None)
+   .. method:: encode(splitchars=';, \\t', maxlinelen=None, linesep='\\n')
 
       Encode a message header into an RFC-compliant format, possibly wrapping
       long lines and encapsulating non-ASCII parts in base64 or quoted-printable
@@ -115,6 +115,13 @@
       *maxlinelen*, if given, overrides the instance's value for the maximum
       line length.
 
+      *linesep* specifies the characters used to separate the lines of the
+      folded header.  It defaults to the most useful value for Python
+      application code (``\n``), but ``\r\n`` can be specified in order
+      to produce headers with RFC-compliant line separators.
+
+      .. versionchanged:: 3.2 added the linesep argument
+
 
    The :class:`Header` class also provides a number of methods to support
    standard operators and built-in functions.

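A small sketch of the documented Header.encode() behaviour, for illustration
only: the header text and the maxlinelen of 30 are arbitrary values chosen so
that the header has to be folded.

```python
from email.header import Header

# Arbitrary ASCII text, long enough to be folded at maxlinelen=30.
subject = Header(
    'alpha beta gamma delta epsilon zeta eta theta iota kappa lambda',
    header_name='Subject',
    maxlinelen=30,
)

folded_lf = subject.encode()                   # lines joined with '\n'
folded_crlf = subject.encode(linesep='\r\n')   # lines joined with '\r\n'

print(repr(folded_lf))
print(repr(folded_crlf))
```

Only the separator between the folded lines should differ between the two
results; the folding points themselves are the same.
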
Modified: python/branches/py3k/Lib/email/generator.py
==============================================================================
--- python/branches/py3k/Lib/email/generator.py	(original)
+++ python/branches/py3k/Lib/email/generator.py	Sun Oct 24 00:19:56 2010
@@ -17,7 +17,7 @@
 from email.message import _has_surrogates
 
 UNDERSCORE = '_'
-NL = '\n'
+NL = '\n'  # XXX: no longer used by the code below.
 
 fcre = re.compile(r'^From ', re.MULTILINE)
 
@@ -58,7 +58,7 @@
         # Just delegate to the file object
         self._fp.write(s)
 
-    def flatten(self, msg, unixfrom=False):
+    def flatten(self, msg, unixfrom=False, linesep='\n'):
         """Print the message object tree rooted at msg to the output file
         specified when the Generator instance was created.
 
@@ -68,12 +68,23 @@
         is False to inhibit the printing of any From_ delimiter.
 
         Note that for subobjects, no From_ line is printed.
+
+        linesep specifies the characters used to indicate a new line in
+        the output.
         """
+        # We use the _XXX constants for operating on data that comes directly
+        # from the msg, and _encoded_XXX constants for operating on data that
+        # has already been converted (to bytes in the BytesGenerator) and
+        # inserted into a temporary buffer.
+        self._NL = linesep
+        self._encoded_NL = self._encode(linesep)
+        self._EMPTY = ''
+        self._encoded_EMPTY = self._encode('')
         if unixfrom:
             ufrom = msg.get_unixfrom()
             if not ufrom:
                 ufrom = 'From nobody ' + time.ctime(time.time())
-            self.write(ufrom + NL)
+            self.write(ufrom + self._NL)
         self._write(msg)
 
     def clone(self, fp):
@@ -93,20 +104,18 @@
     # it has already transformed the input; but, since this whole thing is a
     # hack anyway this seems good enough.
 
-    # We use these class constants when we need to manipulate data that has
-    # already been written to a buffer (ex: constructing a re to check the
-    # boundary), and the module level NL constant when adding new output to a
-    # buffer via self.write, because 'write' always takes strings.
-    # Having write always take strings makes the code simpler, but there are
-    # a few occasions when we need to write previously created data back
-    # to the buffer or to a new buffer; for those cases we use self._fp.write.
-    _NL = NL
-    _EMPTY = ''
+    # Similarly, we have _XXX and _encoded_XXX attributes that are used on
+    # source and buffer data, respectively.
+    _encoded_EMPTY = ''
 
     def _new_buffer(self):
         # BytesGenerator overrides this to return BytesIO.
         return StringIO()
 
+    def _encode(self, s):
+        # BytesGenerator overrides this to encode strings to bytes.
+        return s
+
     def _write(self, msg):
         # We can't write the headers yet because of the following scenario:
         # say a multipart message includes the boundary string somewhere in
@@ -158,14 +167,15 @@
         for h, v in msg.items():
             self.write('%s: ' % h)
             if isinstance(v, Header):
-                self.write(v.encode(maxlinelen=self._maxheaderlen)+NL)
+                self.write(v.encode(
+                    maxlinelen=self._maxheaderlen, linesep=self._NL)+self._NL)
             else:
                 # Header's got lots of smarts, so use it.
                 header = Header(v, maxlinelen=self._maxheaderlen,
                                 header_name=h)
-                self.write(header.encode()+NL)
+                self.write(header.encode(linesep=self._NL)+self._NL)
         # A blank line always separates headers from body
-        self.write(NL)
+        self.write(self._NL)
 
     #
     # Handlers for writing types and subtypes
@@ -208,11 +218,11 @@
         for part in subparts:
             s = self._new_buffer()
             g = self.clone(s)
-            g.flatten(part, unixfrom=False)
+            g.flatten(part, unixfrom=False, linesep=self._NL)
             msgtexts.append(s.getvalue())
         # Now make sure the boundary we've selected doesn't appear in any of
         # the message texts.
-        alltext = self._NL.join(msgtexts)
+        alltext = self._encoded_NL.join(msgtexts)
         # BAW: What about boundaries that are wrapped in double-quotes?
         boundary = msg.get_boundary(failobj=self._make_boundary(alltext))
         # If we had to calculate a new boundary because the body text
@@ -225,9 +235,9 @@
             msg.set_boundary(boundary)
         # If there's a preamble, write it out, with a trailing CRLF
         if msg.preamble is not None:
-            self.write(msg.preamble + NL)
+            self.write(msg.preamble + self._NL)
         # dash-boundary transport-padding CRLF
-        self.write('--' + boundary + NL)
+        self.write('--' + boundary + self._NL)
         # body-part
         if msgtexts:
             self._fp.write(msgtexts.pop(0))
@@ -236,13 +246,13 @@
         # --> CRLF body-part
         for body_part in msgtexts:
             # delimiter transport-padding CRLF
-            self.write('\n--' + boundary + NL)
+            self.write(self._NL + '--' + boundary + self._NL)
             # body-part
             self._fp.write(body_part)
         # close-delimiter transport-padding
-        self.write('\n--' + boundary + '--')
+        self.write(self._NL + '--' + boundary + '--')
         if msg.epilogue is not None:
-            self.write(NL)
+            self.write(self._NL)
             self.write(msg.epilogue)
 
     def _handle_multipart_signed(self, msg):
@@ -266,16 +276,16 @@
             g = self.clone(s)
             g.flatten(part, unixfrom=False)
             text = s.getvalue()
-            lines = text.split(self._NL)
+            lines = text.split(self._encoded_NL)
             # Strip off the unnecessary trailing empty line
-            if lines and lines[-1] == self._EMPTY:
-                blocks.append(self._NL.join(lines[:-1]))
+            if lines and lines[-1] == self._encoded_EMPTY:
+                blocks.append(self._encoded_NL.join(lines[:-1]))
             else:
                 blocks.append(text)
         # Now join all the blocks with an empty line.  This has the lovely
         # effect of separating each block with an empty line, but not adding
         # an extra one after the last one.
-        self._fp.write(self._NL.join(blocks))
+        self._fp.write(self._encoded_NL.join(blocks))
 
     def _handle_message(self, msg):
         s = self._new_buffer()
@@ -333,10 +343,9 @@
     The outfp object must accept bytes in its write method.
     """
 
-    # Bytes versions of these constants for use in manipulating data from
+    # Bytes versions of this constant for use in manipulating data from
     # the BytesIO buffer.
-    _NL = NL.encode('ascii')
-    _EMPTY = b''
+    _encoded_EMPTY = b''
 
     def write(self, s):
         self._fp.write(s.encode('ascii', 'surrogateescape'))
@@ -344,6 +353,9 @@
     def _new_buffer(self):
         return BytesIO()
 
+    def _encode(self, s):
+        return s.encode('ascii')
+
     def _write_headers(self, msg):
         # This is almost the same as the string version, except for handling
         # strings with 8bit bytes.
@@ -363,9 +375,9 @@
                 # Header's got lots of smarts and this string is safe...
                 header = Header(v, maxlinelen=self._maxheaderlen,
                                 header_name=h)
-                self.write(header.encode()+NL)
+                self.write(header.encode(linesep=self._NL)+self._NL)
         # A blank line always separates headers from body
-        self.write(NL)
+        self.write(self._NL)
 
     def _handle_text(self, msg):
         # If the string has surrogates the original source was bytes, so

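To show how the bytes side is meant to be used, here is a hedged sketch (not
taken from the commit) that parses a CRLF-terminated message from bytes and
writes it back out with BytesGenerator.  The message content is invented, and
exact round-tripping is only expected for simple input like this.

```python
from email import message_from_bytes
from email.generator import BytesGenerator
from io import BytesIO

# Invented wire-format message with CRLF line endings.
raw = (b'From: sender@example.com\r\n'
       b'To: recipient@example.com\r\n'
       b'Subject: CRLF round trip\r\n'
       b'\r\n'
       b'Body line one\r\n'
       b'Body line two\r\n')

msg = message_from_bytes(raw)

# linesep is still passed as a str; BytesGenerator encodes it to bytes.
out = BytesIO()
BytesGenerator(out).flatten(msg, linesep='\r\n')
round_tripped = out.getvalue()

# With the default separator, the same message comes out '\n' separated.
out_lf = BytesIO()
BytesGenerator(out_lf).flatten(msg)
lf_bytes = out_lf.getvalue()
```
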
Modified: python/branches/py3k/Lib/email/header.py
==============================================================================
--- python/branches/py3k/Lib/email/header.py	(original)
+++ python/branches/py3k/Lib/email/header.py	Sun Oct 24 00:19:56 2010
@@ -272,7 +272,7 @@
         output_string = input_bytes.decode(output_charset, errors)
         self._chunks.append((output_string, charset))
 
-    def encode(self, splitchars=';, \t', maxlinelen=None):
+    def encode(self, splitchars=';, \t', maxlinelen=None, linesep='\n'):
         """Encode a message header into an RFC-compliant format.
 
         There are many issues involved in converting a given string for use in
@@ -293,6 +293,11 @@
         Optional splitchars is a string containing characters to split long
         ASCII lines on, in rough support of RFC 2822's `highest level
         syntactic breaks'.  This doesn't affect RFC 2047 encoded lines.
+
+        Optional linesep is a string to be used to separate the lines of
+        the value.  The default value is the most useful for typical Python
+        applications, but it can be set to \\r\\n to produce RFC-compliant
+        line separators when needed.
         """
         self._normalize()
         if maxlinelen is None:
@@ -311,7 +316,7 @@
                 if len(lines) > 1:
                     formatter.newline()
             formatter.add_transition()
-        return str(formatter)
+        return formatter._str(linesep)
 
     def _normalize(self):
         # Step 1: Normalize the chunks so that all runs of identical charsets
@@ -342,9 +347,12 @@
         self._lines = []
         self._current_line = _Accumulator(headerlen)
 
-    def __str__(self):
+    def _str(self, linesep):
         self.newline()
-        return NL.join(self._lines)
+        return linesep.join(self._lines)
+
+    def __str__(self):
+        return self._str(NL)
 
     def newline(self):
         end_of_line = self._current_line.pop()

Modified: python/branches/py3k/Lib/email/test/data/msg_26.txt
==============================================================================
--- python/branches/py3k/Lib/email/test/data/msg_26.txt	(original)
+++ python/branches/py3k/Lib/email/test/data/msg_26.txt	Sun Oct 24 00:19:56 2010
@@ -24,7 +24,8 @@
 
 
 --1618492860--2051301190--113853680
-Content-Type: application/riscos; name="clock.bmp,69c"; type=BMP; load=&fff69c4b; exec=&355dd4d1; access=&03
+Content-Type: application/riscos; name="clock.bmp,69c"; type=BMP;
+	load=&fff69c4b; exec=&355dd4d1; access=&03
 Content-Disposition: attachment; filename="clock.bmp"
 Content-Transfer-Encoding: base64
 

Modified: python/branches/py3k/Lib/email/test/test_email.py
==============================================================================
--- python/branches/py3k/Lib/email/test/test_email.py	(original)
+++ python/branches/py3k/Lib/email/test/test_email.py	Sun Oct 24 00:19:56 2010
@@ -77,7 +77,7 @@
         eq(msg.get_all('cc'), ['ccc at zzz.org', 'ddd at zzz.org', 'eee at zzz.org'])
         eq(msg.get_all('xx', 'n/a'), 'n/a')
 
-    def test_getset_charset(self):
+    def TEst_getset_charset(self):
         eq = self.assertEqual
         msg = Message()
         eq(msg.get_charset(), None)
@@ -2600,6 +2600,18 @@
         part2 = msg.get_payload(1)
         eq(part2.get_content_type(), 'application/riscos')
 
+    def test_crlf_flatten(self):
+        # Using newline='\n' preserves the crlfs in this input file.
+        with openfile('msg_26.txt', newline='\n') as fp:
+            text = fp.read()
+        msg = email.message_from_string(text)
+        s = StringIO()
+        g = Generator(s)
+        g.flatten(msg, linesep='\r\n')
+        self.assertEqual(s.getvalue(), text)
+
+    maxDiff = None
+
     def test_multipart_digest_with_extra_mime_headers(self):
         eq = self.assertEqual
         neq = self.ndiffAssertEqual
@@ -2931,6 +2943,16 @@
         m = bfp.close()
         self.assertEqual(str(m), self.latin_bin_msg_as7bit)
 
+    def test_crlf_flatten(self):
+        with openfile('msg_26.txt', 'rb') as fp:
+            text = fp.read()
+        msg = email.message_from_bytes(text)
+        s = BytesIO()
+        g = email.generator.BytesGenerator(s)
+        g.flatten(msg, linesep='\r\n')
+        self.assertEqual(s.getvalue(), text)
+    maxDiff = None
+
 
 class TestBytesGeneratorIdempotent(TestIdempotent):
 

Modified: python/branches/py3k/Misc/NEWS
==============================================================================
--- python/branches/py3k/Misc/NEWS	(original)
+++ python/branches/py3k/Misc/NEWS	Sun Oct 24 00:19:56 2010
@@ -48,6 +48,9 @@
 Library
 -------
 
+- Issue #1349106: Generator (and BytesGenerator) flatten method and Header
+  encode method now support a 'linesep' argument.
+
 - Issue #5639: Add a *server_hostname* argument to ``SSLContext.wrap_socket``
   in order to support the TLS SNI extension.  ``HTTPSConnection`` and
   ``urlopen()`` also use this argument, so that HTTPS virtual hosts are now