[Python-Dev] Patch making the current email package (mostly) support bytes

Thu Oct 7 02:46:08 CEST 2010

Stephen J. Turnbull <stephen <at> xemacs.org> writes:
> R. David Murray writes:
>  > We're (in the current patch) not punting on handling non-conforming
>  > email, we're punting on handling non-conforming bytes *if the headers
>  > that contain them need to be modified*.  The headers can still be
>  > modified, you just (currently) lose the non-ASCII bytes in the process.
> 
> Modified *or examined*.  I can't think of any important applications
> offhand that *need* to examine the non-ASCII bytes (in particular,
> Mailman doesn't need to do that).  Verbatim copying of the bytes
> themselves is almost always the desired usage.

Mmm.  Yes, or examined.  If we allow escaped bytes to be returned, perhaps
we also should provide a helper that "unescapes" the bytes and returns
the byte string (yes, this is just a call to encode, but by wrapping it
we continue to hide the surrogateescape implementation detail.)
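
As a minimal sketch of such a helper (the name escaped_to_bytes is
hypothetical and not part of the patch), assuming the escaped string
was produced by decoding as ASCII with the surrogateescape handler:

```python
def escaped_to_bytes(value):
    """Return the original bytes behind a surrogateescape-decoded string.

    Hypothetical helper name; it only wraps str.encode so that callers
    never have to name the error handler themselves.
    """
    return value.encode('ascii', 'surrogateescape')


# Non-ASCII bytes as they might arrive in a raw header value.
raw = b'p\xc3\xb6stal'

# Decoding with surrogateescape maps each byte >= 0x80 to a lone
# surrogate in the range U+DC80..U+DCFF instead of raising.
escaped = raw.decode('ascii', 'surrogateescape')

# The helper reverses that mapping exactly.
restored = escaped_to_bytes(escaped)
```

The caller sees only "string in, bytes out", so the escaping mechanism
could change later without breaking code that uses the wrapper.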

>  > And robustness is not the issue, only extended-beyond-the-RFCs handling
>  > of non-conforming bytes would be an issue.
> 
> And with that, I'm certain that Jon Postel is really dead. 

A goal for email6 is to be *at least* as Postel compliant as email4.
The goal for my patch is to make email5.1 more Postel compliant than
email5.0 is :)

>  > > (Surely you are not saying that Generator.flatten can't DTRT with
>  > > non-ASCII content *at all*?)
>  > 
>  > Yes, that is *exactly* what I am saying:
>  > 
>  > >>> m = email.message_from_string("""\
>  > ... From: pöstal
>  > ...   
>  > ... """)
>  > >>> str(m)
>  > Traceback (most recent call last):
>  >   ....
>  > UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128)
> 
> But that's not interesting; you did that with Python 3.  We want to

Of course I did it with Python3.  It's the Python3 email codebase
I'm working with (and have to work *around*).

> know what people porting from Python 2 will expect.  So, in 2.5.5 or
> 2.6.6 on Mac, with email v4.0.2, it *doesn't* raise, it returns
> 
> wideload:~ 4:14$ python
> Python 2.5.5 (r255:77872, Jul 13 2010, 03:03:57) 
> [GCC 4.0.1 (Apple Inc. build 5490)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import email
> >>> m=email.message_from_string('From: pöstal\n\n')
> >>> str(m)
> 'From nobody Thu Oct  7 04:18:25 2010\nFrom: p\xc3\xb6stal\n\n'
> >>> m['From']
> 'p\xc3\xb6stal'
> >>> 
> 
> That's hardly helpful!  Surely we can and should do better than that
> now, especially since UTF-8 (with a proper CTE) is now almost
> universally acceptable to MUAs.  When would it be a problem for that
> to return
> 
> 'From nobody Thu Oct  7 04:18:25 2010\nFrom: =?UTF-8?Q?p=C3=B6stal?=\n\n'

What's wrong with that is that when we parse the bytes of the message
we don't know that b'\xc3\xb6' == '=?UTF-8?Q?=C3=B6?='.  It isn't even
all that likely to be true, since I would guess that latin1 is still
more common than utf-8 (but you might know better).
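
To illustrate the ambiguity, here is a small sketch using the standard
email.header API.  The same two bytes decode without error under either
charset, so the encoded word we would emit depends entirely on a guess:

```python
from email.header import Header, decode_header, make_header

# The bytes found in the unlabelled header.
raw = b'p\xc3\xb6stal'

# Both interpretations succeed; nothing in the bytes says which is right.
as_utf8 = raw.decode('utf-8')      # one character, o with diaeresis
as_latin1 = raw.decode('latin-1')  # two characters, A-tilde + pilcrow

# Each guess yields a different RFC 2047 encoded word.
h_utf8 = Header(as_utf8, 'utf-8').encode()
h_latin1 = Header(as_latin1, 'iso-8859-1').encode()

# A receiving MUA trusts the charset label, so each header decodes
# back to the text that was guessed, right or wrong.
round_utf8 = str(make_header(decode_header(h_utf8)))
round_latin1 = str(make_header(decode_header(h_latin1)))
```

If the guess is wrong, the recipient sees mojibake that is now
correctly labelled, which is arguably worse than passing the original
bytes through untouched.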

>  > Remember, email5 is a direct translation of email4, and email4 only
>  > handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the-
>  > -ride-fine-we'll-pass-them-along.  So if you want to put non-ASCII
>  > data into a message you have to encode it properly to ASCII in
>  > exactly the same way that you did in email4:
> 
> But if you do it right, then it will still work in a version that just
> encodes non-ASCII characters in UTF-8 with the appropriate CTE.  Since
> you'll never be passing it non-ASCII characters, it's already ASCII
> and UTF-8, and no CTE will be needed.

So you are suggesting that I should use U+FFFD encoded as UTF-8
rather than '?' as the substitution character?  But earlier you said
that people would probably rather not be forced to deal with Unicode
just because there are invalid bytes in the message.  So that's
probably not what you meant.
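
For concreteness, the two substitution choices look like this at the
codec level.  This is only a sketch of the standard error handlers,
not of the code in the patch:

```python
raw = b'p\xc3\xb6stal'

# Option 1: keep the result pure ASCII.  Each escaped byte becomes '?'.
escaped = raw.decode('ascii', 'surrogateescape')
with_qmark = escaped.encode('ascii', 'replace')

# Option 2: substitute U+FFFD REPLACEMENT CHARACTER for each bad byte.
# The result is no longer ASCII, so it has to be written out as UTF-8
# (and, in a header, wrapped in an encoded word).
with_fffd = raw.decode('ascii', 'replace')
with_fffd_bytes = with_fffd.encode('utf-8')
```

Option 1 leaves the caller with plain ASCII; option 2 forces every
consumer to handle Unicode even though the original content is lost
either way.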

Presumably you are suggesting that email5 be smart enough to turn my
example into properly UTF-8/CTE encoded text.  But *that* problem is what
email6 is trying to address.  It just doesn't look practical to address it
directly in the email5 code base, because the email4 codebase that email5
inherits does not provide the correct distinction between bytes and text.
email5 is parsing the input stream *as if* it were ASCII-only CTE text.
I'm trying to extend it to also handle non-ASCII bytes gracefully.
Extending it to actually handle unicode input is a whole different kettle
of sushi[*].
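
A sketch of what "handle non-ASCII bytes gracefully" means in practice,
written against the bytes API as it is available in Python 3.2 and
later (message_from_bytes and BytesGenerator).  It assumes the default
policy and is not necessarily identical to the patch discussed here:

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# A non-conforming message: raw 8-bit bytes in a header, no encoded word.
raw = b'From: p\xc3\xb6stal\n\nbody\n'

# Parse from bytes; the non-ASCII bytes are carried along internally.
msg = email.message_from_bytes(raw)

# Flatten back to bytes without modifying the header.
buf = BytesIO()
BytesGenerator(buf, mangle_from_=False).flatten(msg)
flattened = buf.getvalue()
```

As long as the header is not modified, the bytes come back out
verbatim, which is the copy-through case that matters most.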

>  > Yes, exactly.  I need to fix the patch to recode using, say,
>  > quoted-printable in that case.
> 
> It really should check for proportions of non-ASCII.  QP would be
> horrible for Japanese or Chinese.

Noted.
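
One possible shape for such a check, as a rough sketch only.  The
function name and the 1/6 threshold are assumptions: the threshold
comes from estimating quoted-printable at len + 2*high bytes and
base64 at 4*len/3 bytes, ignoring line wrapping, '=' escapes, and
line-length limits:

```python
def choose_cte(data):
    """Pick a Content-Transfer-Encoding for a bytes payload.

    Hypothetical helper; the size estimates are approximations.
    """
    # Count bytes that cannot be sent as 7-bit ASCII.
    high = sum(1 for b in data if b > 127)
    if high == 0:
        return '7bit'
    # QP costs about 2 extra bytes per non-ASCII byte; base64 costs
    # about a third extra overall.  QP is smaller when high/len < 1/6.
    if high * 6 < len(data):
        return 'quoted-printable'
    return 'base64'


mostly_ascii = 'p\xf6stal delivery notice'.encode('utf-8')
japanese = '\u65e5\u672c\u8a9e'.encode('utf-8')
```

With this estimate, mostly-ASCII European text stays readable as
quoted-printable, while Japanese or Chinese text, where nearly every
byte is non-ASCII, goes to base64.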

>  > DecodedGenerator could still produce the unicode, though, which is
>  > what I believe we want.  (Although that raises the question of
>  > whether DecodedGenerator should also decode the RFC2047 encoded
>  > headers....but that raises a backward compatibility issue).
> 
> Can't really help you there.  While I would want the RFC 2047 headers
> decoded if I were writing new code (which is generally the case for
> me), I haven't really wrapped my head around the issues of porting old
> code using Python2 str to Python3 str here.  My intuition says "no
> problem" (there won't be any MIME-words so the app won't try to decode
> them), but I'm not real sure of that. 

Thinking about this further, I think it is unlikely that an application
using DecodedGenerator would be further processing the headers generated
by it, so I think this is probably a safe enough change, given that
there are few if any Python3 email handling applications at this point.
If anyone knows of a Python2 application that does post-process
DecodedGenerator headers, please let me know.

--David

[*] And I've had an argument with someone who thinks email should
*not* be extended to handle unicode messages with non-ASCII
content, on the grounds that they aren't really email.