[issue19063] Python 3.3.3 encodes emails containing non-ascii data as 7bit

R. David Murray report at bugs.python.org
Wed Nov 20 20:20:52 CET 2013


R. David Murray added the comment:

Vajrasky: thanks for taking a crack at this, but, well, there are a lot of subtleties involved here, due to the way the organic growth of the email package over many years has led to some really bad design issues.

It took me a lot of time to boot back up my understanding of how all this stuff hangs together (answer: badly).  After wandering down many blind alleys, the problem turns out to be yet one more disconnect in the model.  We previously fixed the issue where if set_payload was passed binary data bad things would happen.  That made the model more consistent, in that _payload was now a surrogateescaped string when the payload was specified as binary data.

But what the model *really* needs is that _payload *always* be an ascii+surrogateescape string, and never a full unicode string.  (Yeah, this is a sucky model...it ought to always be binary instead, but we are dealing with legacy code here.)
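A minimal sketch of what "ascii+surrogateescape string" means, using only built-in str/bytes methods. The helper names (from_wire_bytes, to_wire_bytes, satisfies_invariant) are hypothetical and are not part of the email package:

```python
def from_wire_bytes(data: bytes) -> str:
    """Hypothetical helper: map raw bytes to an ascii+surrogateescape str."""
    # Bytes >= 0x80 become lone surrogates in the range U+DC80..U+DCFF.
    return data.decode("ascii", "surrogateescape")


def to_wire_bytes(payload: str) -> bytes:
    """Hypothetical helper: recover the original bytes from such a str."""
    return payload.encode("ascii", "surrogateescape")


def satisfies_invariant(payload: str) -> bool:
    """True if payload holds only ASCII characters and escaped bytes."""
    try:
        payload.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # A real non-ASCII character such as U+00E9 cannot be encoded here.
        return False
    return True


raw = "caf\u00e9".encode("utf-8")      # the bytes that would go on the wire
payload = from_wire_bytes(raw)         # what the model wants in _payload
```

The point of the round trip is that the string form loses no information: to_wire_bytes(payload) gives back exactly the bytes that went in, whereas a full unicode string such as "caf\u00e9" fails the check.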

Currently it can be a unicode string.  If it is, set_charset turns it into an ascii only string by encoding it with the qp or base64 CTE.  This is pretty much just by luck, though.

If you set body_encoding to None, the encode_7or8bit encoder thinks the string is 7bit. It calls get_payload(decode=True), which, because the model invariant was broken, returns a raw-unicode-escape encoding of the string, and that is a 7bit representation.  That doesn't affect the payload, but it does result in the wrong CTE being used.
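To illustrate why the check is fooled, here is a rough sketch that imitates the "does it decode as ASCII?" test outside the email package. The function looks_7bit is a hypothetical stand-in, not the library's encoder, and the sample text is an assumption chosen from outside the Latin-1 range, since raw-unicode-escape only produces backslash escapes for code points above U+00FF:

```python
def looks_7bit(data: bytes) -> bool:
    """Hypothetical stand-in for a 7bit-vs-8bit check on payload bytes."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


text = "\u0410\u0411\u0412"                     # three Cyrillic letters

# What a broken-invariant payload turns into: pure ASCII backslash escapes.
as_escape = text.encode("raw-unicode-escape")

# What should actually be inspected: the text in the output charset.
as_utf8 = text.encode("utf-8")
```

Here looks_7bit(as_escape) is true while looks_7bit(as_utf8) is false, which is the mismatch that leads to a 7bit CTE on a message whose wire bytes are 8bit.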

The fix is to fix the model invariant by turning a unicode string passed in to set_payload into an ascii+surrogateescape string with the escaped bytes being the unicode encoded to the output charset.
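The conversion described here can be sketched in one line of plain Python. The helper to_model_payload is hypothetical and only illustrates the idea; it is not the code in the attached patch:

```python
def to_model_payload(text: str, charset: str) -> str:
    """Hypothetical helper: unicode text -> ascii+surrogateescape string.

    The text is first encoded to the output charset, then the resulting
    bytes are smuggled into a str via the surrogateescape error handler.
    """
    return text.encode(charset).decode("ascii", "surrogateescape")


text = "\u0410\u0411\u0412"                     # three Cyrillic letters
payload = to_model_payload(text, "utf-8")

# At serialization time the escaped bytes come back out unchanged.
wire = payload.encode("ascii", "surrogateescape")
```

Pure ASCII text passes through untouched, and the escaped bytes depend on the charset, so the same text stored with "koi8-r" yields a different payload string than with "utf-8".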

Unfortunately it is also possible to call set_payload without a charset, and *then* call set_charset.  To keep from breaking the code of anyone currently doing that, I had to allow a full unicode _payload, and detect it in set_charset.
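For reference, the two calling patterns look like this with the legacy email.message.Message API. This is a sketch for comparison only; the sample text is an assumption, and the exact headers produced may differ between Python versions:

```python
from email.message import Message

text = "\u0410\u0411\u0412"                     # three Cyrillic letters

# Pattern 1: the charset is known when the payload is set.
one_step = Message()
one_step.set_payload(text, charset="utf-8")

# Pattern 2: the payload is set first, so _payload is briefly a full
# unicode string, and the charset only arrives afterwards.
two_step = Message()
two_step.set_payload(text)
two_step.set_charset("utf-8")

# Either way, the decoded payload should be the text in the output charset.
decoded_one = one_step.get_payload(decode=True)
decoded_two = two_step.get_payload(decode=True)
```

The second pattern is the one that forces set_charset to cope with a payload that is not yet in ascii+surrogateescape form.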

My plan is to fix that in 3.4.  This causes a backward compatibility break, because it will no longer be possible to call set_payload with a unicode string containing non-ascii if you don't also provide a character set.  I believe this is an acceptable break, since otherwise you *must* leave the model in an ambiguous state, and you have the possibility of "leaking" unicode characters out into your wire-format message, which would ultimately result in either an exception at serialization time or, worse, mojibake.

Patch attached.

----------
stage:  -> patch review
type:  -> behavior
versions:  -Python 3.2
Added file: http://bugs.python.org/file32730/support_8bit_charset_cte.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue19063>
_______________________________________

