Flatten an email Message with a non-ASCII body using 8bit CTE

W. Trevor King wking at tremily.us
Thu Jan 24 02:05:33 EST 2013


Hello list!

I'm trying to figure out how to flatten a MIMEText message to bytes
using an 8bit Content-Transfer-Encoding in Python 3.3.  Here's what
I've tried so far:

  # -*- encoding: utf-8 -*-
  import email.encoders
  from email.charset import Charset
  from email.generator import BytesGenerator
  from email.mime.text import MIMEText
  import sys

  body = 'Ζεύς'
  encoding = 'utf-8'
  charset = Charset(encoding)
  charset.body_encoding = email.encoders.encode_7or8bit

  message = MIMEText(body, 'plain', encoding)
  del message['Content-Transfer-Encoding']
  message.set_payload(body, charset)
  try:
      BytesGenerator(sys.stdout.buffer).flatten(message)
  except UnicodeEncodeError as e:
      print('error with string input:')
      print(e)

  message = MIMEText(body, 'plain', encoding)
  del message['Content-Transfer-Encoding']
  message.set_payload(body.encode(encoding), charset)
  try:
      BytesGenerator(sys.stdout.buffer).flatten(message)
  except TypeError as e:
      print('error with byte input:')
      print(e)

The `del m[…]; m.set_payload()` bits work around #16324 [1] and should
be orthogonal to the encoding issues.  It's possible that #12553 is
trying to address this issue [2,3], but that issue's comments are a
bit vague, so I'm not sure.

The problem with the string payload is that
email.generator.BytesGenerator.write receives the payload as an
unencoded Unicode string and tries to encode it as ASCII, which fails
on the non-ASCII characters.  It may be possible to work around this
by first encoding the payload with the body charset and then mapping
every byte outside the 7bit range to a surrogate escape, but I'm not
sure how to do that.
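
A minimal sketch of that surrogate-escape idea, assuming that
BytesGenerator writes a str payload containing lone surrogates back
out with the 'surrogateescape' error handler.  I have not checked this
against 3.3 specifically, and the `escaped`, `buffer` and `flattened`
names are only for illustration:

```python
import io
from email.generator import BytesGenerator
from email.mime.text import MIMEText

body = 'Ζεύς'
encoding = 'utf-8'

# Encode with the body charset, then carry the non-ASCII bytes inside a
# str as lone surrogates (the 'surrogateescape' error handler).
escaped = body.encode(encoding).decode('ascii', 'surrogateescape')

message = MIMEText(body, 'plain', encoding)
del message['Content-Transfer-Encoding']  # drop the CTE MIMEText picked
message.set_payload(escaped)              # no charset: stored unchanged
message['Content-Transfer-Encoding'] = '8bit'

# Assumption: the generator turns the surrogates back into raw bytes.
buffer = io.BytesIO()
BytesGenerator(buffer).flatten(message)
flattened = buffer.getvalue()
```

The Content-Type header still carries charset="utf-8" from the
MIMEText constructor, so only the transfer encoding header has to be
set by hand here.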

The problem with the byte payload is that _has_surrogates (used in
email.generator.Generator._handle_text and
BytesGenerator._handle_text) chokes on byte input:

  TypeError: can't use a string pattern on a bytes-like object
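
The same failure can be reproduced outside the email package.  This is
only a sketch: the pattern below is a stand-in I made up for
illustration, and the real pattern behind _has_surrogates may differ.

```python
import re

# Hypothetical stand-in for the str pattern behind _has_surrogates.
has_surrogates = re.compile('[\udc80-\udcff]').search

# A str payload is fine, with or without escaped bytes in it.
plain_match = has_surrogates('abc')
escaped_match = has_surrogates(
    'Ζεύς'.encode('utf-8').decode('ascii', 'surrogateescape'))

# A bytes payload cannot be searched with a str pattern at all.
error = None
try:
    has_surrogates('Ζεύς'.encode('utf-8'))
except TypeError as e:
    error = e
```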

For UTF-8, you can get away with:

  message.as_string().encode(message.get_charset().get_output_charset())

because the headers are already encoded into 7 bits, and UTF-8 encodes
7bit characters to the same bytes as ASCII.  However, if the body
charset is UTF-16-LE or any other encoding that maps 7bit characters
to different bytes, this hack breaks down because the headers get
re-encoded along with the body.
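
To illustrate the difference, here is a small sketch that encodes a
sample header line (the `header` string is just an example, not taken
from real output) with both codecs:

```python
header = 'Content-Type: text/plain; charset="utf-8"\n'

ascii_bytes = header.encode('ascii')
utf8_bytes = header.encode('utf-8')       # identical to the ASCII bytes
utf16_bytes = header.encode('utf-16-le')  # two bytes per character

same_as_ascii = utf8_bytes == ascii_bytes
mangled = utf16_bytes != ascii_bytes
```

With UTF-16-LE every header character gains a trailing NUL byte, so
the result is no longer a valid 7bit header block.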

Thoughts?
Trevor

[1]: http://bugs.python.org/issue16324
[2]: http://bugs.python.org/issue12553
[3]: http://bugs.python.org/issue12552#msg140294

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy