[Python-Dev] Patch making the current email package (mostly) support bytes

Wed Oct 6 05:22:18 CEST 2010

Nick Coghlan writes:

 > - if you pass in bytes data and know what you are doing, then you can
 > access that raw bytes data and do your own decoding

At what level, though?

To take an interesting example I used to see frequently:

From: taro@tokyo.jp
      (Taro Yamada in 8-bit Shift JIS)

So I guess you are suggesting that the email module can RFC 822 parse
that, and

1.  Refuse to return the unwrapped (ie, single line) form of the whole
    field, except as bytes.
2.  Refuse to return the content of the From field, except as bytes.
3.  Return the email address parsed from the From field.
4.  Refuse to return the comment, except as bytes.

That's fine.  But suppose I have a private or newly defined header
that is structured?  Now I have two choices:

1.  Write a version of my private parser for both str (the normal
    case) and bytes (if accessing the value as str raises)

2.  Always get the bytes and convert them to str (probably using the
    same .decode('ascii', 'surrogateescape') call that email uses but
    won't let me have the value of!), then use a common str parser.
    Note that this is more problematic than it looks, since the
    appropriate base codec may require information from higher-level
    structures (eg, qp codec tags or a Content-Type header's charset
    field).
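
To make choice 2 concrete, here is a minimal sketch, assuming Python 3's
'surrogateescape' error handler.  The header value and parse_params()
are hypothetical, made up for illustration; nothing here is part of the
email package.

```python
# Hypothetical private header value: ASCII structure plus 8-bit Shift JIS.
raw = b'id=42; name=' + '山田太郎'.encode('shift_jis')

# Each non-ASCII byte becomes a lone surrogate (U+DC80..U+DCFF), so the
# result is a str even though the charset is not yet known.
text = raw.decode('ascii', 'surrogateescape')


def parse_params(value):
    """Hypothetical str-only parser for 'key=value; key=value' fields."""
    params = {}
    for item in value.split(';'):
        key, _, val = item.strip().partition('=')
        params[key] = val
    return params


params = parse_params(text)

# The escaping is reversible: recover the original bytes of one parameter
# and decode them once the real charset is known from elsewhere.
name_bytes = params['name'].encode('ascii', 'surrogateescape')
name = name_bytes.decode('shift_jis')
```

The point is that one str parser handles both the clean and the dirty
case, and the real charset gets applied only after parsing.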

Why should I reproduce email's logic here?  I don't care if the
default or concise API raises on surrogates in the str value.  But I'm
pretty sure that I will want to use str values containing surrogates
in these contexts (for the same reasons that email module does, for
example), rather than work with bytes sometimes and strs sometimes.

Please provide a way to return strs-with-surrogates if I ask for them.
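
Roughly what I have in mind, as a sketch only.  The function name and
the allow_surrogates flag are hypothetical, not an existing or proposed
email API:

```python
def header_as_str(raw, allow_surrogates=False):
    """Hypothetical accessor; not an existing email package function."""
    text = raw.decode('ascii', 'surrogateescape')
    has_escaped = any('\udc80' <= ch <= '\udcff' for ch in text)
    if has_escaped and not allow_surrogates:
        # Default behaviour: refuse to hand out a str with surrogates.
        raise ValueError('non-ASCII bytes in header; access it as bytes')
    return text


clean = header_as_str(b'taro@tokyo.jp')
dirty = header_as_str(b'(\x8e\x52)', allow_surrogates=True)
```

The default stays strict, and callers who know what they are doing opt
in to the surrogate-carrying str explicitly.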