[Python-Dev] Patch making the current email package (mostly) support bytes

Tue Oct 5 07:41:12 CEST 2010

R. David Murray writes:
 > On Mon, 04 Oct 2010 12:32:26 -0400, Scott Dial <scott+python-dev at scottdial.com> wrote:
 > > On 10/2/2010 7:00 PM, R. David Murray wrote:
 > > > The clever hack (thanks ultimately to Martin) is to accept 8bit data
 > > > by decoding it using the ASCII codec and the surrogateescape error
 > > > handler.
 > > 
 > > I've seen this idea pop up in a number of threads. I worry that you are
 > > all inventing a new kind of dual that is a direct parallel to Python 2.x
 > > strings.
 > 
 > Yes, that is exactly my worry.

I don't worry about this.  Strings generated by decoding with
surrogate-escape are *different* from other strings: they contain
invalid code units (the naked surrogates).  These cannot be encoded
except by passing the surrogate-escape error handler to .encode(), and
a sane developer won't do that unless she knows precisely what she's
doing.  This is not true of Python 2 strings, where all bytes are
valid.
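
To make that concrete, here is a minimal sketch using only the
built-in codecs (the sample header is made up for illustration):

```python
raw = b"Subject: caf\xe9"   # 0xE9 is a Latin-1 byte, not valid ASCII

# Decoding with surrogateescape turns the stray byte into a lone surrogate.
text = raw.decode("ascii", errors="surrogateescape")
last_char = text[-1]        # U+DCE9, i.e. U+DC00 + 0xE9

# An ordinary encode refuses the lone surrogate ...
try:
    text.encode("utf-8")
    strict_encode_worked = True
except UnicodeEncodeError:
    strict_encode_worked = False

# ... and only an explicit surrogateescape gives the original bytes back.
round_trip = text.encode("ascii", errors="surrogateescape")
```

The string can't leak out to a file or socket by accident; somebody has
to ask for surrogateescape on the way out.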

 > > Any reasonable 2.x code has to guard on str/unicode and it would seem in
 > > 3.x, if this idiom spreads, reasonable code will have to guard on
 > > surrogate escapes (which actually seems like a more expensive test).
 > 
 > Right, I mentioned that concern in my post.

Again, I don't worry about this.  It is *not* an *extra* cost.  Those
messages are *already broken*; they *will* crash the email module if
you fail to guard against them.  Decoding them to surrogates actually
makes it easier to guard, because you know that even if broken
encodings are present, the parser will still work.  Broken encodings
can no longer crash the parser.  That is a Very Good Thing IMHO.
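
As a rough illustration (a sketch only, assuming the
email.message_from_bytes() entry point that ships with Python 3.2 and
later; the message itself is made up), a header carrying a stray 8-bit
byte does not stop the parse, and the well-formed parts are still
available:

```python
import email

raw = (b"From: a@example.com\n"
       b"Subject: caf\xe9\n"    # raw Latin-1 byte, no RFC 2047 encoding
       b"\n"
       b"hello\n")

# Parsing completes even though the Subject header is broken.
msg = email.message_from_bytes(raw)
sender = msg["From"]
body = msg.get_payload()
```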

 > Only if the email package contains a coding error would the
 > surrogates escape and cause problems for user code.

I don't think it is reasonable to internalize surrogates that way;
some applications *will* want to look at them and do something useful
with them (delete them or replace them with U+FFFD or ...).  However,
I argue below that the presence of surrogates already means the user
code is under fire, and this puts the problem in a canonical form so
the user code can prepare for it (if that is desirable).
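
Something along these lines is what I have in mind for the user code.
This is only a sketch; the helper names (is_escaped_byte,
has_escaped_bytes, scrub) are hypothetical and not part of the email
package:

```python
def is_escaped_byte(ch):
    # surrogateescape maps bytes 0x80..0xFF to U+DC80..U+DCFF.
    return "\udc80" <= ch <= "\udcff"

def has_escaped_bytes(text):
    """Return True if text carries bytes smuggled in as lone surrogates."""
    return any(is_escaped_byte(ch) for ch in text)

def scrub(text, replacement="\ufffd"):
    """Replace each escaped byte; pass replacement="" to delete them."""
    return "".join(replacement if is_escaped_byte(ch) else ch for ch in text)

header = b"caf\xe9".decode("ascii", errors="surrogateescape")
```

Here has_escaped_bytes(header) is true, scrub(header) puts U+FFFD in
place of the bad byte, and scrub(header, "") just drops it.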

 > > It seems like this hack is about making the 3.x unicode type more like
 > > the 2.x string type,

Not at all.  It's about letting the parser be a parser, and letting
the application handle broken content, or discard it, or whatever.
Modularity is improved.  This has been a major PITA for Mailman
support over the years: every time the spammers and virus writers come
up with a new idea, there's a chance it will leak out and the email
parser will explode, stopping the show.  These kinds of errors are a
FAQ on the Mailman lists (although much less so in recent years).

 > > How will developers not have to ask themselves whether a given
 > > string is a "real" string or a byte sequence masquerading as a
 > > string? Am I missing something here?

There are two things to say, actually.  First, you're in a war zone.
*All* email is byte sequences masquerading as text, and if you're not
wearing armor, you're going to get burned.  The idea here is to have
the email package provide the armor and enough instrumentation so you
can do bomb detection yourself (or perhaps just let it blow, if you're
hacking up a quick and dirty script).

Second, there are developers who will not care whether strings are
"real" or "byte sequences in drag", because they're writing MTAs and
the like.  Those people get really upset, and rightly so, when the
parser pukes on broken headers; it is not their app's job at all to
deal with that breakage.
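
For that kind of app, a pass-through might look roughly like this.  It
is a toy sketch, not how any real MTA works; relay() and the
X-Relayed-By header are invented for the example:

```python
def relay(raw_message):
    """Add a trace header and pass everything else through untouched."""
    # Broken bytes become lone surrogates instead of raising.
    text = raw_message.decode("ascii", errors="surrogateescape")
    headers, sep, body = text.partition("\n\n")
    headers = "X-Relayed-By: example-mta\n" + headers
    # Encoding with the same handler restores the original bytes exactly.
    return (headers + sep + body).encode("ascii", errors="surrogateescape")

raw = b"Subject: caf\xe9\n\nbody \xff\n"
relayed = relay(raw)
```

The broken bytes come out exactly as they went in, and the app never
had to look at them.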

 > I think this question is something that needs to be considered any
 > time using surrogates is proposed.

I don't agree.  The presence of naked surrogates is *always* (assuming
sane programmers) an indication of invalid input.  The question is,
should the parser signal invalidity, or should it allow the
application to decide?  The email module *doesn't have enough
information to decide* whether the invalid input is a "real" problem,
or how to handle it (cf. the example of an MTA app).  Note that a
completely naive app doesn't care -- it will crash either way because
it doesn't handle the exception, whether it's raised by the parser or
by a codec when the app tries to do I/O.  A robust app *does* care: if
the parser raises, then the app must provide an alternative parser
good enough to find and fix the invalid bytes.  Clearly it's much
better to pass invalid (but fully parsed) text back to the app in this
case.

Note that if the app really wants the parser to raise rather than pass
on the input, that should be easy to implement at fairly low cost; you
just make the error handler a parameter rather than hardcoding
surrogate-escape.
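
In sketch form (decode_header_bytes is a hypothetical stand-in for
wherever the parser does its decoding, not an actual function in the
email package):

```python
def decode_header_bytes(raw, errors="surrogateescape"):
    """Decode raw header bytes; the caller picks the error handler."""
    return raw.decode("ascii", errors=errors)

raw = b"Subject: caf\xe9"

# Default: invalid bytes are passed through as lone surrogates.
lenient = decode_header_bytes(raw)

# An app that wants the parser to raise asks for "strict" instead.
try:
    decode_header_bytes(raw, errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```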