[Python-Dev] email package status in 3.X

Mon Jun 21 22:28:39 CEST 2010

On Mon, Jun 21, 2010 at 02:46:57PM -0400, P.J. Eby wrote:
> At 02:58 AM 6/22/2010 +0900, Stephen J. Turnbull wrote:
> >Nick alluded to the The One Obvious Way as a change in architecture.
> >
> >Specifically: Decode all bytes to typed objects (str, images, audio,
> >structured objects) at input.  Do no manipulations on bytes ever
> >except decode and encode (both to text, and to special-purpose objects
> >such as images) in a program that does I/O.
> 
> This ignores the existence of use cases where what you have is text
> that can't be properly encoded in unicode.  I know, it's a hard thing
> to wrap one's head around, since on the surface it sounds like
> unicode is the programmer's savior.  Unfortunately, real-world text
> data exists which cannot be safely roundtripped to unicode, and must
> be handled in "bytes with encoding" form for certain operations.
> 
> I personally do not have to deal with this *particular* use case any
> more -- I haven't been at NTT/Verio for six years now.  But I do know
> it exists for e.g. Asian language email handling, which is where I
> first encountered it.  At the time (this *may* have changed), many
> popular email clients did not actually support unicode, so you
> couldn't necessarily just send off an email in UTF-8.  It drove us
> nuts on the project where this was involved (an i18n of an existing
> Python app), and I think we had to compromise a bit in some fashion
> (because we couldn't really avoid unicode roundtripping due to
> database issues), but the use case does actually exist.
> 
> My current needs are simpler, thank goodness.  ;-)  However, they
> *do* involve situations where I'm dealing with *other*
> encoding-restricted legacy systems, such as software for interfacing
> with the US Postal Service that only works with a restricted subset
> of latin1, while receiving mangled ASCII from an ecommerce provider,
> and storing things in what's effectively a latin-1 database.  Being
> able to easily assert what kind of bytes I've got would actually let
> me catch errors sooner, *if* those assertions were being checked when
> different kinds of strings or bytes were being combined.  i.e., at
> coercion time).
> 
While it's certainly possible that you have a grapheme that has no
corresponding unicode codepoint, it doesn't sound like this is the case
you're dealing with here.  You talk about "restricted subset of latin1"
but all of latin1's graphemes have unicode codepoints.  You also talk about
not being able to "send off an email in UTF-8" but UTF-8 is an encoding of
unicode, not unicode itself.  Similarly, the statement that some email
clients don't support unicode isn't very clear as to actual problem.  The
email client supports displaying graphemes using glyphs present on the
computer.  As long as the graphemes needed have a unicode codepoint, using
unicode inside of your application and then encoding to bytes on the way out
works fine.

Even in cases where there's no unicode codepoint for the grapheme that
you're receiving unicode gives you a way out.  It provides you a private use
area where you can map the graphemes to unused codepoints.  Your
application keeps a mapping from that codepoint to the particular byte
sequence that you want.  Then write you a codec that converts from unicode w/
these private codepoints into your particular encoding (and from bytes into
unicode).

-Toshio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20100621/16322d58/attachment.pgp>