[Python-Dev] email package status in 3.X

P.J. Eby pje at telecommunity.com
Mon Jun 21 22:09:52 CEST 2010


At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
>On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
> > At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
> > >What do you think of making the encoding attribute a mandatory part of
> > >creating an ebyte object?  (ex: ``eb = ebytes(b, 'euc-jp')``).
> >
> > As long as the coercion rules force str+ebytes (or str % ebytes,
> > ebytes % str, etc.) to result in another ebytes (and fail if the str
> > can't be encoded in the ebytes' encoding), I'm personally fine with
> > it, although I really like the idea of tacking the encoding onto bytes
> > objects in the first place.
> >
>I wouldn't like this.  It brings us back to the Python 2 problem where
>sometimes you pass an ebytes into a function and it works, and other times
>you pass one in and it raises a traceback.

For stdlib functions, this isn't going to happen unless your ebytes' 
encoding is not compatible with the ascii subset of unicode, or the 
stdlib function is working with dynamic data...  in which case you 
really *do* want to fail early!
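A minimal sketch of the coercion rule under discussion, assuming
``ebytes`` validates its bytes against the declared encoding at
construction time (the type is hypothetical; nothing like it exists in
the stdlib, and the names here are made up for illustration):

```python
# Hypothetical sketch of the proposed ebytes type: bytes plus a declared
# encoding, where str + ebytes coerces back to ebytes and fails early if
# the str cannot be encoded in that encoding.
class ebytes:
    def __init__(self, b, encoding):
        b.decode(encoding)          # fail at construction if b is illegal
        self._b = bytes(b)
        self.encoding = encoding

    def _coerce(self, other):
        if isinstance(other, ebytes):
            return other._b
        if isinstance(other, str):
            # May raise UnicodeEncodeError -- the early failure the
            # proposal wants, instead of an error at the output boundary.
            return other.encode(self.encoding)
        return other

    def __add__(self, other):
        return ebytes(self._b + self._coerce(other), self.encoding)

    def __radd__(self, other):
        return ebytes(self._coerce(other) + self._b, self.encoding)

    def __bytes__(self):
        return self._b


eb = ebytes("日本".encode("euc-jp"), "euc-jp")
out = eb + " desu"                  # ascii-safe str: result is ebytes again
```

Mixing in a str that is outside the declared encoding (say, a kanji
with a latin-1 ebytes) raises at the mixing point, not at output time.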

I don't see this as a repeat of the 2.x situation; rather, it allows 
you to cause errors to happen much *earlier* than they would 
otherwise show up if you were using unicode for your encoded-bytes data.

For example, if your program's intent is to end up with latin-1 
output, then it would be better for an error to show up at the very 
*first* point where non-latin1 characters are mixed with your data, 
rather than only showing up at the output boundary!
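That late failure is easy to reproduce with plain str today: mixing in
a non-latin1 character succeeds silently, and nothing complains until
the combined text finally hits ``encode`` at the output boundary.

```python
# With unicode strings, mixing in non-latin1 data succeeds silently...
data = "price: " + "\u20ac100"      # U+20AC (euro sign) is not in latin-1

# ...and the problem only surfaces much later, when the accumulated
# text is finally encoded for output:
try:
    encoded = data.encode("latin-1")
except UnicodeEncodeError:
    encoded = None                  # by now the bad mix is far from its source
```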

However, if you promoted mixed-type operation results to unicode 
instead of ebytes, then you:

1) can't preserve data that doesn't have a 1:1 mapping to unicode, and

2) can't detect an error until your data reaches the output point in 
your application -- forcing you either to defensively insert ebytes 
calls everywhere (vs. simply wrapping them around a handful of 
designated inputs), or to go right back to tracing down where the 
unusable data showed up in the first place.
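Point (1) is visible with today's tools: some byte sequences have no
clean unicode form at all, which is why Python 3 grew the
``surrogateescape`` error handler (PEP 383) for smuggling such bytes
through str. A small demonstration:

```python
raw = b"abc\xff"                    # 0xff is not valid UTF-8

# Promoting this data to unicode strictly fails outright...
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass

# ...while surrogateescape preserves the stray byte as a lone surrogate,
# which round-trips back to the original bytes but is not genuine text:
text = raw.decode("utf-8", "surrogateescape")
assert text.encode("utf-8", "surrogateescape") == raw
```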

One thing that seems like a bit of a blind spot for some folks is 
that having unicode is *not* everybody's goal.  Not because we don't 
believe unicode is generally a good thing or anything like that, but 
because we have to work with systems that flat out don't *do* 
unicode, thereby making the presence of (fully-general) unicode an 
error condition that has to be stamped out!

IOW, if you're producing output that has to go into another system 
that doesn't take unicode, it doesn't matter how 
theoretically-correct it would be for your app to process the data in 
unicode form.  In that case, unicode is not a feature: it's a bug.

And as it really *is* an error in that case, it should not pass 
silently, unless explicitly silenced.


>So, what's the advantage of using ebytes instead of bytes?
>
>* It keeps together the text and encoding information when you're taking
>   bytes in and want to give bytes back under the same encoding.
>* It takes some of the boilerplate that people are supposed to do (checking
>   that bytes are legal in a specific encoding) and writes it into the
>   initialization of the object.  That forces you to think about the issue
>   at two points in the code:  when converting into ebytes and when
>   converting out to bytes.  For data that's going to be used with both
>   str and bytes, this is the accepted best practice.  (For the exceptions,
>   the plain bytes type remains, which you can convert when you need to.)

Hm.  For the output case, I suppose that means you might also want 
the text I/O wrappers to be strict about the ebytes' encoding.
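The existing text I/O layer already behaves this way for str:
``io.TextIOWrapper`` defaults to ``errors='strict'``, so data outside
the target encoding is rejected at the wrapper rather than passed
through. Presumably an ebytes-aware wrapper would do the same for a
mismatched ebytes encoding.

```python
import io

buf = io.BytesIO()
w = io.TextIOWrapper(buf, encoding="latin-1")   # errors='strict' by default

w.write("plain ascii is fine")
try:
    w.write("\u20ac")               # euro sign: unrepresentable in latin-1
    w.flush()
    strict = False
except UnicodeEncodeError:
    strict = True                   # rejected at the wrapper, not downstream

w.flush()                           # the earlier, valid write still lands
```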


