[Python-Dev] email package status in 3.X
P.J. Eby
pje at telecommunity.com
Mon Jun 21 22:09:52 CEST 2010
At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
>On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
> > At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
> > >What do you think of making the encoding attribute a mandatory part of
> > >creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``).
> >
> > As long as the coercion rules force str+ebytes (or str % ebytes,
> > ebytes % str, etc.) to result in another ebytes (and fail if the str
> > can't be encoded in the ebytes' encoding), I'm personally fine with
> > it, although I really like the idea of tacking the encoding onto bytes
> > objects in the first place.
> >
>I wouldn't like this.  It brings us back to the Python 2 problem, where
>sometimes you pass an ebytes into a function and it works, and other times
>you pass one in and it raises an exception.
For stdlib functions, this isn't going to happen unless your ebytes'
encoding is not compatible with the ASCII subset of unicode, or the
stdlib function is working with dynamic data... in which case you
really *do* want to fail early!
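A minimal sketch of the coercion rules under discussion (the `ebytes` type and its API are hypothetical -- no such type exists in the stdlib):

```python
# Hypothetical ebytes: bytes that carry their encoding, with coercion
# rules forcing str+ebytes to yield another ebytes (sketch only).
class ebytes:
    def __init__(self, data, encoding):
        data.decode(encoding)  # fail early if bytes aren't legal in encoding
        self.data = data
        self.encoding = encoding

    def _coerce(self, other):
        # A str coerces to this ebytes' encoding; raises UnicodeEncodeError
        # if the str can't be encoded in it.
        if isinstance(other, str):
            return other.encode(self.encoding)
        if isinstance(other, ebytes) and other.encoding == self.encoding:
            return other.data
        return None

    def __add__(self, other):
        data = self._coerce(other)
        if data is None:
            return NotImplemented
        return ebytes(self.data + data, self.encoding)

    def __radd__(self, other):
        data = self._coerce(other)
        if data is None:
            return NotImplemented
        return ebytes(data + self.data, self.encoding)


eb = ebytes(b"\xa4\xb3", "euc-jp")   # legal euc-jp bytes
r = "id: " + eb                      # str + ebytes -> another ebytes
assert isinstance(r, ebytes) and r.encoding == "euc-jp"

try:
    eb + "\U0001f600"                # emoji isn't encodable in euc-jp
except UnicodeEncodeError:
    print("fails at the mixing point, not at output time")
```

The key design point is that the result of a mixed operation stays an ebytes, so the encoding constraint propagates through the program instead of being silently widened to unicode.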
I don't see this as a repeat of the 2.x situation; rather, it allows
you to cause errors to happen much *earlier* than they would
otherwise show up if you were using unicode for your encoded-bytes data.
For example, if your program's intent is to end up with latin-1
output, then it would be better for an error to show up at the very
*first* point where non-latin1 characters are mixed with your data,
rather than only showing up at the output boundary!
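To make the contrast concrete, here is a stdlib-only sketch of the status quo with plain str: the bad mix succeeds silently, and the error only surfaces at the output boundary.

```python
data = "caf\xe9"              # still encodable in latin-1
mixed = data + "\U0001f600"   # non-latin-1 character mixes in silently

# Much later, at the output boundary, the failure finally surfaces:
try:
    mixed.encode("latin-1")
except UnicodeEncodeError:
    print("error shows up far from where the bad data entered")

# An eager encode check at the mixing point fails immediately,
# which is what ebytes-style coercion would give you for free:
try:
    "\U0001f600".encode("latin-1")
except UnicodeEncodeError:
    print("caught at the first mixing point instead")
```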
However, if you promoted mixed-type operation results to unicode
instead of ebytes, then you:
1) can't preserve data that doesn't have a 1:1 mapping to unicode, and
2) can't detect an error until your data reaches the output point in
your application -- forcing you to defensively insert ebytes calls
everywhere (vs. simply wrapping them around a handful of designated
inputs), or else have to go right back to tracing down where the
unusable data showed up in the first place.
One thing that seems like a bit of a blind spot for some folks is
that having unicode is *not* everybody's goal. Not because we don't
believe unicode is generally a good thing or anything like that, but
because we have to work with systems that flat out don't *do*
unicode, thereby making the presence of (fully-general) unicode an
error condition that has to be stamped out!
IOW, if you're producing output that has to go into another system
that doesn't take unicode, it doesn't matter how
theoretically-correct it would be for your app to process the data in
unicode form. In that case, unicode is not a feature: it's a bug.
And as it really *is* an error in that case, it should not pass
silently, unless explicitly silenced.
>So, what's the advantage of using ebytes instead of bytes?
>
>* It keeps together the text and encoding information when you're taking
> bytes in and want to give bytes back under the same encoding.
>* It takes some of the boilerplate that people are supposed to do (checking
> that bytes are legal in a specific encoding) and writes it into the
> initialization of the object. That forces you to think about the issue
> at two points in the code: when converting into ebytes and when
> converting out to bytes. For data that's going to be used with both
> str and bytes, this is the accepted best practice. (For exceptions, the
> plain bytes type remains, which you can convert to and from when you want to.)
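The bytes-in, bytes-back-under-the-same-encoding round trip described in the quoted points looks like this today, with the legality check written by hand (a stdlib-only sketch; the function name is illustrative):

```python
def process(raw: bytes, encoding: str) -> bytes:
    # The boilerplate ebytes would absorb: verify the bytes are
    # legal in the declared encoding at the input boundary...
    text = raw.decode(encoding)      # UnicodeDecodeError if they aren't
    text = text.upper()              # ...work with the data as text...
    # ...and hand back bytes under the same encoding at the output boundary.
    return text.encode(encoding)

print(process(b"hello", "euc-jp"))   # b'HELLO'
```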
Hm. For the output case, I suppose that means you might also want
the text I/O wrappers to be able to be strict about ebytes' encoding.
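For reference, the stdlib's text I/O wrapper can already be strict about a declared encoding; an ebytes-aware wrapper would presumably behave the same way (stdlib-only sketch):

```python
import io

buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="latin-1", errors="strict")

out.write("caf\xe9")                 # encodable in latin-1: fine
out.flush()
assert buf.getvalue() == b"caf\xe9"

try:
    out.write("\U0001f600")          # not representable in latin-1
    out.flush()
except UnicodeEncodeError:
    print("strict wrapper refuses unencodable output")
```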