[Python-Dev] bytes / unicode

Mon Jun 21 19:29:27 CEST 2010

On Mon, Jun 21, 2010 at 9:46 AM, P.J. Eby <pje at telecommunity.com> wrote:
> At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote:
>>
>> It may be that there are places where we need to rewrite standard
>> library algorithms to be bytes/str neutral (e.g. by using length one
>> slices instead of indexing). It may be that there are more APIs that
>> need to grow "encoding" keyword arguments that they then pass on to
>> the functions they call or use to convert str arguments to bytes (or
>> vice-versa). But without people trying to port affected libraries and
>> reporting bugs when they find issues, the situation isn't going to
>> improve.
>>
>> Now, if these bugs are already being reported against 3.1 and just
>> aren't getting fixed, that's a completely different story...
>
> The overall impression, though, is that this isn't really a step forward.
>  Now, bytes are the special case instead of unicode, but that special case
> isn't actually handled any better by the stdlib - in fact, it's arguably
> worse.  And, the burden of addressing this seems to have been shifted from
> the people who made the change, to the people who are going to use it.  But
> those people are not necessarily in a position to tell you anything more
> than, "give me something that works with bytes".
>
> What I can tell you is that before, since string constants in the stdlib
> were ascii bytes, and transparently promoted to unicode, stdlib behavior was
> *predictable* in the presence of special cases: you got back either bytes or
> unicode, but either way, you could idempotently upgrade the result to
> unicode, or just pass it on.  APIs were "str safe, unicode aware".  If you
> passed in bytes, you weren't going to get unicode without a warning, and if
> you passed in unicode, it'd work and you'd get unicode back.

Actually, the big problem with Python 2 is that if you mix str and
unicode, things work or crash depending on whether any of the str
objects involved contain non-ASCII bytes.

If one API decides to upgrade to Unicode, the result, when passed to
another API, may well cause a UnicodeError because not all arguments
have had the same treatment.
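
A minimal sketch of that failure mode, written for Python 3 so it can
be run today. The helper names are hypothetical; py2_style_concat only
emulates Python 2's implicit ASCII decoding and is not an actual
Python 2 API.

```python
def py2_style_concat(text, raw):
    """Emulate Python 2's implicit coercion when mixing unicode and str:
    the byte string is decoded as ASCII before concatenation."""
    return text + raw.decode("ascii")


def outcome(text, raw):
    """Return the result, or the name of the exception that was raised."""
    try:
        return py2_style_concat(text, raw)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# Pure-ASCII bytes: the mix silently works.
ascii_case = outcome("name: ", b"Guido")

# Non-ASCII bytes: the very same call crashes.
non_ascii_case = outcome("name: ", "caf\u00e9".encode("utf-8"))


def py3_mix_is_rejected(text, raw):
    """In Python 3 the mix fails regardless of the data involved."""
    try:
        text + raw
    except TypeError:
        return True
    return False
```

The point of the contrast: in the emulated Python 2 behavior the
outcome depends on the data, while in Python 3 it depends only on the
types.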

> Now, the APIs are neither safe nor aware -- if you pass bytes in, you get
> unpredictable results back.

This seems like an overgeneralization of a particular bug. There are APIs
that are strictly text-in, text-out. There are others that are
bytes-in, bytes-out. Let's call all those *pure*. For some operations
it makes sense that the API is *polymorphic*, by which I mean that
text-in causes text-out, and bytes-in causes bytes-out. All of these
are fine.
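
For illustration, a few stdlib calls that fall into these categories
in Python 3. The choice of examples is an assumption of this sketch,
not a list taken from the discussion.

```python
import posixpath
import re
import unicodedata
import zlib

# Pure text-in, text-out: Unicode normalization only accepts str.
composed = unicodedata.normalize("NFC", "e\u0301")

# Pure bytes-in, bytes-out: compression only accepts bytes-like data.
round_trip = zlib.decompress(zlib.compress(b"payload"))

# Polymorphic: the result type follows the argument type.
text_name = posixpath.basename("/tmp/report.txt")
bytes_name = posixpath.basename(b"/tmp/report.txt")

# re is polymorphic too, as long as pattern and subject agree in type.
text_words = re.findall(r"\w+", "spam eggs")
bytes_words = re.findall(rb"\w+", b"spam eggs")
```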

Perhaps there are more situations where a polymorphic API would be
helpful. Such APIs are not always so easy to implement, because they
have to be careful with literals or other constants (and even more so
mutable state) used internally -- but it can be done, and there are
plenty of examples in the stdlib.
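
A small sketch of what "being careful with literals" can look like.
The function ensure_trailing_slash is hypothetical; it picks its
separator literal to match the argument type and uses a length-one
slice instead of indexing, since indexing bytes yields an int.

```python
def ensure_trailing_slash(path):
    """Polymorphic helper: str in gives str out, bytes in gives bytes out."""
    # The internal literal has to match the type of the argument.
    sep = b"/" if isinstance(path, bytes) else "/"
    # path[-1:] is a length-one (or empty) slice of the same type as
    # path, so the comparison works for both str and bytes.
    if path[-1:] != sep:
        path = path + sep
    return path


# Why slicing rather than indexing: the two types disagree on indexing.
indexed_text = "/"[0]     # a length-one str
indexed_bytes = b"/"[0]   # an int
```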

The real problem apparently lies in APIs (I believe there are only a
few, and they are rare) that are text-or-bytes-in and always-text-out
(or always-bytes-out). Let's call them *hybrid*. Clearly, mixing hybrid
APIs in a stream of pure or polymorphic API calls is a problem,
because they turn a pure or polymorphic overall operation into a
hybrid one.
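
To make the problem concrete, here is a sketch with two hypothetical
helpers: hybrid_basename accepts either type but always returns text,
and add_suffix is polymorphic. Neither is a real stdlib function, and
the UTF-8 decoding inside the hybrid is an assumption of the example.

```python
def hybrid_basename(path):
    """Hybrid: accepts str or bytes, but always returns str."""
    if isinstance(path, bytes):
        path = path.decode("utf-8")
    return path.rsplit("/", 1)[-1]


def add_suffix(name):
    """Polymorphic: the result type follows the argument type."""
    suffix = b".bak" if isinstance(name, bytes) else ".bak"
    return name + suffix


raw_path = b"/tmp/data"

# The caller started with bytes, but the hybrid step switched to text,
# so the whole chain now yields str.
result = add_suffix(hybrid_basename(raw_path))

# Combining the result with the original bytes now fails.
try:
    combined = b"/backup/" + result
except TypeError:
    combined = None
```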

There are also text-in, bytes-out or bytes-in, text-out APIs that are
intended for encoding/decoding of course, but these are in a totally
different class.

Abstractly, it would be good if there were as few hybrid APIs as
possible, many pure or polymorphic APIs (which of the two a
particular API should be is a pragmatic choice), and a limited number of
encoding/decoding APIs, which should generally be invoked at the edges
of the program (e.g., I/O).
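
A sketch of "decode and encode at the edges", using in-memory byte
streams to stand in for real I/O. The function names and the UTF-8
default are assumptions made for the example.

```python
import io


def shout(text):
    """Pure text-in, text-out core logic; it never sees bytes."""
    return text.upper()


def process(stream_in, stream_out, encoding="utf-8"):
    raw = stream_in.read()               # bytes arrive from I/O
    text = raw.decode(encoding)          # decode once, at the input edge
    result = shout(text)                 # everything in between is text
    stream_out.write(result.encode(encoding))  # encode at the output edge


source = io.BytesIO("h\u00e9llo".encode("utf-8"))
sink = io.BytesIO()
process(source, sink)
output_text = sink.getvalue().decode("utf-8")
```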

> Ironically, it almost *would* have been better if bytes simply didn't work
> as strings at all, *ever*, but if you could wrap them with a bstr() to
> *treat* them as text.  You could still have restrictions on combining them,
> as long as it was a restriction on the unicode you mixed with them.  That
> is, if you could combine a bstr and a str if the *str* was restricted to
> ASCII.

ISTR that we considered something like this and decided to stay away
from it. At this point I think that a successful 3rd party bstr
implementation would be required before we rush to add one to the
stdlib.
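
For readers trying to picture such a third-party type, here is one
possible reading of the quoted proposal, and nothing more than that.
The class name BStr and all of its behavior are assumptions of this
sketch: combining with str is allowed only when the str is pure ASCII,
and the result is always coerced toward bytes.

```python
class BStr(bytes):
    """Hypothetical bytes wrapper that can be combined with ASCII-only str."""

    @staticmethod
    def _to_bytes(other):
        if isinstance(other, str):
            # Raises UnicodeEncodeError if the str is not pure ASCII.
            return other.encode("ascii")
        if isinstance(other, (bytes, bytearray)):
            return bytes(other)
        return None

    def __add__(self, other):
        coerced = self._to_bytes(other)
        if coerced is None:
            return NotImplemented
        return BStr(bytes(self) + coerced)

    def __radd__(self, other):
        coerced = self._to_bytes(other)
        if coerced is None:
            return NotImplemented
        return BStr(coerced + bytes(self))


request_line = BStr(b"GET ") + "/index.html"
prefixed = "HTTP " + BStr(b"200")

try:
    BStr(b"GET ") + "/caf\u00e9"
except UnicodeEncodeError:
    non_ascii_rejected = True
else:
    non_ascii_rejected = False
```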

> If we had the Python 3 design discussions to do over again, I think I would
> now have stuck with the position of not letting bytes be string-compatible
> at all,

They aren't, unless you consider the presence of some methods with
similar behavior (.lower(), .split() and so on) and the existence of
some polymorphic APIs (see above) as "compatibility".
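
A quick illustration of the distinction in Python 3: bytes share some
method names with str, yet the two types do not mix.

```python
# Methods with similar behavior exist on bytes and return bytes.
lowered = b"Hello World".lower()
parts = b"a b c".split()

# Indexing already differs: a bytes element is an int, not a
# length-one bytes object.
first = b"abc"[0]

# Mixing the two types is a TypeError.
try:
    b"abc" + "def"
except TypeError:
    concat_allowed = False
else:
    concat_allowed = True

# The same holds for method arguments.
try:
    b"abc".startswith("a")
except TypeError:
    str_argument_allowed = False
else:
    str_argument_allowed = True
```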

> and instead proposed an explicit bstr() wrapper/adapter to use them
> as strings, that would (in that case) force coercion in the direction of
> bytes rather than strings.  (And bstr need not have been a builtin - it
> could have been something you import, to help discourage casual usage.)

I'm still unclear on exactly what bstr is supposed to be, but it sounds
a bit like one of the rejected proposals for having a single
(Unicode-capable) str type that is implemented using different width
encodings (Latin-1, UCS-2, UCS-4) underneath.

> Might this approach lead to some people doing things wrong in the case of
> porting?  Sure.  But there'd be little reason to use it in new code that
> didn't have a real need for bytestring manipulation.
>
> It might've been a better balance between practicality and purity, in that
> it keeps the language pure, while offering a practical way to deal with
> things in bytes if you really need to.  And, bytes wouldn't silently succeed
> *some* of the time, leading to a trap.  An easy inconsistency is worse than
> a bit of uniform chicken-waving.

I still believe that the instances of bytes silently
succeeding *some* of the time refer to specific bugs in specific
APIs, either intentional because of misguided compatibility desires,
or accidental in the haste of trying to convert the entire stdlib to
Python 3 in a finite time.

> Is it too late to make that tradeoff?  Probably.  Certainly it's not
> practical to *implement* outside the language core, and removing string
> methods would fux0r anybody whose currently-ported code relies on bytes
> objects having string-like methods.
>
>

-- 
--Guido van Rossum (python.org/~guido)