[Python-Dev] bytes / unicode

Mon Jun 21 04:30:01 CEST 2010

At 11:47 PM 6/20/2010 +0200, Antoine Pitrou wrote:
>On Sun, 20 Jun 2010 14:40:56 -0400
>"P.J. Eby" <pje at telecommunity.com> wrote:
> >
> > Actually, I would say that it's more that (in the network protocol
> > case) we *have* bytes, some of which we would like to *treat* as
> > text, yet do not wish to constantly convert back and forth to
> > full-blown unicode
>
>Well, then why don't you just stick with a bytes object?

Because the stdlib is not consistent in how well it handles bytes objects.

> > While reading over this thread, I'm wondering whether at least my
> > (WSGI-related) problems in this area would be solved by the
> > availability of a type (say "bstr") that was simply a wrapper
> > providing string-like behavior over an underlying bytes, byte array,
> > or memoryview, that would produce objects of compatible type when
> > combined with strings (by encoding them to match).
>
>This really sounds horrible. Python 3 was designed precisely to
>discourage ad hoc mixing of bytes and unicode.

Who said ad hoc mixing?  The point is to have a simple way to ensure 
that my bytes don't get implicitly converted to unicode, and 
(ideally) don't have to get converted *back*, either.
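
To make the "bstr" idea quoted above concrete, here is a minimal 
sketch of what such a wrapper might look like.  Everything in it is 
hypothetical: no type named bstr exists in the stdlib, and the 
latin-1 default and the method set are assumptions chosen for 
illustration only.

```python
class bstr:
    """Hypothetical wrapper: string-like combining over underlying bytes."""

    def __init__(self, data, encoding='latin-1'):
        # Accepts bytes, bytearray or memoryview; a str raises TypeError here.
        self._data = bytes(data)
        self.encoding = encoding

    def _coerce(self, other):
        # Bring the other operand to raw bytes, encoding str to match.
        if isinstance(other, bstr):
            return other._data
        if isinstance(other, str):
            return other.encode(self.encoding)
        if isinstance(other, (bytes, bytearray, memoryview)):
            return bytes(other)
        return None

    def __add__(self, other):
        raw = self._coerce(other)
        if raw is None:
            return NotImplemented
        return bstr(self._data + raw, self.encoding)

    def __radd__(self, other):
        raw = self._coerce(other)
        if raw is None:
            return NotImplemented
        return bstr(raw + self._data, self.encoding)

    def __bytes__(self):
        return self._data

    def __repr__(self):
        return 'bstr(%r)' % self._data


path = bstr(b'/app/') + 'subdir'   # the str operand is encoded, not the bytes decoded
print(repr(path), bytes(path))
```

The result of combining with a str is still a bstr, so nothing is 
implicitly converted to unicode and nothing needs converting back.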

The idea that by passing bytes to the stdlib, I might get back either 
bytes or unicode (with the choice undocumented, inconsistent between 
different stdlib APIs, and possibly dependent on runtime conditions) 
is NOT "discouraging ad hoc mixing".

> > seems so much saner than writing *this* everywhere:
> >
> >       newurl = str(urljoin(str(base, 'latin-1'), 'subdir'), 'latin-1')
>
>urljoin already returns an str object. Why do you want to decode it
>again?

Ugh.  I meant:

    newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')

Which just goes to show how ridiculous it is to have to convert 
things to strings and back again to use APIs that ought to just 
handle bytes properly in the first place.
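
For illustration, the round trip above can be wrapped up as a small 
helper.  This is only a sketch: the name urljoin_bytes and the 
example URL are made up, and it assumes the 3.x urllib.parse module.  
latin-1 is used because it maps every byte value 0-255 to the code 
point with the same number, so decoding and re-encoding cannot lose 
or alter any byte.

```python
from urllib.parse import urljoin


def urljoin_bytes(base, url):
    """Hypothetical helper: join a bytes base URL with a str reference.

    Decodes to str for urljoin(), then encodes the result back to bytes.
    """
    return urljoin(base.decode('latin-1'), url).encode('latin-1')


newurl = urljoin_bytes(b'http://example.com/app/', 'subdir')
print(newurl)
```

Having to write (and remember to use) a helper like this for each 
such API is exactly the overhead being complained about.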

(I don't know if there are actually any problems in the case of 
urljoin; I wasn't the person who originally brought up the "stdlib 
not treating URLs as bytestrings in 3.x" issue on the 
Web-SIG.  Somewhere along the line I got the impression that urljoin 
was one such API, but in researching the issue it looks like maybe 
the canonical example was parse_qsl.)

It's possible that the stdlib situation has improved tremendously 
since then, of course.  I don't know whether that bug was ever 
reported, or how many like it remain.
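
One rough way to find out on a given interpreter is to hand bytes to 
the APIs in question and look at what comes back.  The sketch below 
does that for the two functions mentioned in this message; the 
helper name result_kind and the sample inputs are assumptions, and 
the answers will depend on the Python version it is run under.

```python
from urllib.parse import urljoin, parse_qsl


def result_kind(func, *args):
    """Call func(*args) and name the type of the innermost first result."""
    try:
        result = func(*args)
    except Exception as exc:
        return 'raised ' + type(exc).__name__
    # Unwrap non-empty lists/tuples, e.g. parse_qsl's list of pairs.
    while isinstance(result, (list, tuple)) and result:
        result = result[0]
    return type(result).__name__


# Each probe passes bytes only; a consistent API should answer in bytes.
probes = {
    'urljoin': (urljoin, b'http://example.com/app/', b'subdir'),
    'parse_qsl': (parse_qsl, b'a=1&b=2'),
}

report = {name: result_kind(func, *args)
          for name, (func, *args) in probes.items()}

for name, kind in sorted(report.items()):
    print(name, '->', kind)
```

A survey like this only covers the inputs it tries, so it cannot 
show that an API never changes its return type under other runtime 
conditions.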

And it's precisely the part where I don't know how many remain that 
keeps me from doing more than idly thinking about porting any of my 
libraries (let alone apps) to Python 3.x.  The fact that the stdlib 
itself has these sorts of issues raises major red flags to me about 
whether the One Obvious Way has yet been found.  If the stdlib 
maintainers don't agree on the One Obvious Way, that seems even 
worse.  Or if there is such a Way, but nobody has documented its 
practices yet, that's almost the same thing.

I also find it weird that there seem to be two camps on this subject, 
one of which claims that All Is Well And There Is No Problem.  Yet I 
do not recall seeing anyone from the "What do I do; this doesn't seem 
ready" camp switch sides and take the time to write down what made 
them realize they were wrong about there being a problem, and what 
steps they had to take.  The existence of one or more such documents 
would certainly ease my mind, and I imagine that of other people who 
are waiting less for others' libraries than for the stdlib (and/or 
language) itself to settle.

(Or more precisely, for it to be SEEN to have settled.)