[Python-Dev] bytes / unicode
P.J. Eby
pje at telecommunity.com
Mon Jun 21 04:30:01 CEST 2010
At 11:47 PM 6/20/2010 +0200, Antoine Pitrou wrote:
>On Sun, 20 Jun 2010 14:40:56 -0400
>"P.J. Eby" <pje at telecommunity.com> wrote:
> >
> > Actually, I would say that it's more that (in the network protocol
> > case) we *have* bytes, some of which we would like to *treat* as
> > text, yet do not wish to constantly convert back and forth to
> > full-blown unicode
>
>Well, then why don't you just stick with a bytes object?
Because the stdlib is not consistent in how well it handles bytes objects.
> > While reading over this thread, I'm wondering whether at least my
> > (WSGI-related) problems in this area would be solved by the
> > availability of a type (say "bstr") that was simply a wrapper
> > providing string-like behavior over an underlying bytes, byte array,
> > or memoryview, that would produce objects of compatible type when
> > combined with strings (by encoding them to match).
>
>This really sounds horrible. Python 3 was designed precisely to
>discourage ad hoc mixing of bytes and unicode.
Who said ad hoc mixing? The point is to have a simple way to ensure
that my bytes don't get implicitly converted to unicode, and
(ideally) don't have to get converted *back*, either.
The idea that by passing bytes to the stdlib, I randomly get back
either bytes or unicode (i.e. without documentation, inconsistently
across different stdlib APIs, and possibly depending on runtime
conditions) is NOT "discouraging ad hoc mixing".
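To make the "bstr" idea quoted above a little more concrete, here is a rough sketch of what such a wrapper might look like. Nothing like this exists in the stdlib: the class name, the default 'latin-1' encoding, and the choice to support only concatenation are all assumptions made for illustration.

```python
class bstr:
    """Hypothetical wrapper: bytes that stay bytes when combined with str."""

    def __init__(self, data, encoding='latin-1'):
        # Accepts bytes, bytearray, memoryview, or another bstr.
        self._data = bytes(data)
        self.encoding = encoding

    def _coerce(self, other):
        # Turn the other operand into raw bytes, encoding str to match.
        if isinstance(other, bstr):
            return other._data
        if isinstance(other, str):
            return other.encode(self.encoding)
        if isinstance(other, (bytes, bytearray, memoryview)):
            return bytes(other)
        return NotImplemented

    def __add__(self, other):
        raw = self._coerce(other)
        if raw is NotImplemented:
            return NotImplemented
        return bstr(self._data + raw, self.encoding)

    def __radd__(self, other):
        raw = self._coerce(other)
        if raw is NotImplemented:
            return NotImplemented
        return bstr(raw + self._data, self.encoding)

    def __bytes__(self):
        return self._data

    def __repr__(self):
        return 'bstr(%r)' % (self._data,)


# Combining with a str yields another bstr, never unicode.
path = bstr(b'/app/') + 'subdir'
prefixed = 'http://example.com' + path
```

The point of the sketch is only that the result type is always the wrapper, so no implicit conversion to unicode can happen; a real version would need the rest of the string methods as well.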
> > seems so much saner than writing *this* everywhere:
> >
> > newurl = str(urljoin(str(base, 'latin-1'), 'subdir'), 'latin-1')
>
>urljoin already returns an str object. Why do you want to decode it
>again?
Ugh. I meant:
newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')
Which just goes to the point of how ridiculous it is to have to
convert things to strings and back again to use APIs that ought to
just handle bytes properly in the first place.
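For reference, a minimal runnable version of the corrected one-liner above. The helper name urljoin_bytes and the example URL are made up for illustration; the only real API used is urllib.parse.urljoin. Latin-1 is used because it maps every byte value to exactly one code point, so the decode/encode round trip cannot lose data.

```python
from urllib.parse import urljoin


def urljoin_bytes(base, rel):
    """Hypothetical helper: join a bytes base URL with a str relative part."""
    # Decode to str only so the str-based API will accept it...
    text_base = str(base, 'latin-1')
    joined = urljoin(text_base, rel)
    # ...then encode straight back to bytes for the caller.
    return joined.encode('latin-1')


base = b'http://example.com/app/'
newurl = urljoin_bytes(base, 'subdir')

# Latin-1 round-trips every possible byte value unchanged.
all_bytes = bytes(range(256))
roundtrip = all_bytes.decode('latin-1').encode('latin-1')
```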
(I don't know if there are actually any problems in the case of
urljoin; I wasn't the person who originally brought up the "stdlib
not treating URLs as bytestrings in 3.x" issue on the
Web-SIG. Somewhere along the line I got the impression that urljoin
was one such API, but in researching the issue it looks like maybe
the canonical example was parse_qsl.)
It's possible that the stdlib situation has improved tremendously
since then, of course. I don't know if the bug was reported, or how
many remain.
And it's precisely the part where I don't know how many remain that
keeps me from doing more than idly thinking about porting any of my
libraries (let alone apps) to Python 3.x. The fact that the stdlib
itself has these sorts of issues raises major red flags to me about
whether the One Obvious Way has yet been found. If the stdlib
maintainers don't agree on the One Obvious Way, that seems even
worse. Or if there is such a Way, but nobody has documented its
practices yet, that's almost the same thing.
I also find it weird that there seem to be two camps on this subject,
one of which claims that All Is Well And There Is No Problem -- but I
do not recall seeing anyone who was in the "What do I do; this
doesn't seem ready" camp who switched sides and took the time to
write down what made them realize that they were wrong about there
being a problem, and what steps they had to take. The existence of
one or more such documents would certainly ease my mind, and I
imagine that of other people who are waiting less for others'
libraries than for the stdlib (and/or language) itself to settle.
(Or more precisely, for it to be SEEN to have settled.)