[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Wed Feb 15 00:13:47 CET 2006

On 2/13/06, Phillip J. Eby <pje at telecommunity.com> wrote:
> At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
> >On 2/13/06, Phillip J. Eby <pje at telecommunity.com> wrote:
> > > I didn't mean that it was the only purpose.  In Python 2.x, practical code
> > > has to sometimes deal with "string-like" objects.  That is, code that takes
> > > either strings or unicode.  If such code calls bytes(), it's going to want
> > > to include an encoding so that unicode conversions won't fail.
> >
> >That sounds like a rather hypothetical example. Have you thought it
> >through? Presumably code that accepts both str and unicode either
> >doesn't care about encodings, but simply returns objects of the same
> >type as the arguments -- and then it's unlikely to want to convert the
> >arguments to bytes; or it *does* care about encodings, and then it
> >probably already has to special-case str vs. unicode because it has to
> >control how str objects are interpreted.
>
> Actually, it's the other way around.  Code that wants to output
> uninterpreted bytes right now and accepts either strings or Unicode has to
> special-case *unicode* -- not str, because str is the only "bytes type" we
> currently have.

But this is assuming that the str input is indeed uninterpreted bytes.
That may be a tacit assumption or agreement but it may be wrong. Also,
there are many ways to interpret "uninterpreted bytes" -- is it an
image, a sound file, or UTF-8 text? In 2 out of those 3, passing
unicode is more likely a bug than anything else (except in Jython).

> This creates an interesting issue in WSGI for Jython, which of course only
> has one (unicode-based) string type now.  Since there's no bytes type in
> Python in general, the only solution we could come up with was to treat
> such strings as latin-1:

I believe that's the general convention in Jython, as it matches the
default (albeit deprecated) conversion between bytes and characters in
Java itself.

>      http://www.python.org/peps/pep-0333.html#unicode-issues
>
> This is why I'm biased towards latin-1 encoding of unicode to bytes; it's
> "the same thing" as an uninterpreted string of bytes.

But in CPython this is not how this is generally done.

> I think the difference in our viewpoints is that you're still thinking
> "string" thoughts, whereas I'm thinking "byte" thoughts.  Bytes are just
> bytes; they don't *have* an encoding.

I think when one side of the equation is Unicode, in CPython, I can be
forgiven for thinking string thoughts, since Unicode is never used to
carry binary bytes in CPython.

You may have to craft some kind of different rule for Jython; it
doesn't have a default encoding used when str meets unicode.

> So, if you think of "converting a string to bytes" as meaning "create an
> array of numerals corresponding to the characters in the string", then this
> leads to a uniform result whether the characters are in a str or a unicode
> object.  In other words, to me, bytes(str_or_unicode) should be treated as:
>
>      bytes(map(ord, str_or_unicode))
>
> In other words, without an encoding, bytes() should simply treat str and
> unicode objects *as if they were a sequence of integers*, and produce an
> error when an integer is out of range.  This is a logical and consistent
> interpretation in the absence of an encoding, because in that case you
> don't care about the encoding - it's just raw data.

I see your point (now that you mentioned Jython). But I still don't
think that this is a good default for CPython.

> If, however, you include an encoding, then you're stating that you want to
> encode the *meaning* of the string, not merely its integer values.

Note that in Python 3000 we won't be using str/unicode to carry
integer values around, since we will have the bytes type. So there, it
makes sense to think of the conversion to always involve an encoding,
possibly a default one. (And I think the default might more usefully
be UTF-8 then.)

> >What would bytes("abc\xf0", "latin-1") *mean*? Take the string
> >"abc\xf0", interpret it as being encoded in XXX, and then encode from
> >XXX to Latin-1. But what's XXX? As I showed in a previous post,
> >"abc\xf0".encode("latin-1") *fails* because the source for the
> >encoding is assumed to be ASCII.
>
> I'm saying that XXX would be the same encoding as you specified.  i.e.,
> including an encoding means you are encoding the *meaning* of the string.

That would be the same as ignoring the encoding argument when the
input is str in CPython 2.x, right? I believe we started out saying we
didn't want to ignore the encoding. Perhaps we need to reconsider
that, given the Jython requirement? Then code that converts str to
bytes and needs to be portable between Jython and CPython could write

  b = bytes(s, "latin-1")

> However, I believe I mainly proposed this as an alternative to having
> bytes(str_or_unicode) work like bytes(map(ord,str_or_unicode)), which I
> think is probably a saner default.

Sorry, i still don't buy that.

> >Your argument for symmetry would be a lot stronger if we used Latin-1
> >for the conversion between str and Unicode. But we don't.
>
> But that's because we're dealing with its meaning *as a string*, not merely
> as ordinals in a sequence of bytes.

Well, *sometimes* the user *meant* it as a string, and *sometimes* she
*didn't*. But we can't tell. I think it's safer to force her to be
explicit.

> >  I like the
> >other interpretation (which I thought was yours too?) much better: str
> ><--> bytes conversions don't use encodings by simply change the type
> >without changing the bytes;
>
> I like it better too.  The part you didn't like was where MAL and I believe
> this should be extended to Unicode characters in the 0-255 range also.  :)

I still don't.

> >There's one property that bytes, str and unicode all share: type(x[0])
> >== type(x), at least as long as len(x) >= 1. This is perhaps the
> >ultimate test for string-ness.
> >
> >Or should b[0] be an int, if b is a bytes object? That would change
> >things dramatically.
>
> +1 for it being an int.  Heck, I'd want to at least consider the
> possibility of introducing a character type (chr?) in Python 3.0, and
> getting rid of the "iterating a string yields strings"
> characteristic.  I've found it to be a bit of a pain when dealing with
> heterogeneous nested sequences that contain strings.

Can you give an example of that pain?

What would a chr type behave like?  Would it be a tiny int or a tiny
string or something else again? Would you write its literals as 'c'?

This would be a huge change and it needs more thought going into it
than "sometimes it can be a bit of a pain", since I'm sure that's also
true with the char-as-tiny-int interpretation.

> >There's also the consideration for APIs that, informally, accept
> >either a string or a sequence of objects. Many of these exist, and
> >they are probably all being converted to support unicode as well as
> >str (if it makes sense at all). Should a bytes object be considered as
> >a sequence of things, or as a single thing, from the POV of these
> >types of APIs? Should we try to standardize how code tests for the
> >difference? (Currently all sorts of shortcuts are being taken, from
> >isinstance(x, (list, tuple)) to isinstance(x, basestring).)
>
> I'm inclined to think of certain features at least in terms of the buffer
> interface, but that's not something that's really exposed at the Python level.

(And where it is, it's wrong. :-)

If bytes support the buffer interface, we get another interesting
issue -- regular expressions over bytes. Brr.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)