[Python-Dev] Python 3.x and bytes

Stephen J. Turnbull stephen at xemacs.org
Thu May 19 10:00:24 CEST 2011


Robert Collins writes:

 > Thats separate to the implementation issues I have mentioned in this
 > thread and previous.

Oops, sorry.

Nevertheless, I personally think that b'a'[0] == 97 is a good idea,
and consistent with everything else in Python.  It's Unicode (str)
that is weird.  str is surprising when first encountered by a C or
Lisp programmer, but not enough to cause a heart attack given how
weird natural language is.  But I don't see why that weirdness (an
element of LIST of TYPE is a LIST of TYPE, hey, young man, you're very
smart but *it's turtles all the way down!*) should be replicated
elsewhere.
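
To make the contrast concrete, here is a minimal sketch for any Python 3
interpreter.  The variable names are only for illustration.

```python
# Indexing a bytes object yields an int; indexing a str yields another str.
data = b'a'
text = 'a'

byte_elem = data[0]       # an int, the value of the byte
char_elem = text[0]       # a str of length 1
nested = text[0][0][0]    # still a str: "turtles all the way down"

assert byte_elem == 97
assert isinstance(char_elem, str) and char_elem == text
assert nested == 'a'

# Slicing, unlike indexing, preserves the container type for both.
assert data[0:1] == b'a'
assert text[0:1] == 'a'
```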

If you want your bytes object to behave like a str, it's very easy to
get that (.decode('latin1')), and nobody has yet demonstrated that
this is too time-inefficient for real work, given the other overhead
imposed by Python.  The space inefficiency could be dealt with as Greg
points out (by internally having a Unicode representation using 1 byte
instead of 2 or 4).  But if you want your bytes object to *be* a
string, then you're confused.  It isn't (any more).  Even if it's just
a matter of flipping one bit in the type field, a
str-with-unibyte-representation is not equal to a bytes object with
the same bytes.
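
As a rough sketch of both halves of that point: latin1 maps each byte
value 0..255 to the code point with the same number, so the decode
round-trips losslessly, and yet the resulting str never compares equal to
the bytes it came from.  The names below are illustrative only.

```python
# Every possible byte value, decoded one-to-one into code points 0..255.
raw = bytes(range(256))
as_text = raw.decode('latin1')

assert len(as_text) == len(raw)
assert [ord(c) for c in as_text] == list(raw)

# Encoding with the same codec recovers the original bytes exactly.
round_tripped = as_text.encode('latin1')
assert round_tripped == raw

# A str and a bytes object are different types and never compare equal,
# even when they hold "the same" values.
same = (b'abc' == 'abc')
assert same is False
```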

For example, you write:

 > urlparse converting bytes to 'str' to operate on them is at best a
 > kludge - you're forcing 5 times the storage (the original bytes + 4
 > bytes-per-byte when its decoded into unicode) to work on something
 > which is defined as a BNF * that uses ascii *.

Indeed it (RFC 3986) does *use* ASCII.  But I think there is confusion
in your words.  This is what the RFC says about that use of ASCII:

   2.  Characters

   The URI syntax provides a method of encoding data, presumably for the
   sake of identifying a resource, as a sequence of characters.  [...]

   The ABNF notation defines its terminal values to be non-negative
   integers (codepoints) based on the US-ASCII coded character set
   [ASCII].  Because a URI is a sequence of characters, we must invert
   that relation in order to understand the URI syntax.  Therefore, the
   integer values used by the ABNF must be mapped back to their
   corresponding characters via US-ASCII in order to complete the syntax
   rules.

I.e., ASCII is *irrelevant* to (the modern definition of) URLs except as
it is a convenient and familiar way to refer to a certain rather small
set of *characters*.  There are reasons for this (that
I'm not going to rehash here), and they are the *same* reasons why
Python 3's behavior is "correct" IMHO (modulo the issue about the type
of a list element, which I discuss above).
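
A small sketch of what "mapped back to their corresponding characters"
means in practice.  It assumes the ALPHA rule from the ABNF core rules
(%x41-5A / %x61-7A); the helper name chars_for is made up for this
example.

```python
# ABNF terminals are integers; the grammar is about the characters they name.
# ALPHA = %x41-5A / %x61-7A in the ABNF core rules.
ALPHA_RANGES = [(0x41, 0x5A), (0x61, 0x7A)]

def chars_for(ranges):
    """Map inclusive integer ranges back to the characters they denote."""
    return ''.join(chr(cp) for lo, hi in ranges for cp in range(lo, hi + 1))

alpha = chars_for(ALPHA_RANGES)

assert alpha[0] == 'A' and alpha[-1] == 'z'
assert len(alpha) == 52
assert all(c.isalpha() for c in alpha)
```

The result is a str of characters, not a bytes object; the integers serve
only as names for the characters.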

It is true that one might like there to be a literal that expresses
`ord(bytes-object-of-length-one)', i.e., something like o'a' == 97.
(This is different from Greg's x'6465616462656566' == b'deadbeef',
which I don't think helps solve the confusion problem although it
would definitely be convenient.)
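
Neither o'a' nor x'...' is valid Python syntax; both are hypothetical
literals.  As a sketch, the same values can be obtained today with
existing calls:

```python
# What a hypothetical o'a' literal would denote.
code = ord(b'a')
assert code == 97
assert code == b'a'[0]

# What a hypothetical x'6465616462656566' literal would denote.
from_hex = bytes.fromhex('6465616462656566')
assert from_hex == b'deadbeef'
```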

