[Python-Dev] Python 3.x and bytes

Thu May 19 10:43:54 CEST 2011

OK, summarising the thread so far from my point of view.

1. There are some aspects of the behavior of bytes() objects that
tempt people to think of them as string-like objects (primarily the
b'' literals and their use in repr(), along with the fact that they
fill roles that were filled by str in its "arbitrary binary data"
incarnation in Python 2.x). The mental model this creates in the
reader is incorrect, as bytes objects are far closer to array.array('B') in
their underlying behaviour (and deliberately so - cf. PEP 358, 3112,
3137).
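
To make the "sequence of integers" model concrete, here is a minimal
sketch of current Python 3 behaviour. The variable names are mine and
purely illustrative.

```python
# Python 3 semantics: a bytes object is a sequence of small integers.
data = b"abc"

first = data[0]          # indexing yields an int, not a length-1 bytes object
as_ints = list(data)     # iteration yields ints as well
first_slice = data[0:1]  # only slicing gives back a bytes object

print(first)        # 97
print(as_ints)      # [97, 98, 99]
print(first_slice)  # b'a'
```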

One proposal for addressing this is to add an x'deadbeef' literal and
use that in repr() rather than the bytestring form. Another would be to
escape all characters, even printable ASCII, in the bytes()
representation. Both of these are undesirable, as they miss the
original purpose of this behaviour: making it easier to work with the
many ASCII based wire protocols that are in widespread use.
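
As a rough illustration of that trade-off, the sketch below compares
the existing repr() with a pure hex rendering of the same data. The
request line is a made-up example, and binascii.hexlify() only stands
in for what an all-hex representation might look like; it is not the
proposed x'' syntax.

```python
import binascii

# Hypothetical fragment of an ASCII based wire protocol.
request = b"GET /index.html HTTP/1.1\r\n"

# Current behaviour: printable ASCII is shown as-is, the rest is escaped.
ascii_view = repr(request)

# Stand-in for an all-hex representation of the same bytes.
hex_view = binascii.hexlify(request).decode("ascii")

print(ascii_view)
print(hex_view)
```

The first line of output can be read at a glance when debugging a
protocol exchange; the second cannot.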

To be honest, I don't think there is a lot we can do here except to
further emphasise in the documentation and elsewhere that *bytes is
not a string type* (regardless of any API similarities retained to
ease transition from the 2.x series). For example, if we have any
lingering references to "byte strings" they should be replaced with
"byte sequences" or "bytes objects" (depending on context, as the
former phrasing also encompasses bytearray objects).

2. As a concrete usability issue, it is awkward to programmatically
check the value of a specific byte when working with an ASCII based
protocol:

  data[i] == b'a'      # Intuitive, but always False due to type mismatch
  data[i:i+1] == b'a'  # Works, but clumsy
  data[i] == b'a'[0]   # Ditto (but at least susceptible to compiler
                       # const-expression optimisation)
  data[i] == ord('a')  # Clumsy and slow
  data[i] == 97        # Hard to read
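
For anyone who wants to check these for themselves, a small runnable
version follows. The sample data and index are arbitrary choices of
mine.

```python
data = b"spam"
i = 2  # data[2] is the byte for "a"

intuitive = data[i] == b"a"          # int compared with bytes
sliced = data[i:i + 1] == b"a"       # bytes compared with bytes
indexed = data[i] == b"a"[0]         # int compared with int
via_ord = data[i] == ord("a")        # int compared with int
magic_number = data[i] == 97         # int compared with int

print(intuitive)     # False
print(sliced)        # True
print(indexed)       # True
print(via_ord)       # True
print(magic_number)  # True
```

Note that the first comparison fails silently: it raises no exception,
it just evaluates to False.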

Proposals to address this include:
- Introduce a "character" literal to allow c'a' as an alternative to
  ord('a').
    - Potentially workable, but leaves the intuitive answer above
      silently producing an unexpected answer.
- Allow 1-element byte sequences to compare equal to the corresponding
  integer values.
    - Would require reworking bytes.__hash__ to use the hash of the
      contained element when the data length is exactly 1.
    - Transitivity of equality would recommend also supporting
      equivalences such as b'a' == 97.0.
    - Backwards compatibility concerns arise from the introduction of
      new key collisions in dictionaries, sets and other value based
      containers.
    - Yet more string-like behaviour in a type that is *not* a string
      (further reinforcing the mistaken impression from point 1).
    - One thing that *isn't* a concern from my point of view is
      allowing implicit coercion in comparison operations while
      disallowing it in arithmetic operations: we have ample precedent
      for that in decimal.Decimal (Decimal("1") == 1.0 is allowed, but
      Decimal("1") + 1.0 will raise TypeError).
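
The Decimal precedent and the hashing constraint can both be seen in
current behaviour. The sketch below only demonstrates what Python does
today; the closing comment about the proposal is my reading of its
consequences, not implemented behaviour.

```python
from decimal import Decimal

one = Decimal("1")

# Comparison across numeric types is allowed...
equal_to_float = (one == 1.0)

# ...but mixed Decimal/float arithmetic is rejected.
try:
    one + 1.0
    mixed_arithmetic_ok = True
except TypeError:
    mixed_arithmetic_ok = False

# Objects that compare equal must hash equal, which is why the
# proposal would require changes to bytes.__hash__.
hashes_agree = hash(one) == hash(1.0) == hash(1)

# Today b'a' and 97 are unequal, so they are distinct dictionary keys.
table = {b"a": "bytes key", 97: "int key"}
distinct_keys = len(table)

print(equal_to_float)       # True
print(mixed_arithmetic_ok)  # False
print(hashes_agree)         # True
print(distinct_keys)        # 2

# Under the proposal, b'a' == 97 would hold, so the two entries above
# would collapse into one: the key collision concern noted in the list.
```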

For point 2, I'm personally +0 on the idea of having 1-element bytes
and bytearray objects delegate hashing and comparison operations to
the corresponding integer object. We have the power to make the
obvious code correct code, so let's do that. However, the implications
of the additional key collisions in value based containers may need to
be explored further.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia