[Python-Dev] Python 3.x and bytes

Thu May 19 19:43:02 CEST 2011

On Thu, May 19, 2011 at 1:43 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> OK, summarising the thread so far from my point of view.
>
> 1. There are some aspects of the behavior of bytes() objects that
> tempt people to think of them as string-like objects (primarily the
> b'' literals and their use in repr(), along with the fact that they
> fill roles that were filled by str in its "arbitrary binary data"
> incarnation in Python 2.x). The mental model this creates in the
> reader is incorrect, as bytes() are far closer to array.array('c') in
> their underlying behaviour (and deliberately so - cf. PEP 358, 3112,
> 3137).

I think most of this "wrong mental model" is actually due to people
not having completely internalized the Python 3 way.

> One proposal for addressing this is to add an x'deadbeef' literal and
> use that in repr() rather than the bytestring. Another would be to
> escape all characters, even printable ASCII, in the bytes()
> representation. Both of these are undesirable, as they miss the
> original purpose of this behaviour: making it easier to work with the
> many ASCII based wire protocols that are in widespread use.

Indeed, -1 on both.

> To be honest, I don't think there is a lot we can do here except to
> further emphasise in the documentation and elsewhere that *bytes is
> not a string type* (regardless of any API similarities retained to
> ease transition from the 2.x series). For example, if we have any
> lingering references to "byte strings" they should be replaced with
> "byte sequences" or "bytes objects" (depending on context, as the
> former phrasing also encompasses bytearray objects).

+1

> 2. As a concrete usability issue, it is awkward to programmatically
> check the value of a specific byte when working with an ASCII based
> protocol:
>
>  data[i] == b'a'      # Intuitive, but always False due to type mismatch
>  data[i:i+1] == b'a'  # Works, but clumsy
>  data[i] == b'a'[0]   # Ditto (but at least susceptible to compiler
>                       # const-expression optimisation)
>  data[i] == ord('a')  # Clumsy and slow
>  data[i] == 97        # Hard to read
>
> Proposals to address this include:
> - introduce a "character" literal to allow c'a' as an alternative to ord('a')

-1; the result is not a *character* but an integer. I personally
favor using b'a'[0] and possibly hiding it in a constant
definition.
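
A minimal sketch of the comparisons listed above and of the "hide it in a
constant" idea. The names ASCII_A and data are hypothetical and chosen only
for illustration; nothing here is proposed API.

```python
# Hypothetical constant: indexing a bytes literal yields an int (97 here).
ASCII_A = b'a'[0]

# Hypothetical sample payload.
data = b'banana'

# Indexing bytes gives an int, so comparing it to a bytes object is False.
intuitive = (data[1] == b'a')

# Slicing keeps the bytes type, so this comparison works.
sliced = (data[1:2] == b'a')

# Comparing int to int works with the named constant or with ord().
via_constant = (data[1] == ASCII_A)
via_ord = (data[1] == ord('a'))
```

With the constant defined once at module level, the check at the call site
reads almost like the intuitive form while still comparing two integers.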

>    Potentially workable, but leaves the intuitive answer above
> silently producing an unexpected answer

I'm not convinced that that problem is any worse than other
comparison-related problems. E.g. b'a' == 'a' also always returns
False (most likely it'll be disguised by at least one operand being a
variable of course.)
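
To illustrate the bytes/str comparison just mentioned, here is a small
sketch. The variable names are hypothetical, and ASCII is assumed as the
encoding purely for the example.

```python
# Hypothetical values standing in for "at least one operand being a variable".
raw = b'a'
text = 'a'

# bytes and str never compare equal in Python 3, whatever their contents.
mismatch = (raw == text)

# Converting one side explicitly makes the comparison meaningful.
decoded_match = (raw.decode('ascii') == text)
encoded_match = (raw == text.encode('ascii'))
```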

> - allow 1-element byte sequences to compare equal to the corresponding
> integer values.
>    - would require reworking of bytes.__hash__ to use the hash of the
> contained element when the data length is exactly 1
>    - transitivity of equality would recommend also supporting
> equivalences such as b'a' == 97.0
>    - backwards compatibility concerns arise due to introduction of
> new key collisions in dictionaries and sets and other value based
> containers
>    - yet more string-like behaviour in a type that is *not* a string
> (further reinforcing the mistaken impression from point 1)
>    - One thing that *isn't* a concern from my point of view is the
> fact that we have ample precedent in decimal.Decimal for supporting
> implicit coercion in comparison operations while disallowing them in
> arithmetic operations (Decimal("1") == 1.0 is allowed, but
> Decimal("1") + 1.0 will raise TypeError).
>
> For point 2, I'm personally +0 on the idea of having 1-element bytes
> and bytearray objects delegate hashing and comparison operations to
> the corresponding integer object. We have the power to make the
> obvious code correct code, so let's do that. However, the implications
> of the additional key collisions in value based containers may need to
> be explored further.

My gut feeling is that this will probably introduce some confusing or
unintended side effects elsewhere, so I am -1 on this change.
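
A short sketch of the two facts this part of the discussion rests on: the
decimal.Decimal precedent, and the way b'a' and 97 behave as dictionary keys
today. The variable names are hypothetical; the code shows current behaviour
only and does not implement the proposal.

```python
from decimal import Decimal

# Precedent cited above: Decimal compares equal to a float...
decimal_equals_float = (Decimal("1") == 1.0)

# ...but refuses mixed arithmetic with one.
try:
    Decimal("1") + 1.0
    mixed_add_raises = False
except TypeError:
    mixed_add_raises = True

# Today a 1-element bytes object and the matching int are unequal,
# so they remain two separate dictionary keys.
table = {b'a': 'bytes key', 97: 'int key'}
distinct_keys = len(table)

# Objects that compare equal must hash equal, as the numeric types do.
# This is why the proposal would require reworking bytes.__hash__.
numeric_hashes_agree = hash(1) == hash(1.0) == hash(Decimal("1"))
```

If b'a' == 97 were made true, the two keys in table would have to collapse
into one, which is the key-collision concern raised above.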

-- 
--Guido van Rossum (python.org/~guido)