[Python-ideas] Stop displaying elements of bytes objects as printable ASCII characters in CPython 3

Thu Sep 11 03:27:19 CEST 2014

On 11 September 2014 09:23, Chris Lasher <chris.lasher at gmail.com> wrote:
> On Wed, Sep 10, 2014 at 3:09 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>
>> In Python 3, "bytes" is still a hybrid type that can hold:
>>
>> * arbitrary binary data
>> * binary data that contains ASCII segments
>
> Let me be clear. Here are things this proposal does NOT include:
>
> * Removing string-like methods from bytes
> * Removing ASCII from bytes literals
>
> Those have proven incredibly useful to the Python community. I appreciate
> that. This proposal does not take these behaviors away from bytes.
>
> Here's what my proposal DOES include:
>
> 1. Adjust the behavior of repr() on a bytes instance such that only
> hexadecimal codes appear. The returned value would be the text displaying
> the bytes literal of hexadecimal codes that would reproduce the bytes
> instance.

This is not an acceptable change, for two reasons:

1. It's a *major* compatibility break. It breaks single source Python
2/3 development, it breaks doctests, it breaks user expectations.
2. It breaks the symmetry between the bytes literal format and their
representation.
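
To make the doctest point concrete, here is a minimal sketch. The
header() function and both docstrings are made up for illustration.
The second docstring shows how the same test would have to be written
under a hex-only repr, and it fails on a current Python 3 interpreter.

```python
import doctest


def header():
    # Hypothetical helper: a magic tag followed by a version byte.
    return b"PNG" + bytes(bytearray([1]))


# Expected output as written today (hybrid ASCII/hex repr).
CURRENT_DOC = """
>>> header()
b'PNG\\x01'
"""

# Expected output as it would have to be written under a hex-only repr.
HEX_ONLY_DOC = """
>>> header()
b'\\x50\\x4e\\x47\\x01'
"""


def run_doc(doc, name):
    parser = doctest.DocTestParser()
    test = parser.get_doctest(doc, {"header": header}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts matter here.
    return runner.run(test, out=lambda text: None)


current = run_doc(CURRENT_DOC, "current")
hex_only = run_doc(HEX_ONLY_DOC, "hex_only")
print(current.failed, hex_only.failed)
```

Every existing doctest that shows a bytes value would need the same
kind of rewrite.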

It's important to remember we changed *from* a pure binary
representation back to the current hybrid representation. It's not an
accident or oversight, it's a deliberate design choice, and the
reasons driving that original decision haven't changed in the last 8+
years.

> 2. Provide a method (suggested: "bytes.asciify") that returns a printable
> representation of bytes that replaces bytes whose values map to printable
> ASCII glyphs with the glyphs. The returned value would be the text
> displaying the bytes literal of ASCII glyphs and hexadecimal codes that
> would reproduce the bytes instance. If you liked the behavior of repr() on
> bytes in Python 3.0 through 3.4 (or 3.5), it's still available via this
> method call!

Except that method call won't be available in Python 2 code, and thus
not usable in single source Python 2/3 code bases. That's still an
incredibly important environment for people to be able to program in,
and we're generally aiming to make the common subset *bigger* in
Python 3.5 (e.g. by adding bytes.__mod__), not smaller.
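
As a rough illustration of what bytes.__mod__ buys single source code,
assuming an interpreter that implements it (the proposal is PEP 461),
the line below is meant to behave the same way on Python 2, where b""
is str. The header name and value are arbitrary.

```python
# Percent-formatting applied directly to a bytes literal.
length = 42
line = b"Content-Length: %d\r\n" % length
print(line)
```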

> 3. Optionally, provide a method (suggested: "bytes.hexlify") which
> implements the code for creating the printable representation of the bytes
> with hexadecimal values only, and call this method in bytes.__repr__.

As per the discussion on issue 9951, it is likely Python 3.5 will
offer bytes.hex() and bytearray.hex() methods (and perhaps even
memoryview.hex()).
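
Where those methods are not available, binascii.hexlify() already
gives a hex-only view on both Python 2 and 3. A small sketch, in which
to_hex is a hypothetical helper name that prefers a .hex() method when
the object has one:

```python
import binascii


def to_hex(data):
    """Return the hexadecimal digits of a bytes-like object as text."""
    if hasattr(data, "hex"):
        return data.hex()
    # Fallback for interpreters without a .hex() method.
    return binascii.hexlify(data).decode("ascii")


print(to_hex(b"abc"))
```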

I have also filed issue 22385 to propose allowing the "x" and "X"
string formatting characters (for str.format and the format builtin)
to accept arbitrary bytes-like objects.

*Additive* changes like that to make it easier to work with pure
binary data are relatively non-controversial (although there may still
be some argument over *which* of those changes are worth including).

> What you haven't said so far, however, and what I still don't know, is
> whether or not the core team has already tried providing a method on bytes
> objects à la the proposed .asciify() for projecting bytes as ASCII
> characters, and rejected that on the basis of it being too inconvenient for
> the vast majority of Python use cases.

That option was never really on the table, as once we decided to
switch back to a hybrid ASCII representation, the obvious design model
to use was the Python 2 str type, which has inherently hybrid
behaviour, and uses the literal form for the "obj == eval(repr(obj))"
round trip.
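
The round trip can be checked directly. This sketch uses
ast.literal_eval rather than eval so that only literals are evaluated;
the sample values are arbitrary.

```python
import ast

samples = [b"", b"abc", b"GET / HTTP/1.1\r\n", bytes(bytearray(range(256)))]

for obj in samples:
    text = repr(obj)
    # The repr is itself a valid bytes literal that rebuilds the object.
    assert ast.literal_eval(text) == obj

round_trip_ok = all(ast.literal_eval(repr(obj)) == obj for obj in samples)
print(round_trip_ok)
```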

> Did the core team try this, before deciding that repr() should
> automatically write printable ASCII characters in place of hex values
> for bytes?
>
> So far, I've heard a lot of requests to keep the behavior because it's
> convenient. But how inconvenient is it to call bytes.asciify()? Are those
> not in favor of changing the behavior of repr() really going to sit behind
> the argument that the effort expended in typing ten more characters ought to
> guarantee that thousands of other programmers are going to have to figure
> out why there's letters in their bytes – or rather, how there's actually NOT
> letters in their bytes?

No, we're not keeping it because it's convenient, we're keeping it
because changing it would be a major compatibility break for (at best)
a small reduction in beginner confusion. This change simply wouldn't
provide sufficient benefit to justify the massive scale of the
disruption it would cause.

By contrast, adding better *binary* representation tools is easy (they
pose no backwards compatibility challenges), and hence the preferred
choice. When teaching beginners, explaining the difference between:

    >>> b"abc"
    b'abc'
    >>> b"abc".hex()
    '616263'

is likely to be pretty straightforward (and will teach them the
relevant concept of ASCII based vs hexadecimal representations for
binary data).

Consider the proposed alternative, which is to instead have to explain:

    >>> b"abc"
    b'\x61\x62\x63'
    >>> b"abc".hex()
    '616263'
    >>> b"abc".ascii()
    'abc'

That's 3 different representations when there are only two underlying
concepts to be learned.
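
The renderings can be emulated side by side on a current interpreter
with ordinary functions. Note that hex_only_repr is a hypothetical
stand-in for the proposed repr() behaviour, not something CPython
provides.

```python
import ast
import binascii


def hex_only_repr(data):
    # Hypothetical: what repr() would return under the proposal.
    escapes = "".join("\\x%02x" % byte for byte in bytearray(data))
    return "b'" + escapes + "'"


data = b"abc"
print(repr(data))                              # hybrid form
print(hex_only_repr(data))                     # proposed form
print(binascii.hexlify(data).decode("ascii"))  # plain hex digits

# Both literal forms rebuild the same object.
assert ast.literal_eval(hex_only_repr(data)) == data
```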

> And once again, we are talking about changing behavior that is unspecified
> by the Python 3 language specification.

Something being underspecified in the language specification doesn't
mean we have free rein to change it on a whim - sometimes it just
means there's an assumed detail that hasn't been explicitly stated,
but implementors of alternative implementations hadn't previously
commented on the omission because they just followed the behaviour of
CPython as the reference interpreter, or the requirements of the
regression test suite.

It's really necessary to look at the regression test suite, along with
the written specification, as things that aren't part of the language
spec are marked as "CPython only". Cases where it's CPython that is
out of line when other interpreter implementations discover a
compatibility issue get filed as CPython bugs (like the one where we
sometimes get the operand precedence wrong if both sequences in a
binary concatenation operation are implemented in C and the sequences
are of different types).

In this case, the underspecification relates to the fact that for
builtin types that have dedicated syntax, the expectation is that
their repr will use that dedicated syntax. This is not currently
stated explicitly in the language reference (and I agree it probably
should be), but it's tested extensively by the regression test suite,
so it becomes a backwards compatibility constraint and an alternative
interpreter compatibility constraint.
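
A quick way to see that expectation in action, as a sketch with
arbitrary sample values: for each builtin type with dedicated syntax,
the repr is a literal that evaluates back to an equal object.

```python
import ast

values = [42, 1.5, "text", b"bytes", [1, 2], (1, 2), {"k": "v"}, {1, 2}, None, True]

# Collect any value whose repr does not evaluate back to an equal object.
mismatches = [v for v in values if ast.literal_eval(repr(v)) != v]
print(mismatches)
```

Empty sets are the obvious exception, since set() has no literal form.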

> The language is gaining a reputation
> for confusing the two

It isn't "gaining" that reputation, it has always had it. The
reputation for it is actually *reducing* over time, as we spend more
time working with other implementations like PyPy, Jython and
IronPython to get the CPython implementation details marked
appropriately.

(C)Python itself hasn't changed in this regard - we're just starting
to do a better job of getting the wildly divergent groups of users
actually talking to each other (with occasional fireworks as people
have to come to grips with some radically different viewpoints on the
nature and purpose of software development).

In particular, we're starting to see folks that had previously focused
almost entirely on the application programming and network service
development side of Python (which tends to heavily abstract away the C
layer) start to learn more about the system orchestration, hardware
automation and scientific programming side of Python that lets you
dive as deeply into the machine internals as you like.

Most language runtimes only let you handle one or the other of those
categories well - CPython is a relatively rare breed in supporting
both, which *does* have consequences that make many of our design
decisions seem weird to folks that aren't looking at *all* the use
cases for the language in general, and the CPython runtime in
particular.

> however, as written by Armin Ronacher [1]:
>
>> Python is definitely a language that is not perfect. However I think what
>> frustrates me about the language are largely problems that have to do with
>> tiny details in the interpreter and less the language itself. These
>> interpreter details however are becoming part of the language and this is
>> why they are important.
>
> I feel passionately this implicit ASCII-translation behavior should not
> propagate into further releases of CPython 3, and I don't want to see it
> become a de facto specification due to calcification.

It's not a de facto specification, it's a deliberate design choice,
made before Python 3.0 was even released, and captured by the
regression test suite.

> We're talking about the next
> 10 to 15 years. Nobody guaranteed the behavior of repr() so far. With the
> bytes.asciify() method (or whatever it may be called), we have a fair
> compromise, plus a more explicit specification of behavior of bytes in
> Python 3.

Lots of folks don't like the fact that CPython doesn't completely hide
the underlying memory model of C from the user - it's a deliberately
leaky abstraction. The approach certainly has its downsides, but that
leaky abstraction is what allows people to be confident that they can
use Python as a convenient orchestration language, knowing that they
will have easy access to the kind of low level control offered by C
(and other systems programming languages) if they need it. This is why
the scientific Python stack currently works best on CPython, with the
ports to PyPy, Jython and IronPython (which all abstract away the C
layer far more heavily) at varying stages of maturity - it's simply
harder to do array oriented programming in those environments, since
the language runtimes weren't built with that use case in mind
(neither was CPython, but the relatively close coupling to the C layer
enabled the capability anyway).

Computers are complicated layers of messy and leaky abstractions.
Working too hard at hiding those layers from the user just means
developers can't bypass the abstraction easily when they know what
they need for their current use case better than the original author
of the language runtime.

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia