[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Sun Jan 12 08:48:52 CET 2014

On 12 January 2014 02:33, M.-A. Lemburg <mal at egenix.com> wrote:
> On 11.01.2014 16:34, Nick Coghlan wrote:
>> While that was an *expedient* (and, in fact, necessary) solution at
>> the time, the fact it is still thoroughly confusing people 13 years
>> later shows it is not a *comprehensible* solution.
>
> FWIW: I quite liked the Python 2 model, but perhaps that's because
> I already knew how Unicode works, so could use it to make my
> life easier ;-)

Right, I tried to capture that in
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3
by pointing out that there are two *very* different kinds of code to
consider when discussing text modelling.

Application code lives in a nice clean world of structured data, text
data and binary data, with clean conversion functions for switching
between them.

Boundary code, by contrast, has to deal with the messy task of
translating between them all.

The Python 2 text model is a convenient model for boundary code,
because it implicitly allows switching between binary and text
interpretations of a data stream, and that's often useful due to the
way protocols and file formats are designed.

However, that kind of implicit switching is thoroughly inappropriate
for *application* code. So Python 3 switches the core text model to
one where implicitly switching between the binary domain and the text
domain is considered a *bad* thing, and we object strongly to any
proposals which suggest blurring the boundaries again, since that is
going back to a boundary code model rather than an application code
one.
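
A minimal sketch of what "no implicit switching" means in practice in
Python 3. The helper name try_concat is purely illustrative:

```python
def try_concat(left, right):
    """Return left + right, or the name of the exception it raises."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__

# Mixing bytes and str is rejected rather than silently converted.
mixed = try_concat(b"Length: ", "123")

# Crossing the boundary explicitly, with a named encoding, works.
explicit = try_concat(b"Length: ", "123".encode("ascii"))

print(mixed)
print(explicit)
```

The first call reports a TypeError; the second succeeds because the
conversion from text to binary is spelled out at the call site.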

I've been saying for years that we may need a third type, but it has
been nigh on impossible to get boundary code developers to say
anything more useful than "I preferred the Python 2 model, that was
more convenient for me". Yes, we know it was (we do maintain both of
them, after all, and did the update for the standard library's own
boundary code), but application developers are vastly more common, so
boundary code developers lost out on that one and we need to come up
with solutions that *respect* the Python 3 text model, rather than
trying to change it back to the Python 2 one.

> Seriously, Unicode has always caused heated discussions and
> I don't expect this to change in the next 5-10 years.
>
> The point is: there is no 100% perfect solution either way and
> when you acknowledge this, things don't look black and white anymore,
> but instead full of colors :-)

It would be nice if more boundary code developers actually did that
rather than coming out with accusatory hyperbole and pining for the
halcyon days of Python 2 where the text model favoured their use case
over that of normal application developers.

> Python 3 forces people to actually use Unicode; in Python 2 they
> could easily avoid it. It's good to educate people on how it's
> used and the issues you can run into, but let's not forget
> that people are trying to get work done and we all love readable
> code.
>
> PEP 460 just adds two more methods to the bytes object which come
> in handy when formatting binary data; I don't think it has potential
> to muddy the Python 3 text model, given that the bytes
> object already exposes a dozen other ASCII text methods :-)

I dropped my objections to PEP 460 once Antoine fixed it to respect
the boundaries between binary and text data. It's now a pure binary
interpolation proposal, and one I think is a fine idea - there's no
implicit encoding or decoding involved, it's just a tool for
manipulating binary data.

That leaves the implicit encoding and decoding to the third party
asciistr type, as it should be.

> asciistr is interesting in that it coerces to bytes instead
> of to Unicode (as is the case in Python 2).

Not quite - the idea of asciistr is that it is designed to be a
*hybrid* type, like str was in Python 2. If it interacts with binary
objects, it will give a binary result; if it interacts with text
objects, it will give a text result. This makes it potentially
suitable for constants in hybrid binary/text APIs like
urllib.parse, allowing them to be implemented using a shared code path
once again.
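
The hybrid behaviour can be illustrated with a toy str subclass. This
is not the actual asciistr implementation: the class name HybridAscii
and the SCHEME constant are hypothetical, and the sketch only handles
the case where the hybrid object is the left-hand operand of +:

```python
class HybridAscii(str):
    """Toy hybrid: text result with str operands, bytes with bytes."""

    def __add__(self, other):
        if isinstance(other, bytes):
            # Binary operand: encode ourselves strictly as ASCII first.
            return self.encode("ascii") + other
        # Text operand: ordinary str concatenation.
        return str.__add__(self, other)

# A single constant that can serve both a text and a binary code path.
SCHEME = HybridAscii("http")

text_url = SCHEME + "://example.com"
binary_url = SCHEME + b"://example.com"

print(repr(text_url))
print(repr(binary_url))
```

Because the encoding step uses strict ASCII, a HybridAscii holding
non-ASCII text would raise UnicodeEncodeError when combined with bytes
in this sketch, rather than producing a binary result.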

The initial experimental implementation only works with 7 bit ASCII,
but the UTF-8 caching in the PEP 393 implementation opens up the
possibility of offering a non-strict mode in the future, as does the
option of allowing arbitrary 8-bit data and disallowing interoperation
with text strings in that case.

> At the moment it doesn't cover the more common case bytes + str,
> just str + bytes, but let's assume it would,

Right, I suspect we have some overbroad PyUnicode_Check() calls in
CPython that will need to be addressed before this substitution works
seamlessly - that's one of the reasons I've been asking people to
experiment with the idea since at least 2010 and let us know what
doesn't work (nobody did though, until Benno agreed to try it out
because it sounded like an interesting puzzle - I guess everyone else
just found it easier to accuse us of being clueless idiots rather than
considering trying to meet us halfway).

> then you'd write
>
> ...
> headers += asciistr('Length: %i bytes\n' % 123)

If you're going to wait until *after* the formatting to do the
conversion, you may as well just use encode explicitly:

    headers += ('Length: %i bytes\n' % 123).encode('ascii')

The advantage of asciistr is that it allows you to abstract away the
format strings for the headers in a way explicit encoding doesn't
allow:

    FMT_LENGTH = asciistr('Length: %i bytes\n')

    headers += FMT_LENGTH % 123
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)

You could do it inline as well:

    headers += asciistr('Length: %i bytes\n') % 123

But again, that doesn't offer a lot over simply explicitly encoding
that fragment as ASCII.

> With PEP 460, you could write the above as:
> ...
> headers += b'Length: %i bytes\n' % 123
> headers += b'\n\n'
> body = b'...'
> socket.send(headers + body)
> ...
>
> IMO, that's more readable.

At the cost of introducing an implicit encoding step again - it
interpolates numbers into arbitrary binary sequences as ASCII text.
That is thoroughly inappropriate in Python 3 - serialising
semantically significant structured data (like numbers) as ASCII must
always be opt-in. That opt-in can happen through environmental
configuration (which has its own problems due to some undesirable
default behaviour on POSIX systems - users will "opt in" to ASCII by
mistake, not because they actually intended to), by passing it as an
encoding argument, or by using a third party type like asciistr that
is explicitly documented as only working with ASCII compatible data.
By contrast, with a couple of minor exceptions inherited from Python 2,
the core bytes type is designed to work *correctly* with arbitrary
binary data, and just has some *convenience* operations that assume
ASCII data.
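
A short sketch of that distinction, using only the builtin bytes type.
The variable names and the sample payload are illustrative:

```python
payload = bytes([0x00, 0xFF, 0x61, 0xC3, 0xA9])  # arbitrary binary data

# Core operations such as slicing and concatenation make no ASCII
# assumption about the contents.
framed = b"\x02" + payload[1:] + b"\x03"

# Convenience methods such as upper() only touch ASCII letters;
# all other byte values pass through unchanged.
shouted = payload.upper()

# Serialising a number as ASCII digits is an explicit opt-in step.
length_header = b"Length: " + str(len(payload)).encode("ascii") + b" bytes\n"

print(framed)
print(shouted)
print(length_header)
```

Here the only place a number becomes ASCII text is the line that names
the encoding, which is the opt-in via an encoding argument described
above.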

> Both variants essentially do the same thing: they implicitly
> coerce ASCII text strings to bytes, so conceptually, there's
> little difference.

There's all the difference in the world: asciistr is a separate third
party type that is deliberately designed to only work correctly with
ASCII compatible binary data. If you use it for data that *isn't*
ASCII compatible, then the resulting data corruption is due to using
the wrong type, rather than being an implicit behaviour of a builtin
Python type.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia