[Python-Dev] byteformat() proposal: please critique

Sun Jan 12 08:09:21 CET 2014

On 12 January 2014 12:13, Brett Cannon <brett at python.org> wrote:
> With that flexibility this matches what I have been mulling in the back of
> my head all day. Basically everything that goes in is assumed to be bytes
> unless {:s} says to expect something which can be passed to str() and then
> use some specified encoding in all instances (stupid example following as it
> might be easier with bytes.join, but it gets the point across)::
>
>   formatter = format_bytes('latin1', 'strict')
>   http_response = formatter(
>       b'Content-Type: {:s}\r\n\r\nContent-Length: {:s}\r\n\r\n{}',
>       'image/jpeg', len(data), data)
>
> Nothing fancy, just an easy way to handle having to call str.encode() on
> every text argument that is to end up as bytes as Terry is proposing (and
> I'm fine with defaulting to ASCII/strict with no arguments). Otherwise you
> do what R. David Murray suggested and just have people rely on their own API
> which accepts what they want and then spits out what they want behind the
> scenes.
>
> It basically comes down to how much tweaking of existing Python 2.7
> %/.format() calls people will be expected to make. I'm fine with asking
> people to call a function like what Terry is proposing as it can do away
> with baking in that ASCII is reasonable as well as not require a bunch of
> work without us having to argue over what bytes.format() should or should
> not do. Personally I say bytes.format() is fine but it shouldn't do any text
> encoding which makes its usefulness rather minor (much like the other
> text-like methods that got carried forward in hopes that they would be
> useful to people porting code; maybe we should consider taking them out in
> Python 4 or something if we find out no one is using them).
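
For readers following along, the format_bytes() idea quoted above could
be sketched roughly as follows. This is only an illustration under
stated assumptions: no such function exists in the standard library, the
name and signature are taken from the quoted example, and the sketch
handles only the two field forms used there ({} for bytes, {:s} for
anything that can be passed to str()).

```python
import re

# Hypothetical sketch: only "{}" and "{:s}" fields are recognised.
_FIELD = re.compile(rb"\{(:s)?\}")


def format_bytes(encoding="ascii", errors="strict"):
    """Return a formatter that encodes {:s} arguments with a fixed codec."""

    def formatter(template, *args):
        values = iter(args)

        def substitute(match):
            value = next(values)
            if match.group(1):
                # {:s}: explicitly go value -> text -> bytes.
                return str(value).encode(encoding, errors)
            # {}: the argument must already be binary data.
            if not isinstance(value, (bytes, bytearray)):
                raise TypeError("{} fields accept only bytes-like values")
            return bytes(value)

        return _FIELD.sub(substitute, template)

    return formatter


data = b"abc"
formatter = format_bytes("latin1", "strict")
http_response = formatter(
    b"Content-Type: {:s}\r\nContent-Length: {:s}\r\n\r\n{}",
    "image/jpeg", len(data), data)
```

The codec is chosen once, when the formatter is created, so the encoding
step stays visible at the call site rather than being baked into the
bytes type.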

There are several that are useful for manipulating binary data *as
binary data*, including some of those that assume ASCII compatibility.
Even some of the odd ones (like bytes.title) which we considered
deprecating around 3.2 or so (if I recall correctly) were left because
they're useful for HTTP style headers.
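
A small illustration of that point, using made-up header and payload
values: the methods take bytes and return bytes, and anything outside
the ASCII letter range is left alone.

```python
# ASCII-assuming bytes methods map bytes to bytes; nothing is decoded.
header_name = b"content-length"
titled = header_name.title()

# Bytes outside the ASCII letter range pass through unchanged.
payload = b"\xff\xfeget /index"
shouted = payload.upper()

print(titled)
print(shouted)
```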

The thing about them all is that even though they do assume ASCII
compatibility, they don't do any implicit conversions between raw
bytes and other formats - they're all purely about transforming binary
data. PEP 460 as it currently stands is in the same vein - it doesn't
blur the lines between binary data and other formats, but it *does*
make binary data easier to work with, and in a way that is a subset of
what Python 2 8-bit strings allowed, further increasing the size of
the Python 2/3 source compatible subset.

The line that is crossed by suggestions like including number
formatting in PEP 460 is that those suggestions *do* introduce
implicit encoding from structured semantic data (a numeric value) to a
serialised format (the ASCII text representation of that number).
Implicitly encoding text (even with the ASCII codec and strict error
handling) similarly blurs the line between binary and text data again,
and is the kind of change that gets rejected as attempting to
reintroduce the Python 2 text model back into the Python 3 core types.
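
To make the distinction concrete, here is a minimal sketch of the same
serialisation written out explicitly with the existing bytes API. The
helper name content_length_header is invented for the example; the point
is only that each conversion step is visible in the code.

```python
def content_length_header(body: bytes) -> bytes:
    """Build a Content-Length header line with every conversion explicit."""
    length_text = str(len(body))                # numeric value -> text
    length_bytes = length_text.encode("ascii")  # text -> binary data
    return b"Content-Length: " + length_bytes + b"\r\n"


header = content_length_header(b"abcd")
```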

That said, while I don't think such a hybrid type is appropriate as
part of the *core* text model, I agree that such a type *could* be
useful when implementing protocol handling code. That's why I
suggested "asciicompat" to Benno as the package name for the home of
asciistr - I think it could be a good home for various utilities
designed for working with ASCII compatible binary protocols using a
more text-like API than that offered by the bytes type in Python 3.

I actually see much of this debate as akin to that over the API
changes between Google's original ipaddr module and the ipaddress API
in the standard library. The original ipaddr API is fine *if you
already know how IP networks work* - it plays fast and loose with
terminology, but in a way that you can deal with if you already know
the real meaning of the underlying concepts. However, anyone
attempting to go the other way (learning IP networking concepts from
the ipaddr API) will be hopelessly, hopelessly confused because the
terminology is used in *very* loose ways. So ipaddress tightened
things up and made the names more formally correct, aiming to make it
usable both as an address manipulation library *and* as a way of
learning the underlying IP addressing concepts.
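
As a rough illustration of that tightening, the standard library's
ipaddress module keeps addresses, networks and interfaces as separate
types. The addresses below are documentation-range values picked for
the example.

```python
import ipaddress

address = ipaddress.ip_address("192.0.2.1")          # a single host address
network = ipaddress.ip_network("192.0.2.0/24")       # a network, no host bits
interface = ipaddress.ip_interface("192.0.2.1/24")   # an address on a network

# The three concepts relate to each other in the expected ways.
in_network = address in network
same_host = interface.ip == address
same_network = interface.network == network

# A "network" with host bits set is rejected by default.
try:
    ipaddress.ip_network("192.0.2.1/24")
    host_bits_rejected = False
except ValueError:
    host_bits_rejected = True
```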

I see the Python 2 str type as similar to the ipaddr API - if you
already know what you're doing when it comes to Unicode, then it's
pretty easy to work with. However, if you're trying to use it to
*learn* Unicode concepts, then you're pretty much stuffed, as you get
lost in a maze of twisty values, as the same data type is used with
very different semantics, depending on which end of a data
transformation you're on (although sometimes you'll get a different
data type, depending on the data *values* involved).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia