[Python-ideas] A possible transition plan to bytes-based iteration and indexing for binary data

Sun Jun 15 19:03:16 CEST 2014

On Sun, Jun 15, 2014 at 5:33 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> At PyCon earlier this year, Guido (and others) persuaded me that the
> integer based indexing and iteration for bytes and bytearray in Python
> 3 was a genuine design mistake based on the initial Python 3 design
> which lacked an immutable bytes type entirely (so producing integers
> was originally the only reasonable choice).
>
> The earlier design discussions around PEP 467 (which proposes to clean
> up a few other bits and pieces of that original legacy which PEP 3137
> left in place) all treated "bytes indexing returns an integer" as an
> unchangeable aspect of Python 3, since there wasn't an obvious way to
> migrate to instead returning length 1 bytes objects with a reasonable
> story to handle the incompatibility for Python 3 users, even if
> everyone was in favour of the end result.
>
> A few weeks ago I had an idea for a migration strategy that seemed
> feasible, and I now have a very, very preliminary proof of concept up
> at
> https://bitbucket.org/ncoghlan/cpython_sandbox/branch/bytes_migration_experiment
>
> The general principle involved would be to return an integer *subtype*
> from indexing and iteration operations on bytes, bytearray and
> memoryview objects using the "default" format character. That subtype
> would then be detected in various locations and handled the way a
> length 1 bytes object would be handled, rather than the way an integer
> would be handled. The current proof of concept adds such handling to
> ord(), bytes() and bytearray() (with appropriate test cases in
> test_bytes) giving the following results:
>
> >>> b'hello'[0]
> 104
> >>> ord(b'hello'[0])
> 104
> >>> bytes(b'hello'[0])
> b'h'
> >>> bytearray(b'hello'[0])
> bytearray(b'h')
>
> (the subtype is currently visible at the Python level as "types._BytesInt")
>
> The proof of concept doesn't override any normal integer behaviour,
> but a more complete solution would be in a position to emit a warning
> when the result of binary indexing is used as an integer (either
> always, or controlled by a command line switch, depending on the
> performance impact).
>
> With this integer subtype in place for Python 3.5 to provide a
> transition period where both existing integer-compatible operations
> (like int() and arithmetic operations) and selected bytes-compatible
> operations (like ord(), bytes() and bytearray()) are supported, these
> operations could then be switched to producing a normal length 1 bytes
> object in Python 3.6.
>
> It wouldn't be pretty, and it would be a pain to document, but it
> seems feasible. The alternative is for PEP 367 to add a separate bytes
>

I believe you mean PEP 467.

> iteration method, which strikes me as further entrenching a design we
> aren't currently happy with.
>
> Regards,
> Nick.

We just got rid of the mess of having multiple integer types (int vs long);
it'd be a shame to recreate that problem in any form.

The ship has sailed. Python 3 means bytes indexing returns ints. It's well
defined and code has started to depend on it. People who want a b'A'
instead of 0x41 know to use slice notation [n:n+1] instead of [n] to get a
one-byte bytes(), as that is what is required in code that works in 2.6
through 3.4 today. Anything we do to change it is going to be messier and
more mysterious.
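
For anyone skimming, a minimal sketch of the indexing versus slicing
difference described above (the variable names are only for illustration):

```python
data = b'ABC'

# Indexing: an int on Python 3 (a length-1 str on Python 2).
by_index = data[0]

# Slicing: a length-1 bytes object on Python 2.6+ and Python 3 alike.
by_slice = data[0:1]

print(repr(by_index))  # 65 on Python 3
print(repr(by_slice))  # b'A' on Python 3
```

The slice form is the one that behaves the same way across versions, which
is why cross-version code already leans on it.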

Entertaining the idea anyways: If there is going to be a new type for bytes
indexing, it needs to multiply inherit from both int and bytes so that
isinstance() checks work. We'd need to make sure all C API calls that check
for a specific type actually work with the new one as well (at first glance
I count 57 uses of PyBytes_CheckExact and PyLong_CheckExact in CPython).
The ambiguous operator * and + cases, and any similar ones that Nathaniel
Smith pointed out, would still be a problem and a potential source of
confusion for users.
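
As a rough illustration of why * and + are ambiguous, here is how the two
existing types behave today on Python 3. This is only a sketch of current
behaviour, not of the proposed subtype; a type that acts as both int and
bytes would have to pick one meaning for each operator.

```python
as_int = b'hello'[0]      # 104: what indexing returns on Python 3
as_bytes = b'hello'[0:1]  # b'h': what slicing returns

# '*' is multiplication for an int but repetition for bytes.
int_times = as_int * 3        # 312
bytes_times = as_bytes * 3    # b'hhh'

# '+' is addition for an int but concatenation for bytes.
int_plus = as_int + 1         # 105
bytes_plus = as_bytes + b'i'  # b'hi'

# Adding an int to a real bytes object is an error today.
mixed_error = None
try:
    as_bytes + 1
except TypeError as exc:
    mixed_error = exc
```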

If anything, a new iteration method in PEP 467 that yields length 1 bytes()
makes *some* sense for convenience, but I don't personally see much use for
single byte iteration of any form in a high level language.
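
For reference, that kind of iteration can already be written as a small
helper. The name iter_single_bytes is hypothetical and is not taken from
PEP 467; this is just a sketch built on slicing.

```python
def iter_single_bytes(data):
    """Yield each byte of `data` as a length-1 bytes object.

    Hypothetical helper for illustration; works on bytes and bytearray.
    """
    for i in range(len(data)):
        # Slicing keeps the binary type; bytes() normalises a
        # bytearray slice to an immutable bytes object.
        yield bytes(data[i:i + 1])


chunks = list(iter_single_bytes(b'hi'))
print(chunks)
```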

It is odd to me that str and bytes *ever* supported iteration. How many
times have we each written code to check that a passed argument was "a
sequence but, oh, wait, not a string, because you didn't *really* mean to
do that". That was a Python 1 decision. Oops. :)
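
The check in question usually looks something like the sketch below. The
function name is made up for illustration, and it assumes Python 3.3+ for
collections.abc.

```python
from collections.abc import Sequence


def is_non_string_sequence(obj):
    """Return True for sequences that are not str, bytes or bytearray.

    Hypothetical helper: str and bytes are themselves sequences, so
    they have to be excluded explicitly.
    """
    if isinstance(obj, (str, bytes, bytearray)):
        return False
    return isinstance(obj, Sequence)


print(is_non_string_sequence(['a', 'b']))
print(is_non_string_sequence('ab'))
```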

-gps