[Python-ideas] bytes indexing behavior

Nick Coghlan ncoghlan at gmail.com
Tue Jun 7 16:07:14 EDT 2016


On 7 June 2016 at 12:01, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Serhiy Storchaka writes:
>
>  > I think representing bytes as an array of ints was a good decision. If you
>  > need indexing to return a substring, you should use str instead. It is
>  > also memory efficient thanks to PEP 393.
>
> You can do this by using latin-1 as the codec, but that's pretty
> unpleasant, because of the risk of combining with another str and
> getting mojibake.
>
> I have long thought that it would be interesting to have a codec and
> an extension to PEP 393 that gives "asciibytes" behavior.  That is,
> the codec simply slops the bytes into the 8-bit storage of a string,
> but when joined with another string the result types are:
>
> asciibytes        other arg        result
>  has 8bit           type            type
>    yes            pure ascii     asciibytes
>    yes            asciibytes     asciibytes
>    yes            other str      str with 8bit bytes from asciibytes
>                                  encoded as PEP 383 surrogateescape
>                                  (note: promotes latin1 to 2-byte-wide)
>     no             whatever      whatever
>
> I think Nick actually had a module that worked pretty much like this,
> but he never pushed it.  I've never had time to reason out the
> possible failure modes, though, or the performance issues.  And it's
> not an itch I personally need to scratch.
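
[Editorial illustration, not part of the original message.] The behaviors discussed in the quoted text can be sketched with the standard library alone. The snippet below is a minimal sketch, assuming CPython 3.3 or later (so PEP 393 and PEP 383 apply). It shows bytes indexing versus str indexing, the latin-1 round trip, how mojibake can creep in when latin-1-decoded data is combined with ordinary text, and what surrogateescape does to non-ASCII bytes. All variable names are illustrative only; none of this is taken from asciicompat.

```python
# Latin-1 encoded bytes for "café".
data = b"caf\xe9"

# Indexing bytes yields an int; slicing yields bytes.
first_byte = data[0]        # 99
first_slice = data[0:1]     # b"c"

# Decoding as latin-1 maps each byte to the code point with the same value,
# so indexing the resulting str returns a one-character substring and the
# original bytes can be recovered exactly.
text = data.decode("latin-1")
first_char = text[0]                    # "c"
round_trip = text.encode("latin-1")     # equal to data

# The mojibake risk: bytes that were really UTF-8 look like two unrelated
# Latin-1 characters once decoded this way.
utf8_payload = "é".encode("utf-8")            # b"\xc3\xa9"
smuggled = utf8_payload.decode("latin-1")     # "\xc3\xa9", displayed as "Ã©"

# Combining with text outside Latin-1 also breaks the round trip.
combined = "€ " + smuggled
try:
    combined.encode("latin-1")
    round_trip_survives = True
except UnicodeEncodeError:
    round_trip_survives = False

# PEP 383 surrogateescape: undecodable bytes become lone surrogates in the
# range U+DC80..U+DCFF and are restored on encoding with the same handler.
escaped = b"\xff\xfe".decode("ascii", errors="surrogateescape")
restored = escaped.encode("ascii", errors="surrogateescape")
```

Because the escaped form uses code points above U+00FF, a str that contains it can no longer use the 1-byte PEP 393 representation, which is the promotion noted in the table above.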

That module was written by Benno Rice rather than me (although I gave Benno the idea):
https://github.com/jeamland/asciicompat

Managing extra C dependencies is a pain though, and it's a dubious
idea at best, so neither of us seriously pushed for anyone to use it.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

