PEP 393 vs UTF-8 Everywhere

Tim Chase python.list at tim.thechases.com
Sat Jan 21 14:58:46 EST 2017


On 2017-01-22 01:44, Steve D'Aprano wrote:
> On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:
> 
> > but I'm hard-pressed to come up with any use case where direct
> > indexing into a (non-byte)string makes sense unless you've already
> > processed/searched up to that point and can use a recorded index
> > from that processing/search.
> 
> 
> Let's take a simple example: you do a find to get an offset, and
> then slice from that offset.
> 
> py> text = "αβγдлфxx"
> py> offset = text.find("ф")

Right, so here you've done a search (likely linear, but however you
get there), and it then makes sense to use the resulting opaque
"offset" token for slicing purposes:

> py> stuff = text[offset:]
> py> assert stuff == "фxx"

> That works fine whether indexing refers to code points or bytes.
> 
> py> "αβγдлфxx".find("ф")
> 5
> py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
> 10
> 
> Either way, you get the expected result. However:
> 
> py> stuff = text[offset + 1:]
> py> assert stuff == "xx"
>
> That requires indexes to point to the beginning of *code points*,
> not bytes: taking byte 11 of "αβγдлфxx".encode('utf-8') drops you
> into the middle of the ф representation:
> 
> py> "αβγдлфxx".encode('utf-8')[11:]
> b'\x84xx'
> 
> and it isn't a valid UTF-8 substring. Slicing would generate an
> exception unless you happened to slice right at the start of a code
> point.
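
Decoding that mis-sliced byte string does indeed blow up:

>>> b'\x84xx'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0: invalid start byte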

Right.  It gets even weirder (edge-case'ier) when dealing with
combining characters:


>>> s = "man\N{COMBINING TILDE}ana"
>>> for i, c in enumerate(s): print("%i: %s" % (i, c))
... 
0: m
1: a
2: n
3: ̃
4: a
5: n
6: a
>>> ''.join(reversed(s))
'anãnam'

Slicing at s[3:] produces a (sub)string that begins with a combining
character that has nothing preceding it to combine with.
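
A quick check with unicodedata confirms what that slice starts with:

>>> import unicodedata
>>> unicodedata.name(s[3:][0])
'COMBINING TILDE'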

> It's like seek() and tell() on text files: you cannot seek to
> arbitrary positions, but only to the opaque positions returned by
> tell. That's unacceptable for strings.

I'm still unclear on *why* this would be considered unacceptable for
strings.  It makes sense when dealing with byte-strings, since they
contain binary data that may need to get sliced at arbitrary
offsets.  But for strings, slicing only makes sense (for every
use-case I've been able to come up with) in the context of known
offsets like you describe with tell().  The cost of not using opaque
tell()-like offsets is, as you describe, slicing in the middle of
characters.
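
To make the seek()/tell() analogy concrete, here's a minimal sketch
(the file name is invented, and the cookie's actual value is an
implementation detail, so treat it as opaque):

with open("demo.txt", "w", encoding="utf-8") as f:
    f.write("αβγдлфxx")

with open("demo.txt", encoding="utf-8") as f:
    f.read(5)        # consume five code points
    pos = f.tell()   # an opaque cookie, not necessarily 5
    f.seek(pos)      # round-tripping a tell() cookie is legal
    print(f.read())  # фxx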

> You could avoid that error by increasing the offset by the right
> amount:
> 
> stuff = text[offset + len("ф".encode('utf-8')):]
> 
> which is awful. I believe that's what Go and Julia expect you to do.

It may be awful, but only because it hasn't been pythonified.  If the
result of calling .find() on a string were a "StringOffset" object,
it would make sense for its __add__/__radd__ methods to accept an
integer and do that translation for you.
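
To sketch the idea (the class name and the UTF-8-backed
representation are invented here for illustration; this is not a
real Python API):

class StringOffset:
    """A hypothetical opaque offset into a UTF-8-backed string."""
    def __init__(self, data, byte_pos):
        self.data = data          # the underlying UTF-8 bytes
        self.byte_pos = byte_pos  # position within those bytes

    def __add__(self, n):
        # advance n code points by skipping UTF-8 continuation
        # bytes (those matching 0b10xxxxxx)
        pos = self.byte_pos
        for _ in range(n):
            pos += 1
            while pos < len(self.data) and (self.data[pos] & 0xC0) == 0x80:
                pos += 1
        return StringOffset(self.data, pos)

    __radd__ = __add__

data = "αβγдлфxx".encode('utf-8')
offset = StringOffset(data, data.find("ф".encode('utf-8')))
print(data[(offset + 1).byte_pos:])  # b'xx'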

> You can avoid this by having the interpreter treat the Python-level
> indexes as opaque "code point offsets", and converting them to and
> from "byte offsets" as needed. That's not even very hard. But it
> either turns every indexing into O(N) (since you have to walk the
> string to count which byte represents the nth code point)

The O(N) cost has to be paid at some point, but I'd put forth that
other operations like .find() already pay that O(N) cost and can
return an opaque "offset token" that can be subsequently used for O(1)
indexing (multiple times if needed).
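
For instance (a sketch with an invented helper name; a real
implementation would count code points inside the search loop itself
rather than in a second pass):

def find_with_offsets(data, needle):
    # one O(N) walk yields both the code-point index and the byte
    # offset, so later slicing needs no further scanning
    byte_off = data.find(needle)
    if byte_off == -1:
        return None
    # every byte that is NOT a continuation byte starts a code point
    cp_index = sum(1 for b in data[:byte_off] if (b & 0xC0) != 0x80)
    return cp_index, byte_off

print(find_with_offsets("αβγдлфxx".encode('utf-8'),
                        "ф".encode('utf-8')))  # (5, 10)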

-tkc
