PEP 393 vs UTF-8 Everywhere

Steve D'Aprano steve+python at pearwood.info
Sat Jan 21 09:44:30 EST 2017


On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:

> but I'm hard-pressed to come up with any use case where direct
> indexing into a (non-byte)string makes sense unless you've already
> processed/searched up to that point and can use a recorded index
> from that processing/search.


Let's take a simple example: you do a find to get an offset, and then slice
from that offset.

py> text = "αβγдлфxx"
py> offset = text.find("ф")
py> stuff = text[offset:]
py> assert stuff == "фxx"


That works fine whether indexing refers to code points or bytes.

py> "αβγдлфxx".find("ф")
5
py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
10
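
Slicing the bytes at that byte offset still recovers the right text:

py> "αβγдлфxx".encode('utf-8')[10:].decode('utf-8')
'фxx'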

Either way, you get the expected result. However:

py> stuff = text[offset + 1:]
py> assert stuff == "xx"


That requires indexes to point to the beginning of *code points*, not bytes:
taking byte 11 of "αβγдлфxx".encode('utf-8') drops you into the middle of
the ф representation:

py> "αβγдлфxx".encode('utf-8')[11:]
b'\x84xx'

and it isn't valid UTF-8. Under such a scheme, slicing would have to
raise an exception unless you happened to slice right at the start of a
code point.
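
You can see that directly by trying to decode those bytes:

py> b'\x84xx'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0:
invalid start byte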

It's like seek() and tell() on text files: you cannot seek to arbitrary
positions, but only to the opaque positions returned by tell(). That's
unacceptable for strings.
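
To see the analogy concretely (the file name and contents here are made
up for illustration):

f = open('sample.txt', encoding='utf-8')  # suppose it contains "αβγ"
f.read(1)        # 'α'
pos = f.tell()   # an opaque cookie, not a simple character count
f.seek(pos)      # fine: a position previously returned by tell()
f.seek(1)        # undefined behaviour: may land mid-character and
                 # break the next read()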

You could avoid that error by increasing the offset by the right amount:

stuff = text[offset + len("ф".encode('utf-8')):]

which is awful. I believe that's what Go and Julia expect you to do.

Another solution would be to have the string slicing method automatically
scan forward to the start of the next valid UTF-8 code point. That would be
the "Do What I Mean" solution.

The problem with the DWIM solution is that not only does it add
complexity, but it's frankly *weird*. It would mean:


- if the character at position `offset` fits in 2 bytes:
  text[offset+1:] == text[offset+2:]

- if it fits in 3 bytes:
  text[offset+1:] == text[offset+2:] == text[offset+3:]

- and if it fits in 4 bytes:
  text[offset+1:] == text[offset+2:] == text[offset+3:] == text[offset+4:]
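
Here is a rough sketch of what that forward-scan might look like,
operating on the raw UTF-8 bytes (the function is hypothetical, purely
for illustration):

def dwim_slice(data, start):
    # Skip UTF-8 continuation bytes (0b10xxxxxx) so the slice
    # begins at the start of a code point.
    while start < len(data) and (data[start] & 0xC0) == 0x80:
        start += 1
    return data[start:]

py> dwim_slice("αβγдлфxx".encode('utf-8'), 11)
b'xx'
py> dwim_slice("αβγдлфxx".encode('utf-8'), 12)
b'xx'

Two different indexes, one result.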


Having the string slicing method Do The Right Thing would actually be The
Wrong Thing. It would make it awful to reason about slicing.

You can avoid this by having the interpreter treat the Python-level
indexes as opaque "code point offsets", and converting them to and from
"byte offsets" as needed. That's not even very hard. But it either turns
every indexing operation into O(N) (since you have to walk the string to
count which byte represents the nth code point), or you have to keep an
auxiliary table with every string, letting you convert code point
indexes to byte offsets quickly. That table will significantly increase
the memory size of every string, wiping out the advantage of using UTF-8
in the first place.
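
To make the costs concrete, here is a naive sketch of the O(N)
conversion (the function is hypothetical; a real implementation would
differ):

def codepoint_to_byte_offset(data, index):
    # Walk the UTF-8 bytes, counting code point starts (any byte
    # that isn't a 0b10xxxxxx continuation byte) until we reach
    # the index-th one.
    count = 0
    for byte_offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            if count == index:
                return byte_offset
            count += 1
    raise IndexError('code point index out of range')

py> codepoint_to_byte_offset("αβγдлфxx".encode('utf-8'), 5)
10

Every subscript and slice pays that linear walk, unless you cache the
answers in the auxiliary table described above.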



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



