PEP 393 vs UTF-8 Everywhere

Sat Jan 21 01:38:33 EST 2017

Chris Angelico writes:

> On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote:

>> I was asserting that most useful operations on strings start from
>> index 0. The r* operations would not be slowed down that much as
>> UTF-8 has the useful property that attempting to interpret from a
>> byte that is not at the start of a sequence (in the sense of a code
>> point rather than Python) is invalid and so quick to move over while
>> working backwards from the end.
>
> Let's take one very common example: decoding JSON. A ton of web
> servers out there will call json.loads() on user-supplied data. The
> bulk of the work is in the scanner, which steps through the string and
> does the actual parsing. That function is implemented in Python, so
> it's a good example. (There is a C accelerator, but we can ignore that
> and look at the pure Python one.)
>
> So, how could you implement this function? The current implementation
> maintains an index - an integer position through the string. It
> repeatedly requests the next character as string[idx], and can also
> slice the string (to check for keywords like "true") or use a regex
> (to check for numbers). Everything's clean, but it's lots of indexing.
> Alternatively, it could remove and discard characters as they're
> consumed. It would maintain a string that consists of all the unparsed
> characters. All indexing would be at or near zero, but after every
> tiny piece of parsing, the string would get sliced.
>
> With immutable UTF-8 strings, both of these would be O(n^2). Either
> indexing is linear, so parsing the tail of the string means scanning
> repeatedly; or slicing is linear, so parsing the head of the string
> means slicing all the rest away.
>
> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing. Every string parse
> operation would have to be reimplemented this way, lest it perform
> abysmally on large strings. It'd mean some sort of magic "thing" that
> probably has a reference to the original string, so you don't get the
> progressive RAM refunds that slicing gives, and you'd still have to
> deal with lots of the other consequences. It's probably doable, but it
> would be a lot of pain.

Julia does this. It has immutable UTF-8 strings, and there is a JSON
parser. The "opaque position marker" is just the byte index. An attempt
to use an invalid index throws an error. A substring type points to an
underlying string. An iterator, called graphemes, even returns
substrings that correspond to what people might consider a character.

I offer Julia as evidence.

My impression is that Julia's UTF-8-based system works and is not a
pain. I wrote a toy function once to access the last line of a large
memory-mapped text file, so I have just this little bit of personal
experience of it, so far. Incidentally, can Python memory-map a UTF-8
file as a string?

http://docs.julialang.org/en/stable/manual/strings/
https://github.com/JuliaIO/JSON.jl