PEP 393 vs UTF-8 Everywhere

Chris Angelico rosuav at gmail.com
Fri Jan 20 19:58:01 EST 2017


On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman <petef4+usenet at gmail.com> wrote:
> I was asserting that most useful operations on strings start from index
> 0. The r* operations would not be slowed down that much, as UTF-8 has
> the useful property that a byte in the middle of a sequence (in the
> sense of a code point rather than a Python sequence) is recognizably a
> continuation byte, and so is quick to skip over while working backwards
> from the end.
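
(For context: every UTF-8 continuation byte has the bit pattern
10xxxxxx, so a reverse scan can step to the previous code point without
decoding anything. A minimal sketch of that backward step, written over
a bytes object since Python's str hides its internal storage:)

def prev_char_start(buf: bytes, pos: int) -> int:
    # Step back one byte, then skip continuation bytes (0b10xxxxxx)
    # until we land on the leading byte of a code point.
    pos -= 1
    while pos > 0 and buf[pos] & 0xC0 == 0x80:
        pos -= 1
    return pos

s = "naïve".encode("utf-8")     # b'na\xc3\xafve'
i = len(s)
while i > 0:
    i = prev_char_start(s, i)   # visits offsets 5, 4, 2, 1, 0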

Let's take one very common example: decoding JSON. A ton of web
servers out there will call json.loads() on user-supplied data. The
bulk of the work is in the scanner, which steps through the string and
does the actual parsing. That function is implemented in Python, so
it's a good example. (There is a C accelerator, but we can ignore that
and look at the pure Python one.)
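
(To experiment with that yourself: the following sketch forces the pure
Python scanner by poking at CPython's internal json package layout,
which is not a public API and may change between versions.)

import json.decoder
import json.scanner

# Build a decoder, then swap in the pure-Python scanner so the
# C accelerator is bypassed.
dec = json.decoder.JSONDecoder()
dec.scan_once = json.scanner.py_make_scanner(dec)

obj, end = dec.scan_once('{"key": [1, 2, true]}', 0)
print(obj)   # {'key': [1, 2, True]}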

So, how could you implement this function? The current implementation
maintains an index - an integer position through the string. It
repeatedly requests the next character as string[idx], and can also
slice the string (to check for keywords like "true") or use a regex
(to check for numbers). Everything's clean, but it's lots of indexing.
Alternatively, it could remove and discard characters as they're
consumed. It would maintain a string that consists of all the unparsed
characters. All indexing would be at or near zero, but after every
tiny piece of parsing, the string would get sliced.
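
(A toy illustration of the two shapes, parsing comma-separated integers
instead of JSON, just to show where the indexing and the slicing
happen:)

import re

NUMBER = re.compile(r'\d+')

# Strategy 1: keep an integer index into one immutable string
# (this is what the json scanner does).
def parse_indexed(s):
    idx, out = 0, []
    while idx < len(s):
        m = NUMBER.match(s, idx)
        out.append(int(m.group()))
        idx = m.end() + 1           # step past the comma
    return out

# Strategy 2: slice each consumed token off the front, so all
# indexing stays at or near zero.
def parse_sliced(s):
    out = []
    while s:
        m = NUMBER.match(s)
        out.append(int(m.group()))
        s = s[m.end() + 1:]         # copies the whole remainder
    return out

data = ",".join(str(i) for i in range(1000))
assert parse_indexed(data) == parse_sliced(data)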

With immutable UTF-8 strings, both of these would be O(n^2). Either
indexing is linear, so parsing the tail of the string means scanning
from the start over and over (on the order of 1 + 2 + ... + n, or
about n^2/2 steps, for an n-character document); or slicing is linear,
so parsing the head of the string means copying all the rest away
after every token.
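
(Why indexing is linear under UTF-8: code points occupy one to four
bytes, so finding character idx means walking the bytes from the start
and counting leading bytes. A sketch:)

def char_at(buf: bytes, idx: int) -> str:
    # O(idx), not O(1): count leading bytes from the start.
    count = -1
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:     # leading byte of a code point
            count += 1
            if count == idx:
                end = pos + 1
                while end < len(buf) and buf[end] & 0xC0 == 0x80:
                    end += 1
                return buf[pos:end].decode("utf-8")
    raise IndexError(idx)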

The only way for it to be fast enough would be to have some sort of
retainable string iterator, which means exposing an opaque "position
marker" that serves no purpose other than parsing. Every string parse
operation would have to be reimplemented this way, lest it perform
abysmally on large strings. It'd mean some sort of magic "thing" that
probably has a reference to the original string, so you don't get the
progressive RAM refunds that slicing gives, and you'd still have to
deal with lots of the other consequences. It's probably doable, but it
would be a lot of pain.
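
(Roughly what such a marker might look like, sketched over bytes; the
class and method names here are invented for illustration:)

class StrPos:
    # An opaque position marker into a UTF-8 string. Note that it
    # holds a reference to the whole buffer - exactly the RAM-refund
    # problem above: the marker keeps the original string alive no
    # matter how far it has advanced.
    __slots__ = ("_buf", "_byte")

    def __init__(self, buf: bytes, byte_offset: int = 0):
        self._buf = buf
        self._byte = byte_offset

    def next_char(self):
        # Return (char, new_marker), or (None, self) at end of string.
        if self._byte >= len(self._buf):
            return None, self
        end = self._byte + 1
        while end < len(self._buf) and self._buf[end] & 0xC0 == 0x80:
            end += 1
        return self._buf[self._byte:end].decode("utf-8"), StrPos(self._buf, end)

pos = StrPos("hé!".encode("utf-8"))
while True:
    ch, pos = pos.next_char()
    if ch is None:
        break
    print(ch)   # h, é, ! - each step forward is O(1)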

ChrisA


