PEP 393 vs UTF-8 Everywhere

Paul Rubin no.email at nospam.invalid
Sat Jan 21 01:01:09 EST 2017


Chris Angelico <rosuav at gmail.com> writes:
> decoding JSON... the scanner, which steps through the string and
> does the actual parsing. ...
> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing.

Python already has that type of iterator:
   x = "foo"
   for c in x: ...

> It'd mean some sort of magic "thing" that probably has a reference to
> the original string

It's a regular old string iterator unless I'm missing something.  Of
course a json parser should use it, though who uses the non-C json
parser anyway these days?
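
To illustrate, here is a minimal sketch of a scanner that hands one iterator from helper to helper, so the position is kept without any index or special marker type. The toy grammar (comma-separated integers) and the names scan_digits and parse_int_list are made up for this example and have nothing to do with the real json module.

```python
DIGITS = "0123456789"

def scan_digits(it, first):
    """Consume a run of digits from the shared iterator.

    Returns (number, next_char); next_char is None at end of input.
    """
    digits = [first]
    for c in it:                 # resumes where the caller left off
        if c not in DIGITS:
            return int("".join(digits)), c
        digits.append(c)
    return int("".join(digits)), None

def parse_int_list(text):
    # Hypothetical toy grammar: integers separated by anything else.
    it = iter(text)              # one iterator, retained across helper calls
    result = []
    c = next(it, None)
    while c is not None:
        if c in DIGITS:
            n, c = scan_digits(it, c)
            result.append(n)
        else:
            c = next(it, None)   # skip separators
    return result
```

The iterator itself is the "position marker": passing it into scan_digits and getting control back leaves it pointing just past whatever the helper consumed.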

[Chris Kaynor writes:]
> rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. 

UTF-8 can be read from right to left because you can recognize where a
codepoint begins by looking at the top 2 bits of each byte as you scan
backwards.  A byte whose top bits are 10 is always a continuation byte,
and any other combination (0x for ASCII, 11 for a multi-byte lead byte)
marks the start of a codepoint.  This "prefix property" of UTF-8,
usually called self-synchronization, is a design feature and not a
trick someone noticed after the fact.
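
A rough sketch of the backwards scan, working on a bytes object. It assumes the input is valid UTF-8, and the function names are just for illustration. A byte is a continuation byte exactly when its top two bits are 10, i.e. (b & 0xC0) == 0x80.

```python
def is_continuation(b):
    # Continuation bytes look like 10xxxxxx.
    return (b & 0xC0) == 0x80

def codepoints_reversed(data):
    """Yield the codepoints of valid UTF-8 bytes, last one first."""
    end = len(data)
    while end > 0:
        start = end - 1
        # Step left over continuation bytes until we reach a lead byte.
        while start > 0 and is_continuation(data[start]):
            start -= 1
        yield data[start:end].decode("utf-8")
        end = start
```

Each codepoint costs at most four byte inspections, so things like rstrip or rfind can stop as soon as they have their answer instead of walking from the front.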

Also if you really want O(1) random access, you could put an auxiliary
table into long strings, giving the byte offset of every 256th codepoint
or something like that.  Then you'd go to the nearest table entry and
scan from there.  This would usually be in-cache scanning so quite fast.
Or use the related representation of "ropes" which are also very easy to
concatenate if they can be nested.  Erlang does something like that
with what it calls "binaries".
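
Here is a minimal sketch of the auxiliary-table idea, assuming valid UTF-8 and a fixed stride. The class name IndexedUTF8 and its layout are invented for this example; this is not how CPython or any other implementation stores strings.

```python
class IndexedUTF8:
    """UTF-8 bytes plus a table of byte offsets for every stride-th codepoint."""

    def __init__(self, text, stride=256):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.offsets = []
        count = 0
        for pos, b in enumerate(self.data):
            if (b & 0xC0) != 0x80:          # start of a codepoint
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Jump to the nearest table entry at or before codepoint i ...
        pos = self.offsets[i // self.stride]
        # ... then scan forward over at most stride - 1 codepoints.
        for _ in range(i % self.stride):
            pos += 1
            while (self.data[pos] & 0xC0) == 0x80:
                pos += 1
        end = pos + 1
        while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")
```

The lookup does a bounded amount of work (one table read plus a scan of fewer than stride codepoints), and the table costs one offset per stride codepoints, so the stride trades memory against scan length.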


