PEP 393 vs UTF-8 Everywhere

Sat Jan 21 04:14:20 EST 2017

Chris Angelico <rosuav at gmail.com> writes:
> You can't do a look-ahead with a vanilla string iterator. That's
> necessary for a lot of parsers.

For JSON?  For other parsers you usually have a tokenizer that reads
characters with maybe one character of lookahead.
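
To make that concrete, here is a rough sketch of a tokenizer that pulls
characters from a plain iterator and buffers exactly one of them as
lookahead. The token kinds and the name tokenize are made up for
illustration; this is not taken from any real parser.

```python
def tokenize(text):
    """Yield (kind, value) pairs using a plain iterator plus one
    buffered character of lookahead (hypothetical token kinds)."""
    chars = iter(text)
    end = ""                      # sentinel: iterator is exhausted
    look = next(chars, end)       # the single lookahead character

    while look != end:
        if look.isspace():
            # Skip whitespace by advancing the lookahead.
            look = next(chars, end)
        elif look.isdigit():
            digits = []
            # Peek at `look` to decide whether the number continues.
            while look != end and look.isdigit():
                digits.append(look)
                look = next(chars, end)
            yield ("number", "".join(digits))
        elif look.isalpha() or look == "_":
            name = []
            while look != end and (look.isalnum() or look == "_"):
                name.append(look)
                look = next(chars, end)
            yield ("name", "".join(name))
        else:
            # Anything else is a one-character punctuation token.
            yield ("punct", look)
            look = next(chars, end)


for token in tokenize("foo = 42+x1"):
    print(token)
```

The point is only that the tokenizer never indexes into the string; it
calls next() and remembers one character.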

> Yes, which gives a two-level indexing (first find the strand, then the
> character), and that's going to play pretty badly with CPU caches.

If you're jumping around at random all over the string, you probably
really want a bytearray rather than a unicode string.  If you're
scanning sequentially you won't have to look at the outer table very
often.
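
Here is a rough sketch of one way to read the two-level scheme, under
assumptions that are mine for the sake of the example: the text is
stored as UTF-8, a "strand" is a fixed number of code points, and the
outer table records the byte offset where each strand starts. The names
StrandedText and _width are hypothetical.

```python
def _width(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte < 0xE0:
        return 2
    if lead_byte < 0xF0:
        return 3
    return 4


class StrandedText:
    """UTF-8 bytes plus an outer table of per-strand byte offsets
    (illustrative only; strand size is an arbitrary assumption)."""

    def __init__(self, text, strand=64):
        self.data = text.encode("utf-8")
        self.strand = strand
        self.length = len(text)
        self.offsets = []          # byte offset of each strand's first char
        pos = 0
        for i, ch in enumerate(text):
            if i % strand == 0:
                self.offsets.append(pos)
            pos += len(ch.encode("utf-8"))

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Level 1: the outer table gives the start of the strand.
        pos = self.offsets[index // self.strand]
        # Level 2: walk forward inside the strand to the character.
        for _ in range(index % self.strand):
            pos += _width(self.data[pos])
        return self.data[pos:pos + _width(self.data[pos])].decode("utf-8")

    def __iter__(self):
        # A sequential scan just walks the bytes; in this sketch it
        # never consults the outer table at all.
        pos = 0
        while pos < len(self.data):
            w = _width(self.data[pos])
            yield self.data[pos:pos + w].decode("utf-8")
            pos += w


t = StrandedText("aé€😀xyz", strand=2)
print(t[3], list(t))
```

Random access costs one table lookup plus a short walk within the
strand, while iteration touches the bytes in order.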