PEP 393 vs UTF-8 Everywhere

Sat Jan 21 01:23:02 EST 2017

On Sat, Jan 21, 2017 at 5:01 PM, Paul Rubin <no.email at nospam.invalid> wrote:
> Chris Angelico <rosuav at gmail.com> writes:
>> decoding JSON... the scanner, which steps through the string and
>> does the actual parsing. ...
>> The only way for it to be fast enough would be to have some sort of
>> retainable string iterator, which means exposing an opaque "position
>> marker" that serves no purpose other than parsing.
>
> Python already has that type of iterator:
>    x = "foo"
>    for c in x: ....
>
>> It'd mean some sort of magic "thing" that probably has a reference to
>> the original string
>
> It's a regular old string iterator unless I'm missing something.  Of
> course a json parser should use it, though who uses the non-C json
> parser anyway these days?

You can't do look-ahead with a vanilla string iterator: next() consumes
the character, and there's no way to peek at it or put it back. A lot
of parsers need that.
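
To make the look-ahead problem concrete, here is a minimal sketch of
what a parser ends up writing around the plain iterator. PeekableChars
and read_number are hypothetical names for illustration, not anything
in the stdlib or the json module:

```python
class PeekableChars:
    """Hypothetical wrapper: a string iterator with one character of look-ahead."""

    _END = object()  # sentinel meaning "nothing buffered" or "input exhausted"

    def __init__(self, text):
        self._it = iter(text)
        self._buffered = self._END

    def __iter__(self):
        return self

    def __next__(self):
        # Hand out the buffered character first, if peek() fetched one.
        if self._buffered is not self._END:
            ch, self._buffered = self._buffered, self._END
            return ch
        return next(self._it)

    def peek(self, default=None):
        # Pull one character from the underlying iterator without consuming it.
        if self._buffered is self._END:
            self._buffered = next(self._it, self._END)
        return default if self._buffered is self._END else self._buffered


def read_number(chars):
    """Consume a run of digits, leaving the first non-digit unconsumed."""
    digits = []
    while True:
        ch = chars.peek()
        if ch is None or not ch.isdigit():
            break
        digits.append(next(chars))
    return "".join(digits)
```

With a bare iter("12,3") the scanner would have to consume the comma
just to learn that the number had ended. Note that this only buys one
character of look-ahead; it still can't save a position and come back
to it later, which is the "position marker" problem above.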

> Also if you really want O(1) random access, you could put an auxiliary
> table into long strings, giving the byte offset of every 256th codepoint
> or something like that.  Then you'd go to the nearest table entry and
> scan from there.  This would usually be in-cache scanning so quite fast.
> Or use the related representation of "ropes" which are also very easy to
> concatenate if they can be nested.  Erlang does something like that
> with what it calls "binaries".

Yes, which gives two-level indexing (first find the strand, then the
character), and that's going to play pretty badly with CPU caches. I'd
be curious to know how an alternate Python with that implementation
would actually perform.
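
For reference, a rough sketch of the offset-table variant, assuming a
UTF-8 buffer with a checkpoint at every Nth code point. IndexedUTF8 and
its stride parameter are made up for illustration; this is not how
CPython stores strings, and it says nothing about real-world speed:

```python
class IndexedUTF8:
    """Hypothetical UTF-8 string with a byte-offset checkpoint every `stride` code points."""

    def __init__(self, text, stride=256):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.checkpoints = []  # byte offsets of code points 0, stride, 2*stride, ...
        count = 0
        for offset, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # Negative indices are not supported in this sketch.
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Level 1: jump to the nearest checkpoint at or before the index.
        pos = self.checkpoints[index // self.stride]
        # Level 2: scan forward over at most stride - 1 code points.
        for _ in range(index % self.stride):
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:
                pos += 1
        # Find the end of the code point that starts at pos.
        end = pos + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")
```

Every lookup is a table read followed by a short linear scan, so the
cost is bounded by the stride rather than being truly O(1). A rope
would have the same two-step shape, with the strand lookup taking the
place of the checkpoint table.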

ChrisA