PEP 393 vs UTF-8 Everywhere

Sat Jan 21 07:45:46 EST 2017

On 2017-01-21 11:58, Chris Angelico wrote:
> So, how could you implement this function? The current
> implementation maintains an index - an integer position through the
> string. It repeatedly requests the next character as string[idx],
> and can also slice the string (to check for keywords like "true")
> or use a regex (to check for numbers). Everything's clean, but it's
> lots of indexing.

But in these parsing cases, the indexes all originate from stepping
through the string from the beginning and processing it one code
point at a time.  Even this is a bit of an oddity, especially once you
start taking combining characters into consideration and need to
process them with the preceding character(s).  So while you may be
doing indexing, those indexes usually stem from having walked to that
point, not arbitrarily picking some offset.
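
To make the combining-character point concrete, here is a rough sketch
in Python.  The name clusters() is made up for illustration, and the
rule it uses (a base code point plus any following code points with a
nonzero canonical combining class) is a deliberate simplification of
real grapheme segmentation.  Every index it yields comes from having
walked the string up to that point:

```python
import unicodedata


def clusters(text):
    """Yield (start_index, cluster) pairs from a single pass over text.

    Simplifying assumption: a cluster is one base code point followed
    by any code points whose canonical combining class is nonzero.
    Full grapheme segmentation has more rules than this.
    """
    start = 0
    for i, ch in enumerate(text):
        # A non-combining code point starts a new cluster, so emit
        # the cluster that has been accumulating since `start`.
        if i > start and unicodedata.combining(ch) == 0:
            yield start, text[start:i]
            start = i
    # Emit whatever is left once the walk reaches the end.
    if start < len(text):
        yield start, text[start:]
```

For example, list(clusters("e\u0301a")) gives [(0, "e\u0301"), (2, "a")]:
the index 2 is only meaningful because the loop already passed over the
two code points before it.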

You allude to this yourself:

> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing. Every string
> parse operation would have to be reimplemented this way, lest it
> perform abysmally on large strings. It'd mean some sort of magic
> "thing" that probably has a reference to the original string, so
> you don't get the progressive RAM refunds that slicing gives, and
> you'd still have to deal with lots of the other consequences. It's
> probably doable, but it would be a lot of pain.

but I'm hard-pressed to come up with any use case where direct
indexing into a (non-byte) string makes sense unless you've already
processed/searched up to that point and can use a recorded index
from that processing/search.
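
A minimal sketch of the "recorded index" pattern I have in mind, again
in Python.  The helper value_after() and its "key: value" line format
are hypothetical, chosen only to keep the example short.  Both offsets
used in the final slice are results of earlier searches, not numbers
picked out of thin air:

```python
def value_after(text, key):
    """Return the rest of the line following the first `key`, or None.

    Both slice bounds come from str.find(), i.e. from scanning the
    string, rather than from an arbitrary precomputed offset.
    """
    pos = text.find(key)
    if pos == -1:
        return None
    start = pos + len(key)          # index recorded from the search
    end = text.find("\n", start)    # second search resumes from there
    if end == -1:
        end = len(text)
    return text[start:end]
```

With text = "name: tim\nlang: python\n", value_after(text, "lang: ")
returns "python".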

Can you provide real-world examples of "I need character 2832 from
this string of Unicode text, but I never had to scan to that point
linearly from the beginning/end of the string"?

-tkc