Unicode and Python - how often do you index strings?

Tue Jun 3 22:13:45 EDT 2014

On Wed, Jun 4, 2014 at 11:18 AM, Roy Smith <roy at panix.com> wrote:
> In article <mailman.10656.1401842403.18130.python-list at python.org>,
>  Chris Angelico <rosuav at gmail.com> wrote:
>
>> A current discussion regarding Python's Unicode support centres (or
>> centers, depending on how close you are to the cent[er]{2} of the
>> universe)
>
> <sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you?  The
> pattern you wrote also matches centee and centrr.</sarcasm>

Maybe there's someone who spells it that way! Let's not be excluding
people. That'd be rude.

>> around one critical question: Is string indexing common?
>
> Not in our code.  I've got 80008 non-blank lines of Python (2.7) source
> handy.  I tried a few heuristics to find patterns which might be string
> indexing.
>
> $ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'
>
> and then looked them over manually.  I see this pattern a bunch of times
> (in a single-use script):
>
> data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

Slicing is a form of indexing too, although in this case (slicing from
the front) it could be implemented on top of UTF-8 without much
problem.

> withhyphen = number if '-' in number else (number[:-2] + '-' +
> number[-2:]) # big assumption here

This *definitely* counts; if strings were represented internally in
UTF-8, this would involve two scans (although a smart implementation
could probably count backward rather than forward). By the way, any
time you slice up to the third from the end, you win two extra awesome
points, just for putting [:-3] into your code and having it mean
something. But I digress.

> Anyway, there's a bunch more, but the bottom line is that in our code,
> indexing into a string (at least explicitly in application source code)
> is a pretty rare thing.

Thanks. Of course, the pattern you searched for is looking only for
literals; it's a bit harder to find cases where the index (or slice
position) comes from a variable or expression, and those situations
are also rather harder to optimize (the MD5 prefix is clearly better
scanned from the front, the number tail is clearly better scanned from
the back - but with a variable?).

ChrisA