Unicode and Python - how often do you index strings?

Wed Jun 4 06:10:41 EDT 2014

Mark Lawrence wrote:

> On 04/06/2014 01:39, Chris Angelico wrote:
>> A current discussion regarding Python's Unicode support centres (or
>> centers, depending on how close you are to the cent[er]{2} of the
>> universe) around one critical question: Is string indexing common?
>>
>> Python strings can be indexed with integers to produce characters
>> (strings of length 1). They can also be iterated over from beginning
>> to end. Lots of operations can be built on either one of those two
>> primitives; the question is, how much can NOT be implemented
>> efficiently over iteration, and MUST use indexing? Theories are great,
>> but solid use-cases are better - ideally, examples from actual
>> production code (actual code optional).
>>
>> I know the collective experience of python-list can't fail to bring up
>> a few solid examples here :)
>>
>> Thanks in advance, all!!
>>
>> ChrisA
>>
> 
> Single characters quite often, iteration rarely if ever, slicing all the
> time, but does that last one count?

The indices used for slicing typically don't come out of nowhere. A simple 
example would be

def strip_prefix(text, prefix):
    if text.startswith(prefix):
        text = text[len(prefix):] 
    return text

If both prefix and text use UTF-8 internally, the byte offset of the slice 
is already known. The question is then how we can preserve that information.
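
To make the point concrete, here is a minimal sketch that works on explicitly encoded bytes objects. The name strip_prefix_utf8 is made up for illustration; it only mimics what a UTF-8 based str could do internally.

```python
def strip_prefix_utf8(text: bytes, prefix: bytes) -> bytes:
    """Strip prefix from text; both are assumed to be valid UTF-8."""
    if text.startswith(prefix):
        # len(prefix) is a byte count here, so the slice position is
        # known without counting characters.
        text = text[len(prefix):]
    return text


text = "αλφα123".encode("utf-8")
prefix = "αλ".encode("utf-8")   # 2 characters, 4 bytes
result = strip_prefix_utf8(text, prefix).decode("utf-8")
print(result)
```

Because the prefix is itself complete UTF-8, the cut always lands on a character boundary.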

The first approach that comes to mind is an int subtype:

>>> for i, c in enumerate("123αλφα"):
...     print(i, byteoffset(i), c)
... 
0 0 1
1 1 2
2 2 3
3 3 α
4 5 λ
5 7 φ
6 9 α

This would work in the strip_prefix() example, but would lead to data 
corruption in most other cases unless each offset were limited to a specific 
string -- in which case it would no longer work with strip_prefix().
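
A rough sketch of both halves of that problem, using a hypothetical generator byte_offsets() instead of an actual int subtype: it reproduces the offsets listed above, then shows what happens when one of them is applied to a different string.

```python
def byte_offsets(text):
    """Yield (index, UTF-8 byte offset, character) for each character."""
    offset = 0
    for index, char in enumerate(text):
        yield index, offset, char
        offset += len(char.encode("utf-8"))


for index, offset, char in byte_offsets("123αλφα"):
    print(index, offset, char)

# Index 4 maps to byte offset 5 in "123αλφα". Reusing that byte offset
# on another string lands in the middle of a two-byte character.
other = "αλφα123".encode("utf-8")
try:
    other[5:].decode("utf-8")
except UnicodeDecodeError:
    print("byte offset 5 is not a character boundary in the other string")
```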

So a new interface would be needed. My second try, an object with two byte 
offsets linked to a specific string:

>>> span("foobar").startswith("oob")
>>> p = span("foobar").startswith("foo")
>>> p.replace("baz")
'bazbar'
>>> p.before()
''
>>> p.after()
'bar'
>>> span("foo bar baz").find("bar").replace("spam")
'foo spam baz'

I have no idea if that could work out...
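
For anyone who wants to play with the idea, here is a pure-Python sketch of such a span object. It is only an assumption of how the interface might behave: it stores character offsets rather than byte offsets, since str does not expose its internal representation, and a failed match simply returns None.

```python
class span:
    """A (start, stop) range tied to one specific string."""

    def __init__(self, text, start=0, stop=None):
        self.text = text
        self.start = start
        self.stop = len(text) if stop is None else stop

    def startswith(self, prefix):
        # Return a sub-span covering the prefix, or None if it doesn't match.
        if self.text.startswith(prefix, self.start, self.stop):
            return span(self.text, self.start, self.start + len(prefix))
        return None

    def find(self, sub):
        # Return a sub-span covering the first occurrence, or None.
        pos = self.text.find(sub, self.start, self.stop)
        if pos < 0:
            return None
        return span(self.text, pos, pos + len(sub))

    def replace(self, new):
        # Build a new string with the spanned part swapped out.
        return self.text[:self.start] + new + self.text[self.stop:]

    def before(self):
        return self.text[:self.start]

    def after(self):
        return self.text[self.stop:]


p = span("foobar").startswith("foo")
print(p.replace("baz"), repr(p.before()), p.after())
```

The offsets never leave the object that knows which string they belong to, which is what avoids the mix-up of the int subtype approach.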