[Python-ideas] string codes & substring equality
spir
denis.spir at gmail.com
Thu Nov 28 13:39:39 CET 2013
On 11/28/2013 07:05 AM, Chris Angelico wrote:
> Sure, but the point is still there. I picked up an extreme example by
> pointing to PHP, but it's still the same thing: the startswith
> function, given more parameters, is effectively equivalent to slicing
> and comparing.
Rather, it is equivalent to comparing an interval without slicing. That's the
whole point.
* This is a semantic (conceptual) difference, because the operation is about
comparing, not slicing.
* This is a performance difference, relevant in this case because such comparisons
form the core of a scanning/parsing process; every higher-level matching pattern
is a composition of such low-level substring comparisons and a small set of
operations on codes. For individual, isolated operations, I would not care.
In other words, startswith with a start index already does the right thing;
unfortunately, its name does not say so; instead it misleads (me, at least).
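To make the point concrete, here is a small sketch of the two equivalent tests a
scanner might perform at position i; the slice version builds a throwaway string
object on every probe, while startswith with a start index compares in place
(names here are illustrative, not from any real scanner):

```python
text = "spam eggs spam"
token = "spam"
i = 10

# Slice-and-compare: allocates a new string of len(token) on each call.
match_by_slice = (text[i:i + len(token)] == token)

# startswith with a start index: same answer, no intermediate slice.
match_by_startswith = text.startswith(token, i)

assert match_by_slice == match_by_startswith
assert match_by_startswith
```

In a tight scanning loop, avoiding that per-probe allocation is exactly the
performance difference described above.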
> What is gained by having the method do both jobs in one
> wrapper? In this case, the answer might be performance, or it might be
> readability, and both can be argued. But it's certainly not a glaring
> hole; if startswith could ONLY check the beginning of a string, the
> push to _add_ this feature would be quite weak.
Yes.
> >Especially for Unicode, where a character isn't a byte, but an abstract code point that can be represented as at least three different variable-length sequences, taking up to 6 bytes.
> No, a character is simply an integer. How it's represented is
> immaterial. The easiest representation in Python is a straight int,
> the easiest in C is probably also an int (32-bit; if it's 64-bit, you
> waste 40-odd bits, but it's still easiest); the variable length byte
> representations are for transmission/storage, not for manipulation.
Right. Except that the representation of characters properly speaking (rather than
in the weird and polysemous Unicode sense) is a far more complicated issue, as you
certainly know. Otherwise, many other languages would probably have a decoded
representation for textual data as a code string, like Python has. But this
representation, intermediate between a byte string and a character string, is only
the starting point for solving issues of character representation. To get a
string of characters, in both the everyday and the traditional computing senses,
one then needs to group codes into character "piles", normalise them (NFD, to
avoid losing information), then sort the codes inside these piles. At this cost,
one has a one-to-one string of character representations.
I did this once (for and in the D language). It can be made efficient (2-3
times the cost of decoding), but it remains a big computing task.
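The grouping step described above can be sketched in a few lines of Python using
the standard unicodedata module. This is a minimal illustration, not full
grapheme segmentation (UAX #29 covers many more cases), and the function name
`graphemes` is mine:

```python
import unicodedata

def graphemes(s):
    """Split a string into base-code + combining-mark 'piles'.

    Minimal sketch: NFD-normalise, then attach each combining code
    to the preceding base code. Real grapheme-cluster segmentation
    (UAX #29) handles many more cases than this.
    """
    s = unicodedata.normalize("NFD", s)
    piles = []
    for code in s:
        if piles and unicodedata.combining(code):
            piles[-1] += code   # combining mark: append to current pile
        else:
            piles.append(code)  # base code: start a new pile
    return piles

# Whether the input was precomposed or decomposed, the piles come out
# the same, which is the one-to-one property mentioned above.
assert graphemes("\u0062\u00EF\u0062\u00EE") == \
       ["b", "i\u0308", "b", "i\u0302"]
```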
Some of the issues can be illustrated by:
s1 = "\u0062\u0069\u0308\u0062\u0069\u0302" # primary, decomposed repr of "bïbî"
s2 = "\u0062\u00EF\u0062\u00EE" # precomposed repr of "bïbî"
print(s1, s2) # bïbî bïbî -- all right!
assert(s1.find("i") == 1) # incorrect:
# there is here no representation of the character "i",
# but a base code, part of an actual character representation
assert(s1.find("ï") == -1) # incorrect:
# "\u0069\u0308" is the primary Unicode representation of "ï"
assert(s1.find("\u0069\u0308") == 1) # correct:
# (no comment)
assert(s1.find("\u00EF") == -1) # incorrect:
# this is another, precomposed repr of "ï"
assert(s2.find("i") == -1) # correct:
# no character "i" here
assert(s2.find("\u00EF") == 1) # correct:
# (no comment)
assert(s2.find("ï") == 1) # correct:
# there is a precomposed repr of "ï"
assert(s2.find("\u0069\u0308") == -1) # incorrect:
# "\u0069\u0308" is the primary Unicode representation of "ï"
assert(s2.find("\u0308") == -1) # problematic:
# how do I know there is here a char with an umlaut '¨' ?
# see also https://en.wikipedia.org/wiki/Unicode_equivalence
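One way to make the searches above agree across representations is to normalise
both haystack and needle to the same form before comparing. The helper name
`nfind` below is hypothetical, just to show the idea; note it does not fix the
first problem (a bare "i" still matches the base code of "ï"):

```python
import unicodedata

s1 = "\u0062\u0069\u0308\u0062\u0069\u0302"  # decomposed "bïbî"
s2 = "\u0062\u00EF\u0062\u00EE"              # precomposed "bïbî"

def nfind(haystack, needle, form="NFD"):
    # Hypothetical helper: normalise both sides to the same form,
    # so either representation of "ï" is found in either string.
    return unicodedata.normalize(form, haystack).find(
        unicodedata.normalize(form, needle))

assert nfind(s1, "\u00EF") == 1          # precomposed needle now matches
assert nfind(s2, "\u0069\u0308") == 1    # decomposed needle now matches
```

Normalisation makes the two representations compare equal, but only grouping
codes into character piles (as sketched earlier in this thread) tells you where
one character ends and the next begins.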
Denis