[Python-ideas] string codes & substring equality
spir
denis.spir at gmail.com
Thu Nov 28 13:39:39 CET 2013
On 11/28/2013 07:05 AM, Chris Angelico wrote:
> Sure, but the point is still there. I picked up an extreme example by
> pointing to PHP, but it's still the same thing: the startswith
> function, given more parameters, is effectively equivalent to slicing
> and comparing.
Rather, it is equivalent to comparing an interval without slicing. That's the
whole point.
* This is a semantic (conceptual) difference, because the operation is about
comparing, not slicing.
* This is a performance difference, relevant in this case because such comparisons
form the core of a scanning/parsing process; every higher-level matching pattern
is a composition of such low-level substring comparisons and a small set of
operations on codes. For individual, isolated operations, I would not care.
In other words, startswith with a start index already does the right thing;
unfortunately, its name does not say so; instead it misleads (me, at least).
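To make the point concrete, here is a small sketch of the two equivalent tests a
scanner might perform at position i; the slice version builds a throwaway string
object on every probe, while startswith with a start index compares in place
(names here are illustrative, not from any real scanner):

```python
text = "spam eggs spam"
token = "spam"
i = 10

# Slice-and-compare: allocates a new string of len(token) on each call.
match_by_slice = (text[i:i + len(token)] == token)

# startswith with a start index: same answer, no intermediate slice.
match_by_startswith = text.startswith(token, i)

assert match_by_slice == match_by_startswith
assert match_by_startswith
```

In a tight scanning loop, avoiding that per-probe allocation is exactly the
performance difference described above.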
> What is gained by having the method do both jobs in one
> wrapper? In this case, the answer might be performance, or it might be
> readability, and both can be argued. But it's certainly not a glaring
> hole; if startswith could ONLY check the beginning of a string, the
> push to _add_ this feature would be quite weak.
Yes.
> >Especially for Unicode, where a character isn't a byte, but an abstract code point that can be represented as at least three different variable-length sequences, taking up to 6 bytes.
> No, a character is simply an integer. How it's represented is
> immaterial. The easiest representation in Python is a straight int,
> the easiest in C is probably also an int (32-bit; if it's 64-bit, you
> waste 40-odd bits, but it's still easiest); the variable length byte
> representations are for transmission/storage, not for manipulation.
Right. Except that the representation of characters properly speaking (rather than
in the weird and polysemous Unicode sense) is a far more complicated issue, as you
certainly know. Otherwise, many other languages would probably have a decoded
representation for textual data as a code string, like Python has. But this
representation, intermediate between a byte string and a character string, is only
the starting point for solving issues of character representation. To get a
string of characters, in both the everyday and the traditional computing senses,
one then needs to group codes into character "piles", normalise them (NFD, to
avoid losing information), then sort the codes inside these piles. At this cost,
one has a one-to-one string of character representations.
I did this once (for and in the D language). It can be made efficient (2-3
times the cost of decoding), but it remains a big computing task.
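The grouping step described above can be sketched in a few lines of Python using
the standard unicodedata module. This is a minimal illustration, not full
grapheme segmentation (UAX #29 covers many more cases), and the function name
`graphemes` is mine:

```python
import unicodedata

def graphemes(s):
    """Split a string into base-code + combining-mark 'piles'.

    Minimal sketch: NFD-normalise, then attach each combining code
    to the preceding base code. Real grapheme-cluster segmentation
    (UAX #29) handles many more cases than this.
    """
    s = unicodedata.normalize("NFD", s)
    piles = []
    for code in s:
        if piles and unicodedata.combining(code):
            piles[-1] += code   # combining mark: append to current pile
        else:
            piles.append(code)  # base code: start a new pile
    return piles

# Whether the input was precomposed or decomposed, the piles come out
# the same, which is the one-to-one property mentioned above.
assert graphemes("\u0062\u00EF\u0062\u00EE") == \
       ["b", "i\u0308", "b", "i\u0302"]
```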
Some of the issues can be illustrated by:
s1 = "\u0062\u0069\u0308\u0062\u0069\u0302" # primary, decomposed repr of "bïbî"
s2 = "\u0062\u00EF\u0062\u00EE" # precomposed repr of "bïbî"
print(s1, s2) # bïbî bïbî -- all right!
assert(s1.find("i") == 1) # incorrect:
# there is here no representation of the character "i",
# but a base code, part of an actual character representation
assert(s1.find("ï") == -1) # incorrect:
# "\u0069\u0308" is the primary Unicode representation of "ï"
assert(s1.find("\u0069\u0308") == 1) # correct:
# (no comment)
assert(s1.find("\u00EF") == -1) # incorrect:
# this is another, precomposed repr of "ï"
assert(s2.find("i") == -1) # correct:
# no character "i" here
assert(s2.find("\u00EF") == 1) # correct:
# (no comment)
assert(s2.find("ï") == 1) # correct:
# there is a precomposed repr of "ï"
assert(s2.find("\u0069\u0308") == -1) # incorrect:
# "\u0069\u0308" is the primary Unicode representation of "ï"
assert(s2.find("\u0308") == -1) # problematic:
# how do I know there is here a char with an umlaut '¨' ?
# see also https://en.wikipedia.org/wiki/Unicode_equivalence
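One way to make the searches above agree across representations is to normalise
both haystack and needle to the same form before comparing. The helper name
`nfind` below is hypothetical, just to show the idea; note it does not fix the
first problem (a bare "i" still matches the base code of "ï"):

```python
import unicodedata

s1 = "\u0062\u0069\u0308\u0062\u0069\u0302"  # decomposed "bïbî"
s2 = "\u0062\u00EF\u0062\u00EE"              # precomposed "bïbî"

def nfind(haystack, needle, form="NFD"):
    # Hypothetical helper: normalise both sides to the same form,
    # so either representation of "ï" is found in either string.
    return unicodedata.normalize(form, haystack).find(
        unicodedata.normalize(form, needle))

assert nfind(s1, "\u00EF") == 1          # precomposed needle now matches
assert nfind(s2, "\u0069\u0308") == 1    # decomposed needle now matches
```

Normalisation makes the two representations compare equal, but only grouping
codes into character piles (as sketched earlier in this thread) tells you where
one character ends and the next begins.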
Denis