[Python-ideas] string codes & substring equality

Thu Nov 28 14:22:21 CET 2013

On 28 November 2013 12:39, spir <denis.spir at gmail.com> wrote:
> Right. Except the representation of characters properly speaking (rather
> than in the weird and polysemic Unicode sense) is a far more complicated
> issue, as you certainly know. Otherwise, many other languages would probably have
> a decoded representation for textual data as a code string, like Python has.
> But this representation, intermediate between byte string and character
> string, is only the starting point of solving issues of character
> representation. To have a string of chars, in both the everyday and the
> traditional computing senses, one then needs to group codes into character
> "piles", normalise them (NFD, to avoid losing info), then sort the codes
> inside each pile. At this cost, one gets a one-to-one (bijective) string of
> char reprs. I did this once (for, and in, the D language). It can be made
> efficient (2-3 times the cost of decoding), but it remains a big computing
> task.
>
> Some of the issues can be illustrated by:
>
> s1 = "\u0062\u0069\u0308\u0062\u0069\u0302" # primary, decomposed repr of "bïbî"
> s2 = "\u0062\u00EF\u0062\u00EE"             # precomposed repr of "bïbî"
> print(s1, s2)                               # bïbî bïbî -- all right!
>
>
> assert s1.find("i") == 1                    # passes, but the result is incorrect:
> # there is no representation of the character "i" here,
> # only a base code (a base mark), part of an actual char representation
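
The grouping step described in the quoted text can be approximated with the
standard library. This is only a minimal sketch, not the D implementation
mentioned above: the helper name `piles` is hypothetical, and it assumes that
a "pile" is a base code followed by any codes with a non-zero canonical
combining class. That is a simplification of full grapheme cluster
segmentation. NFD normalisation already puts the marks inside each pile into
canonical order.

```python
import unicodedata


def piles(text):
    """Group code points into "piles": a base code plus its combining marks.

    Hypothetical helper. A code point counts as a mark when
    unicodedata.combining() returns a non-zero combining class.
    """
    # NFD decomposes precomposed characters and orders marks canonically.
    decomposed = unicodedata.normalize("NFD", text)
    result = []
    for code in decomposed:
        if result and unicodedata.combining(code):
            # Attach the mark to the pile that is currently open.
            result[-1] += code
        else:
            # A base code starts a new pile.
            result.append(code)
    return result


s1 = "\u0062\u0069\u0308\u0062\u0069\u0302"  # decomposed repr
s2 = "\u0062\u00EF\u0062\u00EE"              # precomposed repr

# Both representations yield the same four piles.
assert piles(s1) == piles(s2) == ["b", "i\u0308", "b", "i\u0302"]
# Searching at pile level no longer finds a bare "i".
assert "i" not in piles(s1)
```

Searching the list of piles instead of the raw code string avoids the false
match on the base code.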

My eyes glaze over at this level of Unicode, but shouldn't you be
looking at the stuff in the unicodedata module? And possibly even some
third-party Unicode handling modules (if they exist)? I didn't
think that Python handled the fancier levels of Unicode normalisation,
collation, etc, as part of the native string type. Or ever claimed to.
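
For the equality part at least, something like the following seems to be what
the unicodedata module offers. This is only a sketch: `same_text` is a
hypothetical helper name, and it covers canonical equivalence only, not
collation or grouping codes into characters.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


s1 = "\u0062\u0069\u0308\u0062\u0069\u0302"  # decomposed repr
s2 = "\u0062\u00EF\u0062\u00EE"              # precomposed repr

assert s1 != s2                  # the native str comparison is code by code
assert same_text(s1, s2)         # equal once both are normalised
assert unicodedata.normalize("NFD", s2) == s1
```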

Paul