Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Fri Jul 14 05:32:35 EDT 2017


On Fri, Jul 14, 2017 at 6:53 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> On Fri, Jul 14, 2017 at 6:15 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>> Furthermore, you only dismissed my question about
>>>
>>>    len(text)
>>>
>>> What about
>>>
>>>    text[-1]
>>>    re.match("a.c", text)
>>
>> The considerations and concerns in the second half of my paragraph -
>> the bit you didn't quote - directly address these two.
>
> I guess you refer to:
>
>    These kinds of linguistic considerations shouldn't be codified into
>    the core of the language.

No, I don't. I refer to the second half of the paragraph whose first
half you quoted.

> Then, why bother with Unicode to begin with? Why not just use bytes?
> After all, Python3's strings have the very same pitfalls:
>
>   - you don't know the length of a text in characters
>
>   - chr(n) doesn't return a character
>
>   - you can't easily find the 7th character in a piece of text

First you have to define "character". There are enough different
definitions of "character" (for the purposes of
counting/iteration/subscripting) that at least some of them have to be
separate functions or methods.
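
To make that concrete, here is a rough sketch of three different
answers to "how long is this text?". The sample string and the
approx_clusters() helper are my own illustrations, not anything in the
stdlib, and the helper is only an approximation: it counts code points
that are not combining marks, which is NOT full UAX #29 grapheme
cluster segmentation (it ignores ZWJ sequences, Hangul jamo, regional
indicators and so on).

```python
import unicodedata

# "café" spelled as 'e' followed by U+0301 COMBINING ACUTE ACCENT
text = "cafe\u0301"

# Definition 1: code points. This is what len() on a str reports.
code_points = len(text)

# Definition 2: bytes in one particular encoding (UTF-8 here).
utf8_bytes = len(text.encode("utf-8"))

def approx_clusters(s):
    """Hypothetical helper: count code points that are not combining marks.

    A crude stand-in for "user-perceived characters"; not UAX #29.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(code_points, utf8_bytes, approx_clusters(text))
```

Three definitions, three different numbers for the same string, which
is why they can't all be len().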

>   - you can't compare the equality of two pieces of text
>
>   - you can't use a piece of text as a reliable dict key

(Dict key usage is defined in terms of equality, so these two are the
same concern.)

Yes, you can. For most purposes, textual equality should be defined in
terms of NFC or NFD normalization. Python already gives you that, via
unicodedata.normalize(). You could argue that a string should always be
stored NFC (or NFD, take your pick), and then the equality operator
would handle this; but I'm not sure the benefit is worth it.
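
A minimal sketch of normalization-based equality, assuming NFC is the
form you picked. The nfc_equal() name is just something I made up for
the example; unicodedata.normalize() is the real stdlib call.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The built-in == compares code point sequences, so these differ...
print(composed == decomposed)
# ...but they are canonically equivalent, so this comparison succeeds.
print(nfc_equal(composed, decomposed))
```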

And you can't define equality by whether two strings would display
identically, because then you lose semantic information (for instance,
the difference between U+0020 and U+00A0, or between U+2004 and a pair
of U+2006, or between U+004B and U+041A), not to mention the way that
some fonts introduce confusing similarities that other fonts don't.
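
For illustration, here is a sketch showing that NFC keeps those
look-alikes distinct, which is what you want if the semantic
difference matters. The variable names are mine; the code points are
the ones mentioned above.

```python
import unicodedata

space, nbsp = "\u0020", "\u00a0"
latin_k, cyrillic_ka = "\u004b", "\u041a"

def nfc(s):
    return unicodedata.normalize("NFC", s)

# Similar (or identical) on screen, but different characters by name.
for ch in (space, nbsp, latin_k, cyrillic_ka):
    print("U+%04X" % ord(ch), unicodedata.name(ch))

# Canonical normalization does not merge them.
print(nfc(space) == nfc(nbsp))
print(nfc(latin_k) == nfc(cyrillic_ka))
```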

If you're trying to use strings as identifiers in any way (say, file
names, or document lookup references), using the NFC/NFD normalized
form of the string should be sufficient.
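
Something like this, as a sketch only. The ident() helper, the docs
dict and the file names are all hypothetical; the point is just that
you normalize once on the way in and once on lookup.

```python
import unicodedata

def ident(s):
    """Hypothetical helper: canonical (NFC) form to use as a lookup key."""
    return unicodedata.normalize("NFC", s)

docs = {}

# Stored under a name typed with combining accents...
docs[ident("re\u0301sume\u0301.txt")] = "first"

# ...and found again under the precomposed spelling.
found = docs.get(ident("r\u00e9sum\u00e9.txt"))
print(found)
```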

ChrisA


