Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Fri Jul 14 02:30:30 EDT 2017


Ben Finney <ben+python at benfinney.id.au>:

> Steve D'Aprano <steve+python at pearwood.info> writes:
>> From time to time, people discover that Python's string algorithms
>> work on code points rather than "real characters", which can lead to
>> anomalies
>
> [...]
>>>> [unicodedata.name(c) for c in reversed(s1)]
> ['LATIN SMALL LETTER X',
>  'LATIN SMALL LETTER E',
>  'LATIN SMALL LETTER A WITH DIAERESIS',
>  'LATIN SMALL LETTER X']
>>>> "".join(reversed(s1))
> 'xeäx'
>>>> [unicodedata.name(c) for c in reversed(s2)]
> ['LATIN SMALL LETTER X',
>  'LATIN SMALL LETTER E',
>  'COMBINING DIAERESIS',
>  'LATIN SMALL LETTER A',
>  'LATIN SMALL LETTER X']
>>>> "".join(reversed(s2))
> 'xëax'
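
The quoted session doesn't show how s1 and s2 were built, but the
unicodedata names pin them down: s1 uses the precomposed a-with-diaeresis,
s2 the decomposed pair. A reconstruction (the variable names s1/s2 are
from the quote; the literals are inferred from the listed code points):

```python
import unicodedata

s1 = "x\u00e4ex"   # 'xäex': 4 code points, precomposed ä
s2 = "xa\u0308ex"  # 'xäex': 5 code points, a + COMBINING DIAERESIS

print(len(s1), len(s2))                        # 4 5 — same text, different lengths
print(s1 == s2)                                # False
print(unicodedata.normalize("NFC", s2) == s1)  # True — equal after normalization
print("".join(reversed(s1)))                   # 'xeäx'
print("".join(reversed(s2)))                   # 'xëax' — the diaeresis jumps to the e
```

Normalization papers over this particular pair, but not over clusters
with no precomposed form, so reversal by code point stays broken in
general.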

Unicode was supposed to get us out of the 8-bit locale hole. Now it
seems the Unicode hole is far deeper and we haven't reached the bottom
of it yet. I wonder if the hole even has a bottom.

We now have:

 - an encoding: a sequence of bytes

 - a string: a sequence of integers (code points)

 - "a snippet of text": a sequence of characters

Assuming "a sequence of characters" is the final word, and Python wants
to be involved in that business, one must question the usefulness of
strings, which are neither here nor there.

When people use Unicode, they are expecting to be able to deal in real
characters. I would expect:

   len(text)               to give me the length in characters
   text[-1]                to evaluate to the last character
   re.match("a.c", text)   to match a character between a and c

So the question is: should we have a third type for text, or should the
semantics of strings be changed to be based on characters?
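
A character-based view can be approximated on top of today's strings, at
least for plain combining marks. This is only a rough sketch — real
grapheme segmentation is defined by Unicode UAX #29 and also covers
Hangul jamo, ZWJ emoji sequences, regional indicators and more — but it
shows what len, indexing and reversal would look like:

```python
import unicodedata

def graphemes(s):
    """Naive clustering: attach each combining mark (combining
    class > 0) to the preceding base code point. Not UAX #29."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s2 = "xa\u0308ex"                         # 'xäex', decomposed
print(len(graphemes(s2)))                 # 4 — length in "real characters"
print(graphemes(s2)[-1])                  # 'x' — the last character
print("".join(reversed(graphemes(s2))))   # 'xeäx' — diaeresis stays on the a
```

A real implementation would want the full UAX #29 rules (third-party
packages such as the regex module expose them), but the point stands:
the grouping is computable from the string, it just isn't what str
gives you.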


Marko
