Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Tue Jul 18 14:31:21 EDT 2017


Chris Angelico <rosuav at gmail.com>:

> On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Yes. Also, not every letter can be normalized to a single codepoint so
>> NFC is not a way out. For example,
>>
>>     re.match("^[q̈]$", "q̈")
>>
>> returns None regardless of normalization.
>
> In what language or context would you actually want to do this?

I could have picked more realistic examples, such as Classical Greek or
Hebrew.

However, someone might actually use even "q̈" in a real setting. First of
all, it *is* a legal character. Secondly, people sometimes combine
characters in an ad-hoc fashion. Thirdly, remember the case of
Esperanto, which blessed the world with the letters

   ĉ ĝ ĥ ĵ ŝ ŭ

Esperanto's long history eventually earned those characters code points
of their own in Unicode. However, around the year 2000, it was still
commonplace to use all sorts of tricks to type them on the Internet:

   ch gh hh jh sh u

   ^c ^g ^h ^j ^s ^u

   cx gx hx jx sx ux

For all we know, someone somewhere might be cooking up a language that
depends on "q̈".
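
To make the difference concrete, here is a minimal sketch using only the
standard library. It assumes that "q̈" is spelled as "q" followed by
U+0308 COMBINING DIAERESIS and that "ĉ" starts out as "c" followed by
U+0302 COMBINING CIRCUMFLEX ACCENT. The helper names nfc_length and
matches_as_single_char are made up for this illustration.

```python
import re
import unicodedata

# Assumed spellings: a base letter followed by a combining mark.
q_diaeresis = "q\u0308"   # no precomposed code point exists for this one
c_circumflex = "c\u0302"  # NFC composes this to U+0109 (ĉ)


def nfc_length(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


def matches_as_single_char(text):
    """Normalize to NFC, then test the text against a one-item
    character class built from that same normalized text."""
    nfc = unicodedata.normalize("NFC", text)
    return re.match("^[" + nfc + "]$", nfc) is not None


print(nfc_length(c_circumflex), matches_as_single_char(c_circumflex))
print(nfc_length(q_diaeresis), matches_as_single_char(q_diaeresis))
```

The Esperanto letter collapses to one code point, so the character class
behaves as expected. "q̈" stays two code points even after NFC, so the
class contains two separate items and can match only one of them at a
time.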


Marko


