Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Tue Jul 18 14:46:21 EDT 2017


On Wed, Jul 19, 2017 at 4:31 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>> Yes. Also, not every letter can be normalized to a single codepoint so
>>> NFC is not a way out. For example,
>>>
>>>     re.match("^[q̈]$", "q̈")
>>>
>>> returns None regardless of normalization.
>>
>> In what language or context would you actually want to do this?
>
> I could have picked more realistic examples: Classic Greek or Hebrew,
> for example.
>
> However, someone might actually use even "q̈" in a real setting. First of
> all, it *is* a legal character. Secondly, people sometimes combine
> characters in an ad-hoc fashion. Thirdly, remember the case of
> Esperanto, which blessed the world with the letters
>
>    ĉ ĝ ĥ ĵ ŝ ŭ
>
> Esperanto's venerable history finally awarded those characters a
> code-point status in Unicode. However, around the year 2000, it was
> still commonplace to use all sorts of tricks to type them on the
> Internet:
>
>    ch gh hh jh sh u
>
>    ^c ^g ^h ^j ^s ^u
>
>    cx gx hx jx sx ux
>
> For all we know, someone somewhere might be cooking up a language that
> depends on "q̈".

Sure. And if they do, they'll have to contend with the fact that it's
going to be represented as multiple code points.
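That "q̈" still takes up more than one position in a str after
normalization can be checked with the standard unicodedata module. This
is only a small sketch; the variable names are made up for illustration,
and "e" plus an acute accent is included purely as a contrast case that
does have a precomposed form.

```python
import unicodedata

# "q" followed by U+0308 COMBINING DIAERESIS; Unicode has no precomposed form.
q_diaeresis = "q\u0308"
# "e" followed by U+0301 COMBINING ACUTE ACCENT; composes to U+00E9 under NFC.
e_acute = "e\u0301"

nfc_q = unicodedata.normalize("NFC", q_diaeresis)
nfc_e = unicodedata.normalize("NFC", e_acute)

# NFC leaves the q sequence as two code points but folds the e sequence into one.
print(len(nfc_q), [unicodedata.name(c) for c in nfc_q])
print(len(nfc_e), [unicodedata.name(c) for c in nfc_e])
```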

What I *think* you're asking for is for square brackets in a regex to
count combining characters with their preceding base character. That
would make a lot of sense, and would actually be a reasonable feature
to request. (Probably as an option, in case there's a backward
compatibility issue.)
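
Until something like that exists in the re module, one possible
workaround is to spell the "class" as an alternation of whole sequences
instead of square brackets. The following is a rough sketch, not a
proposal for how re should do it; cluster_class is a hypothetical helper
name invented here, and it assumes the caller lists every sequence
explicitly.

```python
import re

s = "q\u0308"  # "q" + COMBINING DIAERESIS

# Inside [...] the two code points are separate members, so the class
# matches exactly one of them and the anchored match on the pair fails.
assert re.match("^[q\u0308]$", s) is None
assert re.match("^[q\u0308]$", "q") is not None

def cluster_class(*clusters):
    """Hypothetical helper: build a non-capturing alternation that treats
    each given string (base character plus combining marks) as one unit."""
    # Try longer sequences first so "q\u0308" is preferred over a bare "q".
    alternatives = sorted(clusters, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(c) for c in alternatives) + ")"

pattern = re.compile("^" + cluster_class("q\u0308", "x") + "$")

print(bool(pattern.match(s)))    # the whole two-code-point sequence matches
print(bool(pattern.match("q")))  # a bare "q" is not one of the alternatives
```

The third-party regex module on PyPI also offers \X for matching a whole
grapheme cluster, which may be a better fit when the set of sequences is
not known in advance.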

ChrisA