Grapheme clusters, a.k.a. real characters

Random832 random832 at fastmail.com
Tue Jul 18 10:09:59 EDT 2017


On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote:
> What do you mean about regular expressions? You can use REs with
> normalized strings. And if you have any valid definition of "real
> character", you can use it equally on an NFC-normalized or
> NFD-normalized string than any other. They're just strings, you know.

I don't understand how normalization is supposed to help with this.
There are plenty of valid base-plus-combining sequences that have no
corresponding single NFC code point (to say nothing of the situation
with e.g. Indic languages).
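
To make that concrete, here is a minimal sketch using only the standard
unicodedata module. The sample strings are my own picks for
illustration: "e" plus a combining acute has a precomposed form, while
"x" plus the same accent does not.

```python
import unicodedata

# e + COMBINING ACUTE ACCENT: a precomposed form (U+00E9) exists.
composed = unicodedata.normalize("NFC", "e\u0301")

# x + COMBINING ACUTE ACCENT: Unicode has no precomposed "x with acute".
uncomposed = unicodedata.normalize("NFC", "x\u0301")

print(len(composed))    # 1: NFC folds the pair into a single code point
print(len(uncomposed))  # 2: the pair survives NFC as two code points
```

So even after NFC, one "real character" can still span several code
points, and anything that counts or slices by code point will split it.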

In principle, a viable solution for regex would probably be to add
character classes for base and combining characters; then
"[[:base:]][[:combining:]]*" could be used as a building block where
necessary.
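
As a rough sketch of what that building block could look like today:
the stdlib re module has no such classes, so the code below builds them
by hand from unicodedata. The names COMBINING, BASE, CLUSTER and
clusters() are hypothetical, and treating "general category M*" as
"combining" is my assumption. This is only an approximation; it is not
the full extended grapheme cluster algorithm from UAX #29 (no handling
of ZWJ emoji sequences, Hangul jamo, and so on).

```python
import re
import sys
import unicodedata


def _mark_ranges():
    """Yield (start, end) code point ranges whose category is Mn, Mc or Me."""
    start = None
    for cp in range(sys.maxunicode + 1):
        is_mark = unicodedata.category(chr(cp)).startswith("M")
        if is_mark and start is None:
            start = cp
        elif not is_mark and start is not None:
            yield start, cp - 1
            start = None
    if start is not None:
        yield start, sys.maxunicode


# Body of a character class covering every mark, written as \U escapes.
_MARKS = "".join(f"\\U{a:08x}-\\U{b:08x}" for a, b in _mark_ranges())

COMBINING = f"[{_MARKS}]"   # stand-in for [[:combining:]]
BASE = f"[^{_MARKS}]"       # stand-in for [[:base:]]: anything that is not a mark

# A base followed by any number of marks, or a run of stray leading marks.
CLUSTER = re.compile(f"{BASE}{COMBINING}*|{COMBINING}+")


def clusters(s):
    """Split s into base-plus-combining sequences."""
    return CLUSTER.findall(s)


print(clusters("x\u0301e\u0301a"))  # three items: x+acute, e+acute, a
```

Building the class scans every code point once at import time, which is
slow-ish but keeps the sketch independent of any third-party module.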


