Grapheme clusters, a.k.a.real characters

Marko Rauhamaa marko at pacujo.net
Tue Jul 18 13:01:10 EDT 2017


Chris Angelico <rosuav at gmail.com>:

> what you're more likely to want is "match the letter á", and you don't
> care whether it's represented as U+0061 U+0301 or as U+00E1. That's
> where Unicode normalization comes in.

Yes. Also, not every letter can be normalized to a single codepoint so
NFC is not a way out. For example,

    re.match("^[q̈]$", "q̈")

returns None regardless of normalization.


Marko



More information about the Python-list mailing list