Grapheme clusters, a.k.a. real characters

Steve D'Aprano steve+python at pearwood.info
Tue Jul 18 22:07:56 EDT 2017


On Wed, 19 Jul 2017 12:09 am, Random832 wrote:

> On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote:
>> What do you mean about regular expressions? You can use REs with
>> normalized strings. And if you have any valid definition of "real
>> character", you can use it equally on an NFC-normalized or
>> NFD-normalized string than any other. They're just strings, you know.
> 
> I don't understand how normalization is supposed to help with this. It's
> not like there aren't valid combinations that do not have a
> corresponding single NFC codepoint (to say nothing of the situation with
> e.g. Indic languages).

Normalisation helps. Suppose, for example, that you want to search for é. A
naive regular expression engine will only find the exact representation you or
your editor happened to use:

U+00E9 LATIN SMALL LETTER E WITH ACUTE

or 

U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT


but not both. By normalising, you ensure that both the text you are searching
and the regex you are searching for are in the same state: either composed to a
single code point, U+00E9, or decomposed to two code points, U+0065 U+0301, but
never one in one state and the other in the other.

For characters that have no canonical composed form, there's no problem: you
will always be searching for a decomposed character, a base character followed
by combining characters, so there is no discrepancy and it will just work.
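
Here is a minimal sketch of that idea using the standard library. The helper
name find_normalised is made up for illustration, and it assumes the pattern is
a plain literal: normalising a pattern that contains character classes or other
regex syntax may not do what you want.

```python
import re
import unicodedata

def find_normalised(pattern, text, form="NFC"):
    """Hypothetical helper: put pattern and text in the same form, then search."""
    pattern = unicodedata.normalize(form, pattern)
    text = unicodedata.normalize(form, text)
    return re.findall(pattern, text)

composed = "caf\u00e9"       # é as the single code point U+00E9
decomposed = "cafe\u0301"    # e followed by U+0301 COMBINING ACUTE ACCENT

# A naive search only finds the representation it was given.
print(re.findall("\u00e9", composed))     # one match
print(re.findall("\u00e9", decomposed))   # no match

# With both sides normalised, either representation is found.
print(find_normalised("\u00e9", decomposed))              # composed on both sides
print(find_normalised("e\u0301", composed, form="NFD"))   # decomposed on both sides
```

Whether you pick NFC or NFD matters less than using the same form on both
sides.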


> In principle probably a viable solution for regex would be to add
> character classes for base and combining characters, and then
> "[[:base:]][[:combining:]]*" can be used as a building block if
> necessary.

I don't know what that means.

Any code point (except for combining characters themselves) can be used as the
base, and the various kinds of combining characters have the Unicode category
property:

Mn (Mark, nonspacing)
Mc (Mark, spacing combining)
Me (Mark, enclosing)

If we're talking about combining accents and diacritics, the one we want is Mn:
U+0301 COMBINING ACUTE ACCENT, for instance, has category Mn.
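
To check which category a particular combining character has, you can ask
unicodedata. The second half is a rough sketch of the base-plus-combining
building block from the quoted suggestion. The function name split_on_marks is
made up for illustration, and this is only an approximation: it is not the full
grapheme cluster algorithm from the Unicode standard.

```python
import unicodedata

# One example from each of the three Mark categories, plus a second accent.
for ch in ("\u0301", "\u0308", "\u093e", "\u20dd"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

def split_on_marks(text):
    """Hypothetical helper: group each base code point with the marks after it."""
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me all start with "M".
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(split_on_marks("cafe\u0301"))
```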

But generally, we're not after "any old diacritic", we're after a specific one,
on a specific base.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



