Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Wed Jul 19 04:08:46 EDT 2017


On Wed, Jul 19, 2017 at 4:49 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> The *really* tricky part is if you receive a string from the user
> intended as a regular expression. If they provide
>
> [xyzã]
>
> as part of a regex, and you receive ã in denormalized form
>
> U+0061 LATIN SMALL LETTER A + U+0303 COMBINING TILDE
>
> you can't be sure that they actually intended:
>
> U+00E3 LATIN SMALL LETTER A WITH TILDE
>
> maybe they're smarter than you think and they actually do mean
>
> [xyza\N{COMBINING TILDE}] = (x|y|z|a|\N{COMBINING TILDE})

To be quite honest, I wouldn't care about that possibility. If I could
design regex semantics purely from an idealistic POV, I would say that
[xyzã], regardless of its normalization form, matches any of the four
characters "x", "y", "z", "ã".

Earlier I posted a suggestion that a folding function be used when
searching (for instance, it could case fold, NFKC normalize, etc.).
Unfortunately, this makes positional matching extremely tricky: if
normalization changes the number of code points in the string, you
have some fiddly work to do to map the match location back to the
original (pre-folding) string. That technique works well for simple
lookups (e.g. "find me all documents whose titles contain <this
string>", sketched below), but a regex does more than that.

As such, I am in favour of the regex engine defining a "character" as
a base character plus all subsequent combining characters, so that a
single dot matches the entire combined character, and square-bracketed
expressions have the same meaning whether you're NFC or NFD
normalized, or not normalized at all. However, that's the ideal
situation, and I'm not sure (a) whether it's even practical to do
that, and (b) how bad it would be in terms of backward compatibility.
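
The folding approach is easy to sketch for the simple-lookup case (a
sketch only; a proper Unicode caseless match is a bit more involved
than one casefold-plus-normalize pass). Note that any match position
found in the folded text has no direct relationship to a position in
the original:

import unicodedata

def fold(s):
    # Case fold, then NFKC-normalize, so that equivalent spellings
    # compare equal. Positions in fold(s) need not correspond to
    # positions in s, which is the problem for regex matching.
    return unicodedata.normalize("NFKC", s.casefold())

def contains(haystack, needle):
    return fold(needle) in fold(haystack)

print(contains("Pâté recipes", "pa\u0302te\u0301"))  # True (NFD input)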
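
And the "one dot per combined character" behaviour roughly
corresponds to \X in the third-party regex module, which matches an
extended grapheme cluster rather than a single code point:

import regex  # third-party; pip install regex

nfd = "a\u0303"  # one grapheme, two code points

print(regex.fullmatch(".", nfd))    # None: dot eats one code point
print(regex.fullmatch(r"\X", nfd))  # match: \X eats the grapheme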

ChrisA


