Grapheme clusters, a.k.a. real characters

Wed Jul 19 19:36:42 EDT 2017

On Wednesday, July 19, 2017 at 1:57:47 AM UTC-5, Steven D'Aprano wrote:
> On Wed, 19 Jul 2017 17:51:49 +1200, Gregory Ewing wrote:
> 
> > Chris Angelico wrote:
> >> Once you NFC or NFD normalize both strings, identical strings will
> >> generally have identical codepoints... You should then be able to use
> >> normal regular expressions to match correctly.
> > 
> > Except that if you want to match a set of characters,
> > you can't reliably use [...], you would have to write them out as
> > alternatives in case some of them take up more than one code point.
> 
> Good point!
> 
> A quibble -- there's no "in case" here, since you, the
> programmer, will always know whether they have a single
> code point form or not. If you're unsure, look it up, or
> call unicodedata.normalize().
> 
> (Yeah, right, like the average coder will remember to do this...)
> 
> Nevertheless, although it might be annoying and tricky,
> regexes *are* flexible enough to deal with this problem.
> After all, you can't use [th] to match "th" as a unit
> either, and regex character set notation [abcd] is
> logically equivalent to (a|b|c|d).
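
To make Gregory's point about [...] concrete, here is a rough sketch
of my own (not code from anyone in this thread). The sample
characters are arbitrary choices: "e" plus a combining acute accent
has a precomposed form, while "q" plus the same accent does not, so
it stays two code points even after NFC normalization.

```python
import re
import unicodedata

# "e" + U+0301 COMBINING ACUTE ACCENT composes to the single code point U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")

# "q" + U+0301 has no precomposed form, so NFC leaves it as two code points.
q_acute = unicodedata.normalize("NFC", "q\u0301")

# A character class treats each code point as a separate member, so this
# set really contains three things: U+00E9, "q", and a bare U+0301.
in_class = re.compile("[" + composed + q_acute + "]")

# Alternation keeps the two-code-point sequence together as one unit.
alternation = re.compile("(?:" + composed + "|" + q_acute + ")")

class_match = in_class.match("q")         # matches a bare "q" (not intended)
alt_match = alternation.match("q")        # None: "q" alone is not in the set
alt_full = alternation.match("q\u0301x")  # matches the whole "q" + accent
```

So the character class happily accepts a plain "q", whereas the
alternation only accepts the accented sequence as a whole.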

If the intention is to match the two-character string "th",
then the obvious solution would be to wrap the substring
in a capturing or non-capturing group:

    pattern = r'(?:th)'

Though I suppose one could abuse the character-set syntax by
doing something like:

    pattern = r'[t][h]'

However, even the first example (using a group) is
superfluous if "th" is the only substring to be matched.
Grouping only becomes necessary in more complex patterns,
for example when a quantifier or an alternation must apply
to "th" as a unit.
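
A quick sketch to check this, using only the standard re module. The
sample text and the variable names are made up for illustration:

```python
import re

text = "the path to the north"

# All three spellings find the same four occurrences of "th".
plain = re.findall(r'th', text)
grouped = re.findall(r'(?:th)', text)
classes = re.findall(r'[t][h]', text)

# The group starts to matter once a quantifier is involved:
# (?:th)+ repeats the pair, while th+ repeats only the "h".
repeated = re.fullmatch(r'(?:th)+', 'ththth')
ungrouped = re.fullmatch(r'th+', 'ththth')   # None
only_h = re.fullmatch(r'th+', 'thhh')        # matches
```

Here plain, grouped and classes all compare equal, which is why the
bare r'th' is enough when nothing else in the pattern depends on it.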