Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Fri Jul 14 15:00:03 EDT 2017


Rhodri James <rhodri at kynesim.co.uk>:

> On 14/07/17 15:14, Marko Rauhamaa wrote:
>> I'd like to understand this better. Maybe you have a couple of
>> examples to share?
>
> Sure.
>
> What I've mostly been looking at recently has been the Expat XML parser.
> XML chooses to deal with one of your problems by defining that it's not
> having anything to do with combining: sequences of codepoints are all
> you need to worry about when comparing strings.  U+00E8 (LATIN SMALL
> LETTER E WITH GRAVE) is not the same as U+0065 (LATIN SMALL LETTER E)
> followed by U+0300 (COMBINING GRAVE ACCENT) for example.

Very interesting. The relevant W3C spec confirms what you said:

  5. Test the resulting sequences of code points bit-by-bit for identity.

  [...]

  This document therefore recommends, when possible, that all content be
  stored and exchanged in Unicode Normalization Form C (NFC).

  <URL: https://www.w3.org/TR/charmod-norm/>
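
To make the difference concrete, here is a minimal Python 3 sketch using
only the standard unicodedata module. The helper names are mine, purely
for illustration; they come from neither the W3C document nor Expat.

```python
import unicodedata

# The two spellings from Rhodri's example.
precomposed = "\u00e8"       # LATIN SMALL LETTER E WITH GRAVE
decomposed = "\u0065\u0300"  # LATIN SMALL LETTER E + COMBINING GRAVE ACCENT


def same_code_points(a, b):
    """Compare code point by code point, with no normalization."""
    return a == b


def same_after_nfc(a, b):
    """Compare after converting both strings to NFC."""
    nfc_a = unicodedata.normalize("NFC", a)
    nfc_b = unicodedata.normalize("NFC", b)
    return nfc_a == nfc_b


print(len(precomposed), len(decomposed))           # 1 2
print(same_code_points(precomposed, decomposed))   # False
print(same_after_nfc(precomposed, decomposed))     # True
```

So a plain code point comparison treats the two spellings as different
strings, and they only compare equal if both sides were normalized to
NFC beforehand, which is presumably why the spec recommends storing and
exchanging content in that form.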


Marko


