Grapheme clusters, a.k.a. real characters

Rhodri James rhodri at kynesim.co.uk
Fri Jul 14 10:48:56 EDT 2017


On 14/07/17 15:14, Marko Rauhamaa wrote:
> Rhodri James <rhodri at kynesim.co.uk>:
> 
>> On 14/07/17 14:31, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>> Speaking as someone who has been up to his elbows in this recently, I
>> would say emphatically that it does make things worse. It adds an
>> extra layer of complexity to all of the questions you were asking, and
>> more. A single codepoint is a meaningful thing, even if its meaning
>> may be modified by combining. A single byte may or may not be
>> meaningful.
> 
> I'd like to understand this better. Maybe you have a couple of examples
> to share?

Sure.

What I've mostly been looking at recently has been the Expat XML parser.
XML chooses to deal with one of your problems by declaring that it will
have nothing to do with combining: sequences of codepoints are all you
need to worry about when comparing strings.  U+00E8 (LATIN SMALL LETTER E
WITH GRAVE) is not the same as U+0065 (LATIN SMALL LETTER E) followed by
U+0300 (COMBINING GRAVE ACCENT), for example.
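
As a rough illustration of that rule in Python (a sketch only; the
function names are made up for this example and have nothing to do with
Expat's API), a plain codepoint-by-codepoint comparison treats the two
spellings as different strings, and only an explicit normalisation step
makes them match:

```python
import unicodedata

precomposed = "\u00e8"    # LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"    # LATIN SMALL LETTER E + COMBINING GRAVE ACCENT


def codepoints_equal(a, b):
    # Hypothetical helper: compare strings as raw codepoint sequences,
    # which is the comparison described above for XML.
    return a == b


def nfc_equal(a, b):
    # Hypothetical helper: compare after NFC normalisation.  Shown only
    # for contrast with the codepoint-sequence comparison.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(precomposed), len(decomposed))
print(codepoints_equal(precomposed, decomposed))
print(nfc_equal(precomposed, decomposed))
```

Both strings display as the same grapheme, but they have different
lengths and compare unequal unless something normalises them first.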

However, Expat is written in C, and it reads in UTF-8 as a sequence of
bytes.  There are endless checks all over the code that complete UTF-8
byte sequences have been read in or passed across functional interfaces.
When you are dealing with a bytestream like this, you cannot assume
that you have complete codepoints, and you cannot find codepoint
boundaries without searching along the string.  It's only once you have
reconstructed the codepoint that you can tell what sort of character you
have, and whether or not it is valid in your parsing context.
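
To make the byte-level problem concrete, here is a small Python sketch
(again not Expat's code; the helper names are invented for the example).
It finds codepoint boundaries by scanning for bytes that are not UTF-8
continuation bytes, and shows that a buffer cut in the middle of a
codepoint cannot be decoded on its own, whereas an incremental decoder
holds the partial sequence until the rest arrives:

```python
import codecs

# U+00E8 followed by U+20AC: two codepoints, five bytes in UTF-8.
data = "\u00e8\u20ac".encode("utf-8")


def is_continuation(byte):
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def codepoint_starts(buf):
    # Hypothetical helper: the only way to find boundaries is to walk
    # the bytes and note every one that is not a continuation byte.
    return [i for i, b in enumerate(buf) if not is_continuation(b)]


def decode_chunks(chunks):
    # Hypothetical helper: decode a stream that may be split anywhere,
    # letting the incremental decoder buffer incomplete sequences.
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))
    return pieces


print(codepoint_starts(data))

# Cutting after three bytes leaves the first byte of U+20AC dangling.
head, tail = data[:3], data[3:]
try:
    head.decode("utf-8")
except UnicodeDecodeError as exc:
    print("incomplete:", exc.reason)

print(decode_chunks([head, tail]))
```

A single byte taken from the middle of `data` says nothing by itself;
you only know which character you have once the whole sequence has been
collected and decoded.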

-- 
Rhodri James *-* Kynesim Ltd


