Grapheme clusters, a.k.a. real characters

Mikhail V mikhailwas at gmail.com
Sat Jul 15 18:38:42 EDT 2017


On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> Random access to code points is as uninteresting as random access to
> UTF-8 bytes.
> I might want random access to the "Grapheme clusters, a.k.a. real
> characters".

What _real_ characters are you referring to?
If your data has "á" (U+00E1), then it is one real character;
if you have "a" (U+0061) and "ˊ" (U+02CA), then it is _two_
real characters. So in both cases you have access to code points =
real characters.
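
A minimal Python 3 sketch of the two cases above, in case it helps to
see them concretely. The names precomposed, two_chars, combining and
describe are made up for illustration. The third string, "a" followed
by COMBINING ACUTE ACCENT (U+0301), is not from this post; it is an
assumption, added only for comparison, about the kind of sequence the
grapheme cluster argument usually has in mind.

```python
import unicodedata

precomposed = "\u00e1"   # "á" as one code point
two_chars = "a\u02ca"    # "a" followed by "ˊ" (U+02CA), as described above
combining = "a\u0301"    # assumption, not from the post: "a" + COMBINING ACUTE ACCENT


def describe(s):
    """Return a (code point, Unicode name) pair for every code point in s."""
    return [("U+%04X" % ord(c), unicodedata.name(c)) for c in s]


for s in (precomposed, two_chars, combining):
    # len() and indexing on a str both work on code points.
    print(len(s), describe(s))

# NFC composes the combining sequence into U+00E1 but leaves the
# "a" + U+02CA pair as two code points.
print(unicodedata.normalize("NFC", combining) == precomposed)
print(unicodedata.normalize("NFC", two_chars) == two_chars)
```

In Python 3, len() and indexing count code points, so the first string
has length 1 and the other two have length 2.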

For the sake of metaphysical discussion: in _my_ definition there
is no such "real" character as "á", since it is the "a" glyph with some dirt.
So, according to my definition, it should be two separate characters,
both semantically and technically.

And, in my definition, the whole Unicode is a huge junkyard, to start with.

But opinions may vary, and in case you prefer, or are forced, to write "á",
then it can be impractical to store it as two characters, regardless of
encoding.
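
As a rough illustration of the storage side only (practicality may
well cover more than byte counts), here is a small sketch comparing
the encoded size of the one-character and two-character forms. The
helper name encoded_sizes and the choice of the three encodings are
assumptions made for this example.

```python
precomposed = "\u00e1"   # "á" as one code point
two_chars = "a\u02ca"    # "a" followed by "ˊ" (U+02CA)


def encoded_sizes(s, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Return the number of bytes s occupies in each of the given encodings."""
    return {enc: len(s.encode(enc)) for enc in encodings}


print(encoded_sizes(precomposed))
print(encoded_sizes(two_chars))
```

The little-endian variants are used so that no byte order mark is
added to the counts. The two-character form comes out larger in all
three encodings.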



Mikhail


