Grapheme clusters, a.k.a.real characters

Mikhail V mikhailwas at gmail.com
Sun Jul 16 19:25:48 EDT 2017


>> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>>> Random access to code points is as uninteresting as random access to
>>> UTF-8 bytes. I might want random access to the "Grapheme clusters,
>>> a.k.a.real characters".
>>
>> What _real_ characters are you referring to?
>> If your data has "á" (U+00E1), then it is one real character,
>> if you have "a" (U+0061) and "ˊ" (U+02CA) then it is _two_
>> real characters. So in both cases you have access to code points =
>> real characters.

>It's true that confusion is caused by the ambiguity of the term
>"character."

Yes, but you said "I might want random access to the 'Grapheme clusters,
a.k.a. real characters'", and I had the impression that you had some concrete
concept of grapheme clusters and some (generally useful) example of an
implementation.
Without concrete examples it is just juggling with terms.

>> But opinions may vary, and in case you prefer or are forced to write "á",
>> then it can be impractical to store it as two characters, regardless
>> of encoding.

> Now I'm not following you.

For example, I want to type the Cyrillic word " рекá " (with an acute accent to
denote the stress on the last vowel, say for a pronunciation tutorial).
The most common solution would be to just type á instead of a.
And it is indeed the most practical: if I use a separate modifier acute accent
character instead, then it will be hard to select/paste such text and it
will not render accurately.
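
To make the two options concrete, here is a minimal sketch in Python of what
each one stores. It assumes the separate accent is the combining acute accent
U+0301 (my assumption for illustration; the U+02CA mentioned earlier is a
spacing modifier letter, not a combining mark), and the variable and function
names are made up for this example.

```python
import unicodedata

# Option 1: type the precomposed Latin letter in place of the Cyrillic vowel.
precomposed = "рек\u00e1"   # р, е, к + LATIN SMALL LETTER A WITH ACUTE

# Option 2 (assumed variant): keep the Cyrillic vowel and append U+0301.
combining = "река\u0301"    # р, е, к, а + COMBINING ACUTE ACCENT


def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


for label, text in (("precomposed", precomposed), ("combining", combining)):
    print(label, "has", len(text), "code points:")
    for name in describe(text):
        print("   ", name)
```

The first string has four code points and the second has five, so "one code
point per visible character" holds only for the first one, and that one mixes
Latin into a Cyrillic word.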

The obvious consequences: á is not from the Cyrillic code range,
so e.g. it will break hyphenation rules, and it will look consistent only
if the Cyrillic font's "a" looks exactly the same as the Latin "a".
Not to mention that it is not always possible to find a glyph with the
'right kind of dirt around'.
For such cases, the technically better solution would be to use a separate
accent character to denote the stress. In case of font issues it would
at least render as, say, an apostrophe.
Still, in practice just typing "á" works better, because editors and
even some professional DTP software cannot handle context-based glyph
rendering well.
In other words, I think the internal representation should use a
separate modifier character, even though it seems impractical from many
points of view. And it _is_ impractical if one has such
things as "á" as a frequent character in normal writing (which
should not be the case for an adequate modern writing system, though).
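
A rough sketch of the "not from the Cyrillic code range" problem, again in
Python and again assuming the combining acute accent U+0301 as the separate
accent. The helper scripts_of_letters is hypothetical, and taking the first
word of the Unicode character name as the "script" is only a crude heuristic,
not a real script property lookup.

```python
import unicodedata


def scripts_of_letters(text):
    """Crude script check: first word of the Unicode name of each letter.

    Combining marks are not letters (str.isalpha() is False for them),
    so they are skipped.
    """
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


with_latin_letter = "рек\u00e1"     # ends with Latin á (U+00E1)
with_combining_mark = "река\u0301"  # ends with Cyrillic а + U+0301

print(scripts_of_letters(with_latin_letter))
print(scripts_of_letters(with_combining_mark))

# Normalization does not merge the two forms: NFC composes Latin a + U+0301
# into U+00E1, but leaves Cyrillic а + U+0301 as two code points.
latin_nfc = unicodedata.normalize("NFC", "a\u0301")
cyrillic_nfc = unicodedata.normalize("NFC", "\u0430\u0301")
print(len(latin_nfc), len(cyrillic_nfc))
```

The word typed with "á" reports both Cyrillic and Latin letters, which is the
kind of thing that trips up hyphenation or spell checking, while the variant
with a separate accent stays purely Cyrillic at the cost of an extra code
point.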


Mikhail


