Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Sun Jul 16 03:55:17 EDT 2017


Mikhail V <mikhailwas at gmail.com>:

> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>> Random access to code points is as uninteresting as random access to
>> UTF-8 bytes. I might want random access to the "Grapheme clusters,
>> a.k.a. real characters".
>
> What _real_ characters are you referring to?
> If your data has "á" (U+00E1), then it is one real character,
> if you have "a" (U+0061) and "ˊ" (U+02CA) then it is _two_
> real characters. So in both cases you have access to code points =
> real characters.

It's true that confusion is caused by the ambiguity of the term
"character."

> For metaphysical discussion - in _my_ definition there is no such
> "real" character as "á", since it is the "a" glyph with some dirt, so
> according to my definition, it should be two separate characters, both
> semantically and technically seen.

Here's the problem: when the human user types in "á" (with one, two or
three keystrokes), they don't know how the computer represents it
internally. The Unicode standard allows for two *equivalent* code point
sequences: the precomposed U+00E1, or "a" (U+0061) followed by the
combining acute accent (U+0301)
(<URL: https://en.wikipedia.org/wiki/Unicode_equivalence>). When the
computer outputs either sequence, the visible result is the single
letter "á". The human user doesn't know, or care, about the internal
representation.
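
To make the equivalence concrete, here is a minimal sketch, assuming
Python 3 and only the standard unicodedata module. The variable names
are mine, chosen for illustration. It also shows that U+02CA is a
spacing modifier letter rather than a combining mark, so "a" followed by
U+02CA is not canonically equivalent to "á"; the combining accent is
U+0301.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT

# Two different code point sequences, so a plain comparison fails.
same_raw = precomposed == decomposed            # False
lengths = (len(precomposed), len(decomposed))   # (1, 2)

# Normalization maps each form onto the other.
nfc = unicodedata.normalize("NFC", decomposed)   # composes to U+00E1
nfd = unicodedata.normalize("NFD", precomposed)  # decomposes to a + U+0301
same_normalized = nfc == precomposed and nfd == decomposed

# U+02CA (MODIFIER LETTER ACUTE ACCENT) has combining class 0 and is
# left untouched by NFC, so it never merges with the preceding "a".
modifier = "a\u02ca"
modifier_unchanged = unicodedata.normalize("NFC", modifier) == modifier
modifier_class = unicodedata.combining("\u02ca")   # 0
combining_class = unicodedata.combining("\u0301")  # nonzero

print(same_raw, lengths, same_normalized, modifier_unchanged)
```

Comparing strings only after normalizing both to the same form (NFC or
NFD) is one way to make the two representations behave alike.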

The user's expectation is that the visible letter "á" should behave like
any other single letter. For example, a text editor should move the
cursor past it with a single press of the left or right arrow key. Also,
if I perform a regular-expression search in the editor and look for

   Alv[aá]rez

I should get a match with either Alvarez or Alvárez.
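
Here is a rough sketch of how both expectations could be approximated
with the standard library alone, assuming Python 3. The helper names
nfc() and clusters() are hypothetical, made up for this example.
clusters() is a deliberate simplification: it only glues combining marks
onto the preceding code point and is not the full grapheme cluster
algorithm of Unicode Standard Annex #29 (it ignores Hangul jamo, emoji
ZWJ sequences and so on).

```python
import re
import unicodedata


def nfc(s):
    """Normalize to NFC so equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", s)


def clusters(s):
    """Simplified grouping: attach combining marks to the previous
    code point. Not a complete UAX #29 implementation."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


# The pattern from the text, written with an escape for the precomposed á.
pattern = re.compile(nfc("Alv[a\u00e1]rez"))

candidates = [
    "Alvarez",          # no accent
    "Alv\u00e1rez",     # precomposed á
    "Alva\u0301rez",    # "a" + combining acute accent
]

# Without normalization the decomposed spelling is missed.
raw = [bool(pattern.fullmatch(c)) for c in candidates]

# Normalizing the input first makes all three spellings match.
normalized = [bool(pattern.fullmatch(nfc(c))) for c in candidates]

# Cursor movement: one step per cluster rather than per code point.
steps = len(clusters("Alva\u0301rez"))
code_points = len("Alva\u0301rez")

print(raw, normalized, steps, code_points)
```

Normalizing helps here only because "á" happens to have a precomposed
form. Sequences that have none still need real grapheme cluster
segmentation.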

> And, in my definition, the whole Unicode is a huge junkyard, to start
> with.

I don't think anybody denies that. However, it's the best thing
available and, more importantly, a universally accepted standard.

> But opinions may vary, and in case you prefer or are forced to write "á",
> then it can be impractical to store it as two characters, regardless
> of encoding.

Now I'm not following you.


Marko


