Grapheme clusters, a.k.a.real characters

Rustom Mody rustompmody at gmail.com
Sun Jul 16 00:33:28 EDT 2017


On Sunday, July 16, 2017 at 4:09:16 AM UTC+5:30, Mikhail V wrote:
> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> > Random access to code points is as uninteresting as random access to
> > UTF-8 bytes.
> > I might want random access to the "Grapheme clusters, a.k.a.real
> > characters".
> 
> What _real_ characters are you referring to?
> If your data has "á" (U00E1), then it is one real character,
> if you have "a" (U0061) and "ˊ" (U02CA) then it is _two_
> real characters. So in both cases you have access to code points =
> real characters.

Right now, in an adjacent mailing list (Debian), I see someone signed off with a

grüß

I guess the third character is a u with some ‘dirt’.
What's the fourth?
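
For anyone who wants to ask Python the same question, here is a minimal
sketch using the standard unicodedata module. It assumes the sign-off was
typed with the precomposed ü (U+00FC); I cannot tell from the mail itself
which form was actually sent.

```python
import unicodedata

# Assumption: the sign-off used the precomposed "ü" (U+00FC), not "u" + U+0308.
word = "gr\u00fc\u00df"  # "grüß"

# One line per code point: the fourth is a single letter, not a decorated "s".
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0067  LATIN SMALL LETTER G
# U+0072  LATIN SMALL LETTER R
# U+00FC  LATIN SMALL LETTER U WITH DIAERESIS
# U+00DF  LATIN SMALL LETTER SHARP S

# Canonical decomposition splits the "ü" into "u" + combining diaeresis,
# but leaves "ß" alone, so the same word is now five code points.
nfd = unicodedata.normalize("NFD", word)
print(len(word), len(nfd))  # 4 5
print([unicodedata.name(ch) for ch in nfd[2:4]])
# ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']
```

So whether the "dirt" counts as a separate character depends on which
normalization form the text happens to be in, while a reader sees four
letters either way.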

> 
> For metaphysical discussion -  in _my_ definition there

s/metaphysical/linguistic/

> is no such "real" character as "á", since it is the "a" glyph with some dirt,
> so according to my definition, it should be two separate characters,
> both semantically and technically seen.
> 
> And, in my definition, the whole Unicode is a huge junkyard, to start with.
> 
> But opinions may vary, and in case you prefer or forced to write "á",
> then it can be impractical to store it as two characters, regardless of
> encoding.

Heck, even in the English that I learnt in school we had
ægis, homœopath, etc.
And just now, looking up
https://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
I see economics is œconomics!!

Seriously, the word "ligature", like the word "grapheme", is misleading.
It's not a graphical or typographic notion; it's an atom of the language's lexicon.

No Hindi speaker seeing
क + ई = की
calls the last anything but a letter.
And the vowel sign ी is never a first-class vowel.
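
To make the Hindi case concrete, here is a rough sketch of what Python sees
in की. The helper names describe and rough_clusters are my own, and
rough_clusters is only an approximation (attach every combining mark to the
preceding character); it is not the full grapheme cluster algorithm of
Unicode UAX #29.

```python
import unicodedata

ii_letter = "\u0908"   # ई, the independent vowel
ki = "\u0915\u0940"    # की = क (KA) followed by the vowel sign ी

def describe(text):
    """Return (code point, name, general category) for each code point."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
            for ch in text]

def rough_clusters(text):
    """Hypothetical helper: glue each combining mark (category M*) onto the
    character before it. A crude stand-in for real grapheme segmentation."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

for row in describe(ki):
    print(row)
# ('U+0915', 'DEVANAGARI LETTER KA', 'Lo')
# ('U+0940', 'DEVANAGARI VOWEL SIGN II', 'Mc')

print(describe(ii_letter))
# [('U+0908', 'DEVANAGARI LETTER II', 'Lo')]

# Two code points, but one unit once marks are attached to their base.
print(len(ki), len(rough_clusters(ki)))  # 2 1
```

The vowel sign is classed as a mark (Mc) while the independent vowel ई is a
letter (Lo), and there is no precomposed form of की, so the string stays at
two code points whatever normalization is applied.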


