grapheme cluster library

Steve D'Aprano steve+python at pearwood.info
Mon Oct 23 03:45:17 EDT 2017


On Mon, 23 Oct 2017 05:47 pm, Rustom Mody wrote:

> On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro
> wrote:
[...]
>> Bear in mind that the logical representation of the text is as code points,
>> graphemes would have more to do with rendering.
> 
> Heh! Speak of Euro/Anglo-centrism!

I think that Lawrence may be thinking of glyphs. Glyphs are the display forms
that get rendered. Graphemes are the smallest units of written language.


> In a sane world graphemes would be called letters

Graphemes *aren't* letters.

For starters, not all written languages have an alphabet. No alphabet, no
letters. Even in languages with an alphabet, not all graphemes are letters.

Graphemes include:

- logograms (symbols which represent a morpheme, an entire word, or 
  a phrase), e.g. Chinese characters, ampersand &, the ™ trademark 
  or ® registered trademark symbols;

- syllabic characters such as Japanese kana or Cherokee;

- letters of alphabets;

- letters with added diacritics;

- punctuation marks;

- mathematical symbols;

- typographical symbols;

- word separators;

and more. Many linguists also include digraphs (pairs of letters) like the
English "th", "sh", "qu", or "gh" as graphemes.


https://www.thoughtco.com/what-is-a-grapheme-1690916

https://en.wikipedia.org/wiki/Grapheme
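
A rough Python sketch of the point that graphemes are much more than letters,
using the standard library's unicodedata module. Unicode's "general category"
is not the linguists' classification, so treat it only as an approximation;
the sample characters and their labels are my own choice for illustration.

```python
import unicodedata

# Sample characters chosen for illustration; the general category is
# Unicode's classification, only loosely related to the linguistic one.
samples = {
    "ampersand (logogram)": "&",
    "trademark sign (logogram)": "\u2122",
    "hiragana A (syllabic)": "\u3042",
    "Latin letter a": "a",
    "e with acute (letter plus diacritic)": "\u00e9",
    "exclamation mark (punctuation)": "!",
    "plus sign (mathematical)": "+",
    "space (word separator)": " ",
}

# Map each label to its two-letter Unicode general category.
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}
for label, cat in categories.items():
    print(f"{label:40} {cat}")

# A digraph such as "th" is one grapheme to many linguists,
# but it is still two code points in a Python string.
digraph = "th"
print(len(digraph))
```

Only categories starting with "L" are letters as far as Unicode is concerned;
the ampersand, the trademark sign, the plus sign and the space all fall
elsewhere.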


> And unicode codepoints would be called something else — letterlets??
> To be fair to the Unicode consortium, they strive hard to call them
> codepoints But in an anglo-centric world, the conflation of codepoint to
> letter is inevitable I guess. To hear how a non Roman-centric view of the 
> world would sound: A 'w' is a poorly double-struck 'u'
> A 't' is a crossed 'l'
> Reasonable?

No, T is not a crossed L -- they are unrelated letters and the visual
similarity is a coincidence. They are no more connected than E and F are,
even though E looks like an F with an extra line.

But you are more right than you knew regarding W: it *literally was* a
doubled-up V (sometimes written U) once upon a time.

For a long time W did not appear in the Latin alphabet, even after people used
it in written text. It was considered a digraph VV, then a ligature, and
finally, only gradually, a proper letter. As late as the 16th century the
German grammarian Valentin Ickelshamer complained that hardly anyone,
including school masters, knew what to do with W or what it was called.

https://en.wikipedia.org/wiki/W#History



> The lead of https://en.wikipedia.org/wiki/%C3%9C has
> 
> | Ü, or ü, is a character…classified as a separate letter in several
> | extended Latin alphabets
> | (including Azeri, Estonian, Hungarian and Turkish), but as the letter U
> | with an umlaut/diaeresis in others such as Catalan, French, Galician,
> | German, Occitan and Spanish.


Indeed: sometimes the same grapheme is considered a letter in one language and
a letter-plus-diacritic in another.
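
At the code point level both views of Ü exist side by side. A minimal sketch,
assuming only the standard library's unicodedata module (the variable names
are mine): the same grapheme can be stored as one precomposed code point or as
U followed by a combining diaeresis, and normalization converts between the
two.

```python
import unicodedata

# U-with-diaeresis as a single precomposed code point (U+00DC).
composed = "\u00dc"

# NFD splits it into the base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed))    # 1
print(len(decomposed))  # 2
print([unicodedata.name(c) for c in decomposed])

# The two strings compare unequal until they are normalized the same way.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Either way a reader sees one grapheme, which is why counting code points with
len() is not a reliable way to count what people think of as characters.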



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



