grapheme cluster library

Rustom Mody rustompmody at gmail.com
Mon Oct 23 10:25:29 EDT 2017


On Monday, October 23, 2017 at 1:15:35 PM UTC+5:30, Steve D'Aprano wrote:
> On Mon, 23 Oct 2017 05:47 pm, Rustom Mody wrote:
> 
> > On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro
> > wrote:
> [...]
> >> Bear in mind that the logical representation of the text is as code points,
> >> graphemes would have more to do with rendering.
> > 
> > Heh! Speak of Euro/Anglo-centrism!
> 
> I think that Lawrence may be thinking of glyphs. Glyphs are the display forms
> that are rendered. Graphemes are the smallest units of written language.
> 
> 
> > In a sane world graphemes would be called letters
> 
> Graphemes *aren't* letters.
> 
> For starters, not all written languages have an alphabet. No alphabet, no
> letters. Even in languages with an alphabet, not all graphemes are letters.
> 
> Graphemes include:
> 
> - logograms (symbols which represent a morpheme, an entire word, or 
>   a phrase), e.g. Chinese characters, ampersand &, the ™ trademark 
>   or ® registered trademark symbols;
> 
> - syllabic characters such as Japanese kana or Cherokee;
> 
> - letters of alphabets;
> 
> - letters with added diacritics;
> 
> - punctuation marks;
> 
> - mathematical symbols;
> 
> - typographical symbols;
> 
> - word separators;
> 
> and more. Many linguists also include digraphs (pairs of letters) like the
> English "th", "sh", "qu", or "gh" as graphemes.
> 
> 
> https://www.thoughtco.com/what-is-a-grapheme-1690916
> 
> https://en.wikipedia.org/wiki/Grapheme

Um… OK, so am I using the wrong word? Your first link says:
| For example, the word 'ghost' contains five letters and four graphemes 
| ('gh,' 'o,' 's,' and 't')

Whereas findall from the new regex module (the third-party one, not the stdlib re) gives:

>>> from regex import findall
>>> findall(r'\X', "ghost")
['g', 'h', 'o', 's', 't']
>>> findall(r'\X', "church")
['c', 'h', 'u', 'r', 'c', 'h']
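
So \X is evidently splitting on something other than the linguist's grapheme: it never merges 'gh' or 'ch'. As far as I understand it, what \X matches is a Unicode "extended grapheme cluster", roughly a base character plus whatever combining marks follow it. Here is a rough sketch of that idea using only the stdlib. The function name simple_clusters is my own invention, and this is only an approximation of the real segmentation rules (which also cover Hangul jamo, emoji sequences, CR LF and more), not what regex actually does internally:

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper, and only an approximation: it ignores the
    other cluster rules (Hangul jamo, emoji sequences, CR LF, ...).
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks;
        # a mark is attached to the cluster that precedes it.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(simple_clusters("ghost"))       # five clusters, 'gh' is not merged
print(simple_clusters("cafe\u0301"))  # 'e' + COMBINING ACUTE ACCENT stay together
```

With this, "ghost" still comes out as five pieces, while "cafe\u0301" (five code points) comes out as four, which is the sense of "grapheme" I had in mind.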



More information about the Python-list mailing list