grapheme cluster library (Posting On Python-List Prohibited)

Rustom Mody rustompmody at gmail.com
Mon Oct 23 02:47:02 EDT 2017


On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro wrote:
> On Saturday, October 21, 2017 at 5:11:13 PM UTC+13, Rustom Mody wrote:
> > Is there a recommended library for manipulating grapheme clusters?
> 
> Is this <http://anoopkunchukuttan.github.io/indic_nlp_library/> any good?

Thanks, looks promising.
Dunno how much it lives up to its claims.
[For now the one-liner using findall from the regex module has sufficed:
findall(r'\X', «text»)

Thanks, MRAB, for the library.]
 
> Bear in mind that the logical representation of the text is as code points, graphemes would have more to do with rendering.

Heh! Speak of Euro/Anglo-centrism!

In a sane world, graphemes would be called letters,
and Unicode codepoints would be called something else: letterlets??
To be fair to the Unicode Consortium, they strive hard to call them codepoints.
But in an Anglo-centric world, the conflation of codepoint with letter is inevitable, I guess.
To hear how a non-Roman-centric view of the world would sound:
A 'w' is a poorly double-struck 'u'
A 't' is a crossed 'l'
Reasonable?
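
To make the codepoint/letter gap concrete, here is a rough stdlib-only sketch. The helper name rough_clusters is made up for illustration, and it is only an approximation: it glues combining marks onto the preceding codepoint and nothing more. Proper grapheme cluster segmentation (what \X in the regex module does) follows further rules that this ignores, such as ZWJ sequences and Hangul jamo.

```python
import unicodedata


def rough_clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    Hypothetical helper, an approximation only: it does not implement
    the full Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "u\u0308ber"      # 'u' + COMBINING DIAERESIS, then 'ber'
print(len(word))                    # 5 codepoints
print(len(rough_clusters(word)))    # 4 "letters"

syllable = "\u0915\u093f"  # DEVANAGARI LETTER KA + VOWEL SIGN I
print(len(syllable))                    # 2 codepoints
print(len(rough_clusters(syllable)))    # 1 cluster
```

So len() counts codepoints, while the grouped count is closer to what a reader would call letters.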

The lead of https://en.wikipedia.org/wiki/%C3%9C has:

| Ü, or ü, is a character…classified as a separate letter in several extended
| Latin alphabets (including Azeri, Estonian, Hungarian and Turkish), but as the
| letter U with an umlaut/diaeresis in others such as Catalan, French, Galician,
| German, Occitan and Spanish.
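
The same two readings of ü show up at the codepoint level. A small sketch with the stdlib unicodedata module (the variable names are just for illustration): ü can be one precomposed codepoint or a 'u' followed by a combining diaeresis, and the two strings only compare equal after normalization.

```python
import unicodedata

composed = "\u00fc"                                   # ü as one precomposed codepoint
decomposed = unicodedata.normalize("NFD", composed)   # 'u' + U+0308

print(len(composed), len(decomposed))                 # 1 2
print(composed == decomposed)                         # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# The decomposed form spelled out by codepoint name
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']
```

Both forms render the same, which is why counting codepoints says little about how many letters a reader sees.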


