Counting Unicode graphemes in Python

vincent wehren vincent at visualtrans.de
Fri Oct 24 12:55:20 EDT 2003


"Srinath Avadhanula" <srinathava_news at yahoo.com> wrote in message
news:Pine.SOL.4.44.0310240839360.21207-100000 at albinoni.EECS.Berkeley.EDU...
| Hello,
|
| I am wondering if there is a way of counting graphemes (or
| glyphs) in python. For example, in the following string:
|
| u'\u0915\u093e\u0915'
| (
| or equivalently,
| u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}"
| )
|
| the first two "code points" represent a single character on the screen.

My GUESS is that you can't do that unless you *know* exactly which code points
form ligatures. In Devanagari these are, e.g., the so-called dependent vowels
in the range 093e - 094c, where 093f stands "left of the consonant" when
rendered. (My knowledge of Indic languages is limited, at best, so there may
be more to it.)
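
A rough sketch of that idea, in current Python 3 syntax: treat every code
point whose general category starts with "M" (a combining mark) as part of
the preceding cluster. The function name grapheme_clusters is made up for
this example, and the rule is only an approximation: it ignores conjuncts
formed with the virama and the other cases covered by the Unicode text
segmentation report.

```python
import unicodedata

def grapheme_clusters(text):
    """Split text into approximate clusters: a base code point followed by
    any combining marks (general category Mn, Mc or Me)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # Combining mark: attach it to the cluster in progress.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters

text = "\u0915\u093e\u0915"
clusters = grapheme_clusters(text)
print(len(text), "code points,", len(clusters), "clusters")
```

For the string in the question this groups KA + VOWEL SIGN AA together and
leaves the second KA on its own.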



| In my application, the GUI seems to handle that part (i.e combining
| characters). However, I need to handle cursor movement myself. The GUI
| can only be told to move forward by a specified number of bytes.

What GUI are you working with?
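
Whatever the toolkit turns out to be, if it counts in bytes of a UTF-8
buffer (that is an assumption on my part), the step for one cursor movement
could be computed along these lines. cluster_byte_lengths is a hypothetical
helper, and it uses the same simplified "base plus combining marks" rule
rather than the full boundary rules.

```python
import unicodedata

def cluster_byte_lengths(text):
    """Return the UTF-8 byte length of each approximate cluster in text.

    A cluster here is a base code point plus any following combining
    marks (general category starting with "M").
    """
    lengths = []
    for ch in text:
        n = len(ch.encode("utf-8"))
        if lengths and unicodedata.category(ch).startswith("M"):
            lengths[-1] += n   # mark belongs to the previous cluster
        else:
            lengths.append(n)  # new cluster
    return lengths

# Each Devanagari code point takes 3 bytes in UTF-8.
print(cluster_byte_lengths("\u0915\u093e\u0915"))  # [6, 3]
```

Moving the cursor right from the start of the string would then mean
advancing by the first entry, moving right again by the second, and so on.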

| Therefore, to make cursor keys move over graphemes or glyphs rather than
| code-points, I need to figure out a way to calculate grapheme boundaries
| in python. I searched the web for a long long time and came up with a
| few results, the most relevant of which seems to be:
|
| http://www.unicode.org/reports/tr29/tr29-2.html
|
| This page contains rules for calculating grapheme boundaries for Hangul
| characters or something of that sort. However, I did not find any
| information about more general algorithms.


Some systems, such as the X server on IndiX, seem to dig into the GPOS and
GSUB tables in the OpenType font. See:

http://rohini.ncst.ernet.in/indix/doc/HOWTO/Devanagari-HOWTO-5.html



|
| I also took a look at the unicodedata module in python and that seems to
| have a function called unicodedata.category. This function seems to
| return the strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
| have been unable to find a reference for what these strings signify.
| Where should I look for them? (I am hoping for something more specific
| than "Look at www.unicode.org")

Would "Look at
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values " do?
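
In short, the two-letter codes are General Category values: the first letter
is the major class (L for letter, M for mark, and so on) and the second the
subclass. A small sketch for looking them up; the MEANINGS table is my own
partial, hand-written list and describe is a made-up helper, neither is
something the unicodedata module provides.

```python
import unicodedata

# Hand-written, partial table of General Category codes
# (illustrative only, not part of the unicodedata module).
MEANINGS = {
    "Lo": "Letter, other",
    "Mn": "Mark, nonspacing",
    "Mc": "Mark, spacing combining",
    "Me": "Mark, enclosing",
}

def describe(ch):
    """Return (category code, human-readable meaning) for one character."""
    code = unicodedata.category(ch)
    return code, MEANINGS.get(code, "not in this partial table")

for ch in "\u0915\u093e\u093f":
    code, meaning = describe(ch)
    print("U+%04X" % ord(ch), unicodedata.name(ch), code, meaning)
```

The values printed come from the Unicode database bundled with your
interpreter, so it is worth running this yourself rather than trusting a
table copied from elsewhere.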

HTH,
Vincent Wehren

|
| Thanks,
| Srinath
|





