Counting Unicode graphemes in Python

Fri Oct 24 11:55:54 EDT 2003

Hello,

I am wondering if there is a way of counting graphemes (or
glyphs) in Python. For example, in the following string:

u'\u0915\u093e\u0915'
(
or equivalently,
u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}"
)

the first two "code points" represent a single character on the screen.
In my application, the GUI seems to handle that part (i.e., rendering
combining characters). However, I need to handle cursor movement myself.
The GUI can only be told to move forward by a specified number of bytes.
Therefore, to make the cursor keys move over graphemes or glyphs rather
than code points, I need to figure out a way to calculate grapheme
boundaries in Python. I searched the web for a long time and came up
with a few results, the most relevant of which seems to be:

http://www.unicode.org/reports/tr29/tr29-2.html

This page contains rules for calculating grapheme boundaries for Hangul
characters or something of that sort. However, I did not find any
information about more general algorithms.
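
To make the problem concrete, here is a small sketch of what Python
reports for the string above. It uses only the standard unicodedata
module. The UTF-8 encoding is my assumption for illustration; the byte
count the GUI needs depends on whatever encoding it uses internally.

```python
import unicodedata

s = u'\u0915\u093e\u0915'

# len() counts code points, not what appears on the screen.
num_code_points = len(s)

# Assumption: UTF-8. Another encoding would give different byte counts.
num_bytes = len(s.encode('utf-8'))

# List each code point with its official Unicode name.
for ch in s:
    print('U+%04X %s' % (ord(ch), unicodedata.name(ch)))

print(num_code_points, num_bytes)
```

Neither number matches the two units I see on the screen, which is why
I need the grapheme boundaries.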

I also took a look at the unicodedata module in Python, which has a
function called unicodedata.category. This function seems to return
the strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
have been unable to find a reference for what these strings signify.
Where should I look for them? (I am hoping for something more specific
than "Look at www.unicode.org") Is this information relevant at all for
counting graphemes?
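
In case it clarifies what I am after, here is a rough sketch of the
naive approach I can imagine. It guesses that categories beginning with
'M' denote combining marks and attaches each such character to the one
before it. The function names approx_clusters and cluster_byte_lengths
are made up for this example, and UTF-8 is again an assumption. This is
only an approximation, not the rules from the report above; for
instance, I would not expect it to handle conjuncts or Hangul correctly.

```python
import unicodedata

def approx_clusters(text):
    """Group each mark (category starting with 'M') with the
    character before it. A crude approximation, not the TR29 rules."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith('M'):
            # Treat the mark as part of the previous cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters

def cluster_byte_lengths(text, encoding='utf-8'):
    """Number of bytes the cursor would skip for each cluster.
    The encoding is an assumption; use whatever the GUI expects."""
    return [len(c.encode(encoding)) for c in approx_clusters(text)]

s = u'\u0915\u093e\u0915'
clusters = approx_clusters(s)
steps = cluster_byte_lengths(s)
print(len(clusters), steps)
```

If something like this is on the right track, each cursor key press
would advance by the next entry in steps.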

Thanks,
Srinath