unicodedata implementation - categories

chris.monsanto at gmail.com
Sat Oct 13 18:23:30 EDT 2007


On Oct 13, 4:32 pm, James Abley <james.ab... at gmail.com> wrote:
> Hi,
>
> I'm trying to understand how CPython implements unicodedata, with a view to
> providing an implementation for Jython. This is a background, low priority
> thing for me, since I last posted to this list about it in February!
>
> Python 2.5.1 (r251:54863, May  2 2007, 16:56:35)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import unicodedata
> >>> c = unichr(0x10FFFF)
> >>> unicodedata.name(c)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError: no such name
> >>> unicodedata.category(unichr(0x10FFFF))
>
> 'Cn'
>
> 0x10FFFF is not a valid codepoint in Unicode 4.1, which is the version of
> the Unicode standard that Python 2.5 supports.
>
> So I have a couple of questions:
>
> 1) Why doesn't the category method raise an Exception, like the name method
> does?
> 2) Given that the category method doesn't currently raise an Exception,
> please could someone explain how the category is calculated? I have tried to
> figure it out based on the CPython code, but I have thus far failed, and I
> would also prefer to have it explicitly defined, rather than mandating that
> a Jython (.NET, etc) implementation uses the same (possibly non-optimal for
> Java) data structures and algorithms.
>
> My background is Mathematics rather than pure Computer Science, so doubtless
> I still have some gaps in my education to be filled when it comes to data
> structures and algorithms and I would welcome the opportunity to fill some
> of those in. References to Knuth or some on-line reading would be much
> appreciated, to help me understand the CPython part.
>
> Cheers,
>
> James
> --
> View this message in context:http://www.nabble.com/unicodedata-implementation---categories-tf46194...
> Sent from the Python - python-list mailing list archive at Nabble.com.

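Cn is the "Other, Not Assigned" category in Unicode. No assigned
character has this property; it is the category reported for code
points that are unassigned or are noncharacters, and U+10FFFF is a
noncharacter. I'm not sure why category() doesn't raise an exception,
but if it returns Cn, then you know the code point is not an assigned
character.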
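
To make the difference between the two calls concrete, here is a minimal sketch. It assumes Python 3, where chr() replaces the unichr() used in the session above. The helper names describe() and is_assigned() are made up for illustration and are not part of unicodedata.

```python
import unicodedata


def describe(codepoint):
    """Return (category, name) for an integer code point.

    name is None when unicodedata.name() raises ValueError.
    """
    ch = chr(codepoint)
    category = unicodedata.category(ch)  # returns a category string, e.g. 'Cn'
    try:
        name = unicodedata.name(ch)      # raises ValueError if there is no name
    except ValueError:
        name = None
    return category, name


def is_assigned(codepoint):
    """Treat a code point as assigned unless its category is Cn."""
    return unicodedata.category(chr(codepoint)) != "Cn"


if __name__ == "__main__":
    for cp in (0x41, 0x00, 0x10FFFF):
        print("U+%04X" % cp, describe(cp), is_assigned(cp))
```

Checking the category is a more reliable test than catching the ValueError from name(): control characters such as U+0000 have category Cc but no name, so a missing name alone does not mean the code point is unassigned.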



