unicodedata implementation - categories

"Martin v. Löwis" martin at v.loewis.de
Sun Oct 14 03:35:16 EDT 2007


> 1) Why doesn't the category method raise an Exception, like the name method
> does?

As Chris explains, the result, "Cn", means "Other, Not Assigned".
Python returns this category because it's the truth: for those
characters, the value of the "category" property really *is* Cn;
it means that they are not assigned.
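
To see the difference between the two methods, here is a small sketch
(written for Python 3; the choice of U+FFFF is mine, picked because it
is a noncharacter and so will never be assigned in a later Unicode
version):

```python
import unicodedata

# U+FFFF is a noncharacter: it has no name, and its category is Cn.
ch = "\uffff"

# category() does not raise; it reports the property value as it is.
category = unicodedata.category(ch)

# name() raises ValueError when the character has no name ...
try:
    unicodedata.name(ch)
    name_raised = False
except ValueError:
    name_raised = True

# ... unless a default is supplied as the second argument.
fallback = unicodedata.name(ch, "<unnamed>")

print(category, name_raised, fallback)
```

So category always has an answer, because "not assigned" is itself a
valid value of the property, whereas there simply is no name to return.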

If you are wondering how unicodedata.c comes up with the result:
the unassigned characters get a record index of 0, and that has a
category value of 0, which is "Cn".
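
Roughly, the idea looks like the toy sketch below. All names and tables
here are hypothetical and heavily simplified; they are not the actual
data structures in unicodedata.c, which are generated from the Unicode
database and are far larger.

```python
# Hypothetical category table: category value 0 stands for "Cn".
CATEGORY_NAMES = ["Cn", "Lu", "Ll"]

# Hypothetical property records; record 0 is the one shared by every
# unassigned code point.
RECORDS = [
    {"category": 0},  # record 0: unassigned
    {"category": 1},  # record 1: an uppercase letter
    {"category": 2},  # record 2: a lowercase letter
]

# Hypothetical code point -> record index mapping. Anything not listed
# falls back to record index 0.
RECORD_INDEX = {ord("A"): 1, ord("a"): 2}


def toy_category(ch):
    """Look up the category of ch through the toy tables above."""
    index = RECORD_INDEX.get(ord(ch), 0)
    return CATEGORY_NAMES[RECORDS[index]["category"]]


print(toy_category("A"), toy_category("a"), toy_category("\uffff"))
```

The point is only that "Cn" falls out of the default record; no special
case is needed for unassigned characters.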

> 2) Given that the category method doesn't currently raise an Exception,
> please could someone explain how the category is calculated? I have tried to
> figure it out based on the CPython code, but I have thus far failed, and I
> would also prefer to have it explicitly defined, rather than mandating that
> a Jython (.NET, etc) implementation uses the same (possibly non-optimal for
> Java) data structures and algorithms. 

You definitely should *not* follow the Python implementation. The
Unicode database is defined by the Unicode Consortium, so the
Unicode standard is the ultimate specification.

To implement it in Java, I recommend using java.lang.Character.getType.
If that returns java.lang.Character.UNASSIGNED, return "Cn".

Regards
Martin


