unicodedata implementation

James Abley james.abley at gmail.com
Sun Feb 18 18:17:07 EST 2007


Hi,

[Originally posted this to the dev list, but the moderator advised
posting here first]

I'm looking into implementing the unicodedata module for Jython, and
I'm trying to understand the contracts promised by its various
functions. Bear in mind that I'm probably targeting the CPython 2.3
implementation, although I would obviously be quite happy if my
implementation doesn't need much extra work to cover the 2.5
functionality!

As a previous poster found [1], the documentation is a little thin,
and they were pointed at the Unicode specification [2]. I've done a
little reading there, and have a little knowledge now, which is
always dangerous. There are still gaps, and I was hoping someone here
might be able to point out what I'm missing.

My problem is described here [3], but I'll summarise it and add a little.

2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;

(The UnicodeData.txt [4] entry for code point 0x2468, Unicode 3.2.0 [5])

verify(unicodedata.decimal(u'\u2468',None) is None)
verify(unicodedata.digit(u'\u2468') == 9)
verify(unicodedata.numeric(u'\u2468') == 9.0)

That works fine, and I can see why in the UnicodeData.txt entry. The
mirrored property (the N towards the end) is a handy marker: go back
three fields and read forward from there. The decimal property isn't
defined, the digit property is 9 and the numeric property is also 9.
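
To make the field counting concrete, here is a minimal sketch that splits
one record on ';' and reads the three numeric fields by position. It
assumes Python 3 syntax and that the 0-based positions 6, 7 and 8 hold the
decimal, digit and numeric fields (with mirrored at 9), which matches the
counting described above. The helper name numeric_fields is made up for
illustration.

```python
from fractions import Fraction

# Assumed 0-based field positions in a UnicodeData.txt record.
DECIMAL, DIGIT, NUMERIC, MIRRORED = 6, 7, 8, 9


def numeric_fields(record):
    """Return (decimal, digit, numeric); None where a field is empty."""
    fields = record.strip().split(";")
    decimal = int(fields[DECIMAL]) if fields[DECIMAL] else None
    digit = int(fields[DIGIT]) if fields[DIGIT] else None
    # The numeric field may be a fraction such as "1/2", so go via Fraction.
    numeric = float(Fraction(fields[NUMERIC])) if fields[NUMERIC] else None
    return decimal, digit, numeric


record = "2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;"
print(numeric_fields(record))  # (None, 9, 9.0)
```

The same helper can be pointed at any other record from the file; empty
fields simply come back as None.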

However, this next bit is what confuses me:

325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;

(The UnicodeData.txt entry for code point 0x325F, Unicode 3.2.0)

verify(unicodedata.decimal(u'\u325F',None) is None)
verify(unicodedata.digit(u'\u325F', None) is None)
verify(unicodedata.numeric(u'\u325F') == 35.0)

The last one fails - ValueError: not a numeric character.

Now, again looking at the UnicodeData.txt entry and the mirrored N
property, working back three fields and going forward from there shows
that the decimal property isn't set, the digit property isn't set and
the numeric property appears to be 35.

So from my understanding of the Unicode (3.2.0) spec, the code point
0x325F has a numeric property with a value of 35, but the Python
implementation of unicodedata (2.3 and 2.4; I haven't put 2.5 onto my
box yet) disagrees, presumably for good reason.
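
One way to check this kind of disagreement mechanically is to compare the
numeric field of a record against what unicodedata.numeric reports for the
same code point. The sketch below assumes Python 3 (chr rather than unichr)
and that the numeric field sits at 0-based position 8. The helper name
compare_numeric is hypothetical. What the second call prints depends on the
Unicode database bundled with the interpreter, so no result is claimed for
it here.

```python
import unicodedata
from fractions import Fraction


def compare_numeric(record):
    """Return (value from the record, value from unicodedata) for one record."""
    fields = record.strip().split(";")
    ch = chr(int(fields[0], 16))  # field 0 is the code point in hex
    # Field 8 is assumed to be the numeric property; it may be a fraction.
    expected = float(Fraction(fields[8])) if fields[8] else None
    # Passing a default avoids the ValueError for non-numeric characters.
    actual = unicodedata.numeric(ch, None)
    return expected, actual


nine = "2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;"
thirty_five = "325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;"

print(compare_numeric(nine))
print(compare_numeric(thirty_five))
```

A mismatch between the two values in a pair points at the implementation's
database rather than at the reading of the record.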

I can't see where I'm going wrong.

Cheers,

James

[1] http://groups.google.com/group/comp.lang.python/browse_frm/thread/39a894325686f329/7dbdda27be118836?lnk=st&q=unicodedata&rnum=10#7dbdda27be118836
[2] http://www.unicode.org/
[3] http://eternusuk.blogspot.com/2007/02/jython-unicodedata-initial-overview.html
[4] http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
[5] http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html


