regenerating unicodedata for py2.7 using py3 makeunicodedata.py?

Vlastimil Brom vlastimil.brom at gmail.com
Sat Nov 13 12:55:35 EST 2010


Hi all,
I'd like to ask about a surprising possibility I found while
investigating the new Unicode 6.0 standard for use in Python.
As the Python 2 series won't be updated in this regard
( http://bugs.python.org/issue10400 ),
I tried my "poor man's approach" of compiling the needed pyd file with
the recent Unicode data (cf. the older post
http://mail.python.org/pipermail/python-list/2010-March/1240002.html ).
While checking the changed format, I found, to my great surprise, that
it is possible to generate the header files using the py3
makeunicodedata.py,
which has already been updated for Unicode 6.0; this is even much more
comfortable than with the previous versions, as the needed data are
downloaded automatically.
http://svn.python.org/view/python/branches/py3k/Tools/unicode/makeunicodedata.py?view=markup&pathrev=85371
It turned out that the resulting headers are accepted by MS Visual
C++ Express along with the py2.7 source files,
and that the generated unicodedata.pyd seems to work, at
least in the cases I have tested so far.

Is it intended, or even guaranteed, that these generated files are
compatible across py2.7 and py3, or am I going to be bitten by some
less obvious issues later?

The newly added ranges and characters are available; only in CJK
Unified Ideographs Extension D are the character names missing
(while the categories are present), but this appears to be the same in
the original unicodedata with 5.2 for CJK Unified Ideographs Extension C.

>>> unicodedata.unidata_version
'6.0.0'
>>> unicodedata.name(u"\U0002B740") # 0x2B740-0x2B81F; CJK Unified Ideographs Extension D # unicode 6.0 addition
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(u"\U0002B740")
'Lo'
>>>

###########################


>>> unicodedata.unidata_version
'5.2.0'
>>> unicodedata.name(u"\U0002A700") # 0x2A700-0x2B73F; CJK Unified Ideographs Extension C
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(u"\U0002A700")
'Lo'
>>>
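
The same checks can be repeated as a small script against whichever
unicodedata module gets imported. This is only a sketch: probe, report
and SAMPLES are names made up here for illustration, and it uses nothing
beyond unicodedata.name (with a default value), unicodedata.category and
unicodedata.unidata_version. The code points are the ones from the
sessions above; it says nothing about whether the rest of the generated
tables are correct.

```python
import unicodedata


def probe(ch):
    """Return (name or None, general category) for one character.

    Passing a default to unicodedata.name() avoids the ValueError
    seen in the interactive sessions when no name is available.
    """
    return unicodedata.name(ch, None), unicodedata.category(ch)


# (label, character, note); labels are plain strings so that the script
# does not depend on ord() working for non-BMP characters.
SAMPLES = [
    ("U+0041", u"A", "control case"),
    ("U+2A700", u"\U0002A700", "CJK Unified Ideographs Extension C"),
    ("U+2B740", u"\U0002B740", "CJK Unified Ideographs Extension D"),
]


def report():
    """Build a list of text lines describing the loaded module."""
    lines = ["unidata_version: %s" % unicodedata.unidata_version]
    for label, ch, note in SAMPLES:
        name, category = probe(ch)
        lines.append("%-8s name=%r category=%s (%s)"
                     % (label, name, category, note))
    return lines


if __name__ == "__main__":
    for line in report():
        print(line)
```

A name of None next to a valid category would correspond to the
"no such name" case shown above.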

Could anybody please confirm whether this way of updating the
unicodedata for 2.7 is generally viable, or point out possible problems
this may lead to?
Many thanks in advance,
        Vlastimil Brom


