unicodedata.normalize (NFD - NFC) inconsistency
Christos TZOTZIOY Georgiou
tzot at sil-tec.gr
Mon Nov 8 08:23:16 EST 2004
I found at least one case where decomposing and then recomposing a Unicode
character does not yield the original character (see the example at the end).
I have no extensive knowledge of Unicode, yet I believe that this
must be a problem in the Unicode 3.2 specification and not in Python.
However, I haven't found out how decomp_data (in unicodedata_db.h)
is built, nor did I find much more information about the specifics of
Unicode 3.2. So I thought I'd post here; perhaps someone more
knowledgeable could take a look.
If we find out that it's a problem with Python, I'll open a bug report
(and volunteer to work on it).
*** Example ***
>>> import unicodedata as ud
>>> def report(utext):
...     for uchar in utext:
...         print ord(uchar), ud.name(uchar)
...
>>> u1=u'\N{greek small letter alpha with oxia}'
>>> report(u1)
8049 GREEK SMALL LETTER ALPHA WITH OXIA
>>> u2=ud.normalize('NFD', u1)
>>> report(u2)
945 GREEK SMALL LETTER ALPHA
769 COMBINING ACUTE ACCENT
>>> u3=ud.normalize('NFC', u2)
>>> report(u3)
940 GREEK SMALL LETTER ALPHA WITH TONOS
>>>
*** End of Example ***
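For anyone who wants to repeat the check without typing the session by hand, here is a minimal sketch of the same round trip as a script. It assumes Python 3 syntax (print as a function, plain string literals), and the helper name roundtrip is made up for illustration; it is not part of unicodedata.

```python
import unicodedata as ud

def roundtrip(ch):
    """Return the code points of ch, of its NFD form, and of NFC applied to that NFD form."""
    nfd = ud.normalize('NFD', ch)   # decompose
    nfc = ud.normalize('NFC', nfd)  # recompose the decomposed form
    return [ord(c) for c in ch], [ord(c) for c in nfd], [ord(c) for c in nfc]

# Hypothetical usage: the character from the example above.
orig, nfd, nfc = roundtrip('\N{GREEK SMALL LETTER ALPHA WITH OXIA}')
print(orig, nfd, nfc)
print('round trip preserved:', orig == nfc)
```

The same helper can be run over other characters to see how many of them fail to survive the round trip.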
I can understand the confusion: as far as I can tell, there is no
COMBINING GREEK TONOS or COMBINING TONOS ACCENT in the Unicode table,
so when decomposing, one has to use the 'oxeia' (acute) accent...
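One way to look at the decomposition data from Python itself, rather than reading unicodedata_db.h, is unicodedata.decomposition(), which returns the single-step mapping for a character as a string of hex code points. The sketch below assumes Python 3 (chr instead of unichr), and the helper name decomposition_chain is hypothetical. It only shows what the table contains, not whether that content is intended.

```python
import unicodedata as ud

def decomposition_chain(ch):
    """Follow single-step canonical decomposition mappings, starting at ch.

    Returns a list of (character name, mapping) pairs. Only the first code
    point of each mapping (the base character) is followed further.
    """
    chain = []
    while True:
        mapping = ud.decomposition(ch)
        # Stop when there is no mapping, or when the mapping is a
        # compatibility one (those start with a tag such as '<compat>').
        if not mapping or mapping.startswith('<'):
            break
        chain.append((ud.name(ch), mapping))
        ch = chr(int(mapping.split()[0], 16))
    return chain

for name, mapping in decomposition_chain('\N{GREEK SMALL LETTER ALPHA WITH OXIA}'):
    print(name, '->', mapping)
```

If the table maps ALPHA WITH OXIA to ALPHA WITH TONOS as a one-character decomposition, and only ALPHA WITH TONOS to alpha plus the combining acute, that may be why recomposing lands on the TONOS character.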
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek