unicodedata.normalize (NFD - NFC) inconsistency

Christos TZOTZIOY Georgiou tzot at sil-tec.gr
Mon Nov 8 08:23:16 EST 2004


I found at least one case where decomposing and recomposing a Unicode
character does not result in the same character (see the example at the
end).

I have no extensive knowledge of Unicode, yet I believe that this
must be an issue with the Unicode 3.2 specification and not with Python.
However, I haven't found out how decomp_data (in unicodedata_db.h)
is built, nor did I find much more information about the specifics of
Unicode 3.2.  So I thought I'd post here; anyone more knowledgeable
could take a look.

If we find out that it's a problem with Python, I'll open a bug report
(and volunteer to work on it).

*** Example ***

>>> import unicodedata as ud
>>> def report(utext):
...     for uchar in utext:
...         print ord(uchar), ud.name(uchar)
...
>>> u1=u'\N{greek small letter alpha with oxia}'
>>> report(u1)
8049 GREEK SMALL LETTER ALPHA WITH OXIA
>>> u2=ud.normalize('NFD', u1)
>>> report(u2)
945 GREEK SMALL LETTER ALPHA
769 COMBINING ACUTE ACCENT
>>> u3=ud.normalize('NFC', u2)
>>> report(u3)
940 GREEK SMALL LETTER ALPHA WITH TONOS
>>> 

*** End of Example ***
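
The session above uses Python 2 syntax (print statement, u'' literals).
A minimal sketch of the same round trip for Python 3 follows; the helper
name describe is only illustrative, and the result depends on the Unicode
database bundled with the interpreter.

```python
import unicodedata as ud

def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(ord(ch), ud.name(ch)) for ch in text]

original = "\N{GREEK SMALL LETTER ALPHA WITH OXIA}"
decomposed = ud.normalize("NFD", original)
recomposed = ud.normalize("NFC", decomposed)

for label, text in [("original", original),
                    ("NFD", decomposed),
                    ("NFC(NFD)", recomposed)]:
    print(label, describe(text))

# False here means the NFD -> NFC round trip did not restore the input.
print("round trip preserved the character:", recomposed == original)
```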

I can understand where this comes from: as far as I have found, there is
no COMBINING GREEK TONOS or COMBINING TONOS ACCENT in the Unicode table,
so when decomposing, one has to use the 'oxeia' (acute) accent...
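
One way to check where the behaviour comes from is to look at the raw
decomposition mappings that unicodedata exposes. The sketch below assumes
Python 3; canonical_mapping is a hypothetical helper, not part of the
module. As far as I can tell from the Unicode data, the oxia character
maps to the single tonos character, which in turn maps to alpha plus
COMBINING ACUTE ACCENT, so both share one decomposed form and NFC can
only pick one of them when recomposing.

```python
import unicodedata as ud

def canonical_mapping(ch):
    """Return the code points in ch's decomposition mapping, skipping any <tag>."""
    fields = ud.decomposition(ch).split()
    return [int(f, 16) for f in fields if not f.startswith("<")]

oxia = "\N{GREEK SMALL LETTER ALPHA WITH OXIA}"
tonos = "\N{GREEK SMALL LETTER ALPHA WITH TONOS}"

for ch in (oxia, tonos):
    print("U+%04X %s ->" % (ord(ch), ud.name(ch)))
    for cp in canonical_mapping(ch):
        print("    U+%04X %s" % (cp, ud.name(chr(cp))))

# Both characters decompose to the same sequence, so they are canonically
# equivalent; NFC turns the oxia form into the tonos form.
print(ud.normalize("NFD", oxia) == ud.normalize("NFD", tonos))
print(ud.normalize("NFC", oxia) == tonos)
```

If that reading is right, the two characters are canonically equivalent
in the Unicode data itself, and the normalization result is not specific
to Python.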
-- 
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek


