unicodedata.normalize (NFD - NFC) inconsistency
Brion Vibber
brion at pobox.com
Mon Nov 8 20:40:47 EST 2004
Christos TZOTZIOY Georgiou wrote:
> I found at least one case where decombining and recombining a unicode
> character does not result in the same character (see at end).
>
> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.
I've been spending some time lately writing a normalizer (in PHP of all
things -- yeesh!), and yes, Unicode is a scary world. :) Although it may
seem counterintuitive, it is in fact perfectly legitimate for a
character not to be its own canonical composition.
> >>> u1=u'\N{greek small letter alpha with oxia}'
> >>> report(u1)
>
> 8049 GREEK SMALL LETTER ALPHA WITH OXIA
This character has a "singleton decomposition": it decomposes into the
single character GREEK SMALL LETTER ALPHA WITH TONOS, which further
decomposes into GREEK SMALL LETTER ALPHA and a COMBINING ACUTE ACCENT.
Because singletons are excluded from composition, the character is by
definition not normalized: when you normalize it to form C it turns into
GREEK SMALL LETTER ALPHA WITH TONOS, and there is no way to get "back"
to the original character in a normalized string. For more info see:
http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table
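The decomposition chain described above can be inspected directly with
the standard unicodedata module. This is only a sketch: it uses Python 3
string literals rather than the u'...' form in the quoted session, and
canonical_decomposition is a hypothetical helper name, not part of the
library. It assumes the character has a canonical mapping (no <compat>
style tag in the decomposition field).

```python
import unicodedata as ud

oxia = '\u1f71'   # GREEK SMALL LETTER ALPHA WITH OXIA (8049)
tonos = '\u03ac'  # GREEK SMALL LETTER ALPHA WITH TONOS (940)

def canonical_decomposition(ch):
    """Return the one-step decomposition mapping as a list of code points.

    Hypothetical helper; assumes the mapping is canonical, i.e. the
    decomposition field carries no <tag> prefix.
    """
    return [int(cp, 16) for cp in ud.decomposition(ch).split()]

# A singleton: the mapping consists of exactly one other character.
print(canonical_decomposition(oxia))

# That character in turn maps to a base letter plus a combining accent.
print(canonical_decomposition(tonos))

# Form C never produces the OXIA character, only the TONOS one.
print(ud.normalize('NFC', oxia) == tonos)
```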
> >>> u2=ud.normalize('NFD', u1)
> >>> report(u2)
>
> 945 GREEK SMALL LETTER ALPHA
> 769 COMBINING ACUTE ACCENT
>
> >>> u3=ud.normalize('NFC', u2)
> >>> report(u3)
>
> 940 GREEK SMALL LETTER ALPHA WITH TONOS
You should get this same result directly for ud.normalize('NFC', u1).
Converting directly to NFC should always give the same result as
converting to NFD and then NFC. Either will give you back the string you
started with if and only if it's already normalized to form C.
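Both claims are easy to spot-check. The sketch below is in Python 3, and
the helper names nfc_is_consistent and is_nfc are made up for
illustration. It tests only a handful of sample strings, so it
demonstrates the behaviour rather than proving it.

```python
import unicodedata as ud

def nfc_is_consistent(s):
    """True if direct NFC equals NFD followed by NFC for this string."""
    direct = ud.normalize('NFC', s)
    via_nfd = ud.normalize('NFC', ud.normalize('NFD', s))
    return direct == via_nfd

def is_nfc(s):
    """True if the string is already in form C (normalizing is a no-op)."""
    return ud.normalize('NFC', s) == s

samples = [
    '\u1f71',        # alpha with oxia: singleton, never in form C
    '\u03ac',        # alpha with tonos: already in form C
    '\u03b1\u0301',  # alpha + combining acute: form D, not form C
    'abc',           # plain ASCII: unaffected by normalization
]

for s in samples:
    code_points = [hex(ord(c)) for c in s]
    print(code_points, nfc_is_consistent(s), is_nfc(s))
```

Only the strings for which is_nfc is true survive the round trip
unchanged; the other two both come back as alpha with tonos.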
-- brion vibber (brion @ pobox.com)