How to waste computer memory?

Chris Angelico rosuav at gmail.com
Sat Mar 19 10:14:41 EDT 2016


On Sat, Mar 19, 2016 at 11:42 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> The problem is not so much the existence of combining characters, but that
>> *some* but not all accented characters are available in two forms: a
>> composed single code point, and a decomposed pair of code points.
>
> Also, is an a with ring on top and another ring on bottom the same
> character as an a with ring on bottom and another ring on top?

Unicode has an answer for this one: normalization. It doesn't go quite
as far as I thought, but it does at least settle this exact question.

>>> import unicodedata
>>> print(ascii(unicodedata.normalize("NFC","a\u0325\u030a")))
'\u1e01\u030a'
>>> print(ascii(unicodedata.normalize("NFC","a\u030a\u0325")))
'\u1e01\u030a'
>>> print(ascii(unicodedata.normalize("NFD","a\u0325\u030a")))
'a\u0325\u030a'
>>> print(ascii(unicodedata.normalize("NFD","a\u030a\u0325")))
'a\u0325\u030a'

So yes, they are the same combined character. Whether you ask for the
composed form or the decomposed form, you get the exact same sequence
of codepoints from either initial ordering - either this:

'a' LATIN SMALL LETTER A
'\u0325' COMBINING RING BELOW
'\u030a' COMBINING RING ABOVE

or this:

'\u1e01' LATIN SMALL LETTER A WITH RING BELOW
'\u030a' COMBINING RING ABOVE

but never this:

'\xe5' LATIN SMALL LETTER A WITH RING ABOVE
'\u0325' COMBINING RING BELOW

which will normalize to either of the above.
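
As a quick check of that last claim, here is a small sketch that feeds all
three spellings through both forms and collects the distinct results. The
names `spellings`, `nfc` and `nfd` are just for illustration.

```python
import unicodedata

# Three spellings of "a with a ring above and a ring below":
# ring below first, ring above first, and precomposed U+00E5 plus ring below.
spellings = ["a\u0325\u030a", "a\u030a\u0325", "\xe5\u0325"]

# Collect the distinct results; a single-element set means they all agree.
nfc = {unicodedata.normalize("NFC", s) for s in spellings}
nfd = {unicodedata.normalize("NFD", s) for s in spellings}

for s in sorted(nfc):
    print("NFC:", ascii(s))   # NFC: '\u1e01\u030a'
for s in sorted(nfd):
    print("NFD:", ascii(s))   # NFD: 'a\u0325\u030a'
```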

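I had been of the belief that NFC/NFD normalization would *always*
provide a canonical ordering for the combining characters, but
apparently it only reorders marks whose canonical combining classes
differ. Two marks of the same class, like the tilde and acute here,
keep whatever order they were given: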

>>> print(ascii(unicodedata.normalize("NFC","q\u0303\u0301")))
'q\u0303\u0301'
>>> print(ascii(unicodedata.normalize("NFC","q\u0301\u0303")))
'q\u0301\u0303'
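
One way to see why the two cases behave differently is to look at each
mark's canonical combining class with `unicodedata.combining()`. This is
only an illustrative sketch; the `marks` and `classes` names are made up
for the example.

```python
import unicodedata

# The four combining marks used in the examples in this post.
marks = "\u0325\u030a\u0303\u0301"

# Map each mark to its canonical combining class (0 would mean "not reordered").
classes = {ch: unicodedata.combining(ch) for ch in marks}

for ch in marks:
    print(ascii(ch), classes[ch], unicodedata.name(ch))
# '\u0325' 220 COMBINING RING BELOW
# '\u030a' 230 COMBINING RING ABOVE
# '\u0303' 230 COMBINING TILDE
# '\u0301' 230 COMBINING ACUTE ACCENT
```

The two rings are in different classes, so normalization can put them in a
fixed order. The tilde and the acute share class 230, so their relative
order is left alone.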

(And NFK[CD] doesn't change this either.) But if you're really worried
about these kinds of equivalencies, you could write your own
"super-normalize" function which first NFKD normalizes, then sorts all
sequences of combining characters into codepoint order, and finally
NFKC or NFKD normalizes to canonicalize everything.
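
A minimal sketch of what that might look like follows. The function name
`super_normalize` and its `form` parameter are hypothetical, and I'm
assuming "combining character" means anything with a nonzero canonical
combining class, which is only one possible reading.

```python
import unicodedata

def super_normalize(text, form="NFKC"):
    """Hypothetical helper, not a Unicode-defined normalization form.

    NFKD-normalize, sort each run of combining marks by code point,
    then normalize again to `form` ("NFKC" or "NFKD").
    """
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    run = []  # current run of consecutive combining marks
    for ch in decomposed:
        if unicodedata.combining(ch):  # assumption: nonzero class = combining
            run.append(ch)
        else:
            out.extend(sorted(run))    # flush the finished run in code point order
            run = []
            out.append(ch)
    out.extend(sorted(run))            # flush a trailing run, if any
    # The final pass restores canonical ordering between different classes
    # and (for NFKC) recomposes whatever can be recomposed.
    return unicodedata.normalize(form, "".join(out))

print(ascii(super_normalize("q\u0303\u0301")))  # 'q\u0301\u0303'
print(ascii(super_normalize("q\u0301\u0303")))  # 'q\u0301\u0303'
```

Bear in mind that this deliberately treats as equal some sequences that
Unicode keeps distinct, and which may well render differently, so it is
better suited to loose matching than to storage.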

ChrisA


