Python Unicode handling wins again -- mostly

Ethan Furman ethan at stoneleaf.us
Mon Dec 2 16:27:08 EST 2013


On 12/02/2013 01:23 PM, Chris Angelico wrote:
> On Tue, Dec 3, 2013 at 8:14 AM, Ned Batchelder <ned at nedbatchelder.com> wrote:
>> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case that
>> some grapheme clusters (or whatever the right word is) can't be normalized
>> down to a single code point?  Characters can accept many accents, for
>> example.
>
> You can't normalize everything down to a single code point, but you
> can normalize the other way by breaking out everything that can be
> broken out.
>
> >>> print(ascii(unicodedata.normalize("NFKC", "ä")))
> '\xe4'
> >>> print(ascii(unicodedata.normalize("NFKD", "ä")))
> 'a\u0308'

Well, Stephen was right then!  There's room for a library to handle this situation.  Or is there one already?
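
For illustration only, here is a rough sketch of the kind of thing such a library might do. It is not an existing package: the name combining_clusters is made up for this example, and the approach is a simplification that only attaches combining marks to the preceding base character, which is not the full grapheme cluster segmentation defined by Unicode. It composes with NFC first, then groups whatever could not be composed.

```python
import unicodedata


def combining_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration. This is a simplification and
    does not implement full Unicode grapheme cluster rules.
    """
    clusters = []
    # Compose first, so anything with a precomposed form becomes one code point.
    for ch in unicodedata.normalize("NFC", text):
        # combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "a" + COMBINING DIAERESIS has a precomposed form; "q" + COMBINING DIAERESIS does not.
sample = "a\u0308q\u0308"
print([ascii(c) for c in combining_clusters(sample)])
```

The second cluster stays two code points long even after NFC, which is the case Ned was asking about: one visible character, no single code point.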

--
~Ethan~


