[Baypiggies] April snippets meeting - Unicode normalisation/normalization trick

Fri Apr 13 07:13:44 CEST 2007

Here is the "barely a snippet, more of a reminder of the wealth of 
libraries that Python ships with" that I showed this evening. jj told me 
that there is a similar piece of code in the cookbook; I've not checked 
it for fear of seeing a better explanation...

I've sent this mail in cp1252 encoding (almost latin1). If you can't see 
some of the special characters in it, try the html attachment in a 
browser instead.

I needed to perform string comparisons so that:

    "o" == "ö"

would be considered true, i.e. lower case "o" matches lower case "o" 
with umlaut. I really wanted this to work with all decorated characters: 
"A acute", "A circumflex", and so on.

The unicodedata module has a normalisation function that can normalise 
to several different forms. One of these forms decomposes a single 
decorated character into the undecorated base character followed by one 
or more combining characters (the decorations). If you can strip the 
combining characters off, you end up with the undecorated character. 
Simply encoding to 7-bit ASCII (ignoring unmappable characters) happens 
to do just that. Viz.:

    >>> import unicodedata
    >>> test_str = u'Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk'
    >>> test_str
    u'Bj\xf6rk'
    >>> print test_str
    Björk
    >>> unicodedata.normalize('NFKD', test_str)
    u'Bjo\u0308rk'
    >>> unicodedata.normalize('NFKD', test_str).encode('ASCII', 'ignore')
    'Bjork'

Once converted you can perform comparisons, storage, or even convert 
back to Unicode :-)
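
The same trick can be wrapped up as a small helper for comparisons. This 
is only a minimal sketch, written for Python 3 (where every str is 
already Unicode); the names strip_accents and loosely_equal are my own 
inventions for illustration, not part of the standard library.

```python
import unicodedata


def strip_accents(text):
    """Return text with decorations removed, as a plain ASCII string.

    Hypothetical helper: decompose with NFKD, then drop everything
    that does not fit in 7-bit ASCII (including the combining marks).
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def loosely_equal(a, b):
    """Compare two strings while ignoring accents and other decorations."""
    return strip_accents(a) == strip_accents(b)


if __name__ == '__main__':
    name = 'Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk'
    print(strip_accents(name))
    print(loosely_equal('o', '\N{LATIN SMALL LETTER O WITH DIAERESIS}'))
```

Note that any character that is neither ASCII nor decomposable into 
ASCII is silently dropped, so this suits matching and searching better 
than display.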

One word of caution: some characters won't decompose (for sensible 
reasons); this is really intended for decorated or accented characters. 
E.g.:

    >>> test_str = u'\N{LATIN CAPITAL LETTER AE}'
    >>> test_str
    u'\xc6'
    >>> print test_str
    Æ
    >>> unicodedata.normalize('NFKD', test_str).encode('ASCII', 'ignore')
    ''

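I.e. u'\N{LATIN CAPITAL LETTER AE}' has no decomposition, so it is 
already normalised (to NFKD); being outside ASCII, it is then dropped 
entirely by the 'ignore' encode.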
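
If silently losing such characters is a problem, one possible variation 
(a different technique from the ASCII encode above, and only a sketch) 
is to remove just the combining marks after decomposition and leave 
everything else alone. It relies on unicodedata.combining(), which 
returns a non-zero class for combining characters; the function name 
strip_marks is hypothetical, and the code assumes Python 3.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks but keep characters that do not decompose.

    Hypothetical helper: unlike encoding to ASCII with 'ignore', this
    leaves undecomposable non-ASCII characters such as AE in place.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed
                   if not unicodedata.combining(ch))


if __name__ == '__main__':
    # The o with diaeresis loses its decoration ...
    print(strip_marks('Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk'))
    # ... while the AE ligature, which has no decomposition, is kept.
    print(strip_marks('\N{LATIN CAPITAL LETTER AE}'))
```

The result is no longer guaranteed to be ASCII, so whether this is 
preferable depends on what the converted string is used for.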

http://unicode.org is *the* place for all things Unicode but there are 
some sites around that are slightly more friendly for simple lookups, 
e.g. http://www.fileformat.info/info/unicode/char/00f6/index.htm gives 
the name, picture (in case you do not have a suitable font installed) as 
well as a bunch of other truly useful information.

Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/baypiggies/attachments/20070412/312c2dc2/attachment.html