[Baypiggies] April snippets meeting - Unicode normalisation/normalization trick
Chris Clark
Chris.Clark at ingres.com
Fri Apr 13 07:13:44 CEST 2007
Here is the "barely a snippet, more of a reminder of the wealth of
libraries that Python ships with" that I showed this evening. jj told me
that there is a similar piece of code in the cookbook; I've not checked
it for fear of seeing a better explanation....
I've sent this mail in cp1252 encoding (almost latin1). If you can't see
some of the special characters in it, try the html attachment in a
browser instead.
I needed to perform string comparisons so that:
"o" == "ö"
would be considered true, i.e. lower case "o" matches lower case "o"
with umlaut. I really wanted this to work with all decorated characters:
"A acute", "A circumflex", ......
The unicodedata library has a normalisation function which allows
normalisation to different forms. The decomposed forms (such as NFKD,
used below) split a single decorated character into the undecorated base
character followed by a combining character (the decoration). If you can
strip the combining character off you end up with the undecorated
character. Simply encoding in 7 bit ASCII (ignoring unmappable
characters) happens to do just that. Viz.:
>>> import unicodedata
>>> test_str = u'Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk'
>>> test_str
u'Bj\xf6rk'
>>> print test_str
Björk
>>> unicodedata.normalize('NFKD', test_str )
u'Bjo\u0308rk'
>>> unicodedata.normalize('NFKD', test_str ).encode('ASCII', 'ignore')
'Bjork'
Once converted you can perform comparisons, storage, or even convert
back to Unicode :-)
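
The same trick can be wrapped up as a small helper for comparisons. This
is only a sketch: it is written in Python 3 syntax (the session above is
Python 2), and the names strip_accents and accent_insensitive_equal are
made up for illustration, not taken from any library.

```python
import unicodedata


def strip_accents(text):
    """Return text with decorations removed, keeping only ASCII characters."""
    # NFKD splits e.g. 'ö' into 'o' followed by U+0308 (combining diaeresis).
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks, along with
    # any other non-ASCII character; decode to get a str back.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def accent_insensitive_equal(a, b):
    """Compare two strings after stripping decorations from both."""
    return strip_accents(a) == strip_accents(b)


if __name__ == '__main__':
    print(strip_accents('Bj\N{LATIN SMALL LETTER O WITH DIAERESIS}rk'))
    print(accent_insensitive_equal('o', '\N{LATIN SMALL LETTER O WITH DIAERESIS}'))
```

Note that the comparison is lossy by design: any two strings that differ
only in their decorations compare equal.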
One word of caution: some characters won't decompose (for sensible
reasons). This is really intended for decorated or accented characters.
E.g.:
>>> test_str = u'\N{LATIN CAPITAL LETTER AE}'
>>> test_str
u'\xc6'
>>> print test_str
Æ
>>> unicodedata.normalize('NFKD', test_str ).encode('ASCII', 'ignore')
''
I.e. u'\N{LATIN CAPITAL LETTER AE}' has no decomposition (it is already
in NFKD form), so the 'ignore' error handler discards it and nothing is
left.
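
If silently losing characters like this is a concern, one option is to
check beforehand which characters would vanish. The following is a rough
sketch in Python 3 syntax; dropped_entirely is a hypothetical helper name,
while unicodedata.decomposition and unicodedata.normalize are the standard
library functions.

```python
import unicodedata


def dropped_entirely(text):
    """List the characters in text that leave nothing behind after
    NFKD normalisation followed by an ASCII encode with 'ignore'."""
    dropped = []
    for ch in text:
        decomposed = unicodedata.normalize('NFKD', ch)
        # An empty result means no ASCII base character survived.
        if not decomposed.encode('ascii', 'ignore'):
            dropped.append(ch)
    return dropped


if __name__ == '__main__':
    # decomposition() returns the raw mapping as a string of hex code
    # points, or an empty string when the character does not decompose.
    print(repr(unicodedata.decomposition('\N{LATIN SMALL LETTER O WITH DIAERESIS}')))
    print(repr(unicodedata.decomposition('\N{LATIN CAPITAL LETTER AE}')))
    print(dropped_entirely('\N{LATIN CAPITAL LETTER AE}sop Bj\xf6rk'))
```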
http://unicode.org is *the* place for all things Unicode but there are
some sites around that are slightly more friendly for simple lookups,
e.g. http://www.fileformat.info/info/unicode/char/00f6/index.htm gives
the name, picture (in case you do not have a suitable font installed) as
well as a bunch of other truly useful information.
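
For quick lookups without leaving the interpreter, the unicodedata module
itself can report a character's name and general category. A small sketch
in Python 3 syntax follows; describe is a hypothetical helper name, the
unicodedata calls are from the standard library.

```python
import unicodedata


def describe(ch):
    """Return (code point, name, general category) for one character."""
    return ('U+%04X' % ord(ch), unicodedata.name(ch), unicodedata.category(ch))


if __name__ == '__main__':
    print(describe('\xf6'))
    # lookup() goes the other way, from a character name to the character.
    print(unicodedata.lookup('LATIN CAPITAL LETTER AE') == '\xc6')
```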
Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/baypiggies/attachments/20070412/312c2dc2/attachment.html