ascii to latin1

Serge Orlov Serge.Orlov at gmail.com
Tue May 9 09:06:54 EDT 2006


Luis P. Mendes wrote:
>
> Richie Hindle wrote:
> > [Serge]
> >> def search_key(s):
> >>     de_str = unicodedata.normalize("NFD", s)
> >>     return ''.join(cp for cp in de_str if not
> >>                    unicodedata.category(cp).startswith('M'))
> >
> > Lovely bit of code - thanks for posting it!
> >
> > You might want to use "NFKD" to normalize things like LATIN SMALL
> > LIGATURE FI and subscript/superscript characters as well as diacritics.
> >
>
> Thank you very much for your info.  It's a very good approach.
>
> When I used the "NFD" option, I came across many errors on these and
> possibly other codes: \xba, \xc9, \xcd.

What errors? The normalize function is not supposed to raise any errors.
Do you mean it doesn't work as expected? Well, I have to admit that using
normalize is a far from perfect way to implement search. The most
advanced algorithm is the Unicode Collation Algorithm, published by the
Unicode guys: <http://www.unicode.org/reports/tr10/> If you read it,
you'll understand that it's not so easy.
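
To see what normalize actually does with the code points you mention,
something like the following may help. It is only a sketch: it assumes
Python 3 strings (in Python 2 these would be unicode objects), and
strip_marks is a hypothetical helper name, not part of unicodedata.

```python
import unicodedata

def strip_marks(s, form):
    """Decompose s with the given normalization form, then drop combining marks."""
    decomposed = unicodedata.normalize(form, s)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

# The code points mentioned in this thread.
for ch in "\xba\xbf\xc9\xcd":
    print("U+%04X %-40s NFD=%r NFKD=%r" % (
        ord(ch), unicodedata.name(ch),
        strip_marks(ch, "NFD"), strip_marks(ch, "NFKD")))
```

Characters with a canonical decomposition (\xc9, \xcd) lose their accent
under either form; \xba only has a compatibility decomposition, so it
changes only under "NFKD"; \xbf has no decomposition at all, so no form
will touch it.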

>
> I tried to use "NFKD" instead, and the number of errors was only about
> half a dozen, for a universe of 600000+ names, on code \xbf.
> It looks like I have to do a search and substitute using regular
> expressions for these cases.  Or is there a better way to do it?

Perhaps you can use the unicode translate method to map the characters
that still give you problems to whatever you want.
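
A minimal sketch of that idea, assuming Python 3 strings. The LEFTOVERS
table and its entries are my own illustrative choices, not a complete or
recommended mapping; fill it with whatever your data needs.

```python
import unicodedata

# Hypothetical table for characters that normalization leaves untouched.
# Keys are code points; values are replacement strings, or None to delete.
LEFTOVERS = {
    0xBF: None,   # INVERTED QUESTION MARK: drop it
    0xDF: "ss",   # LATIN SMALL LETTER SHARP S
    0xF8: "o",    # LATIN SMALL LETTER O WITH STROKE
}

def search_key(s):
    """Decompose with NFKD, strip combining marks, then map the leftovers."""
    de_str = unicodedata.normalize("NFKD", s)
    stripped = "".join(cp for cp in de_str
                       if not unicodedata.category(cp).startswith("M"))
    return stripped.translate(LEFTOVERS)

print(search_key("\xbfQu\xe9?"))
```

The translate step runs after normalization, so the table only needs
entries for characters that have no decomposition.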



