ascii to latin1

Serge Orlov Serge.Orlov at gmail.com
Tue May 9 09:06:01 EDT 2006


Richie Hindle wrote:
> [Serge]
> > import unicodedata
> >
> > def search_key(s):
> >     # Decompose, then drop combining marks (categories Mn, Mc, Me)
> >     de_str = unicodedata.normalize("NFD", s)
> >     return ''.join(cp for cp in de_str if not
> >                    unicodedata.category(cp).startswith('M'))
>
> Lovely bit of code - thanks for posting it!

Well, it is not so good. Please read my next message to Luis.

>
> You might want to use "NFKD" to normalize things like LATIN SMALL
> LIGATURE FI and subscript/superscript characters as well as diacritics.

IMHO it is perfectly acceptable to declare that you don't interpret
those symbols. After all, they are called *compatibility* code points.
I tried the "one quarter" symbol: Google and MSN don't interpret it,
and Yahoo doesn't support it at all.
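
To make the difference concrete, here is a small sketch comparing NFD
and NFKD on two compatibility code points. It assumes Python 3 (where
str is Unicode), and the helper name show_forms is made up for this
illustration only:

```python
import unicodedata

def show_forms(ch):
    """Return the (NFD, NFKD) forms of ch. Hypothetical helper."""
    return (unicodedata.normalize("NFD", ch),
            unicodedata.normalize("NFKD", ch))

# LATIN SMALL LIGATURE FI (U+FB01):
# NFD leaves it alone, NFKD splits it into the two letters "f" and "i".
fi_nfd, fi_nfkd = show_forms("\ufb01")

# VULGAR FRACTION ONE QUARTER (U+00BC):
# NFD leaves it alone, NFKD yields "1", FRACTION SLASH (U+2044), "4".
q_nfd, q_nfkd = show_forms("\u00bc")
```

So with NFD these symbols simply pass through untouched, which matches
the "we don't interpret them" position.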

The NFKD form is also trickier to use. It loses the semantics of
characters: for example, if you have the character "digit two" followed
by "superscript digit two", they look like 2 to the power of 2, but NFKD
will convert them into 22 (twenty-two), which is wrong. So if you want
to use NFKD for search, you will have to preprocess your data, for
example by inserting a space between the twos.
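
A rough sketch of that kind of preprocessing, assuming Python 3. The
function name separate_superscripts and the decision to handle only
superscript digits are my own choices for the example; real data would
probably need more cases (subscripts, fractions and so on):

```python
import re
import unicodedata

# Superscript digits: U+00B2, U+00B3, U+00B9, U+2070, U+2074..U+2079
SUPERSCRIPT_DIGITS = ("\u00b2\u00b3\u00b9\u2070"
                      "\u2074\u2075\u2076\u2077\u2078\u2079")

def separate_superscripts(s):
    """Insert a space between an ASCII digit and a following superscript
    digit, so NFKD cannot fuse them into one number. Hypothetical helper."""
    return re.sub("([0-9])([%s])" % SUPERSCRIPT_DIGITS, r"\1 \2", s)

text = "2\u00b2"  # DIGIT TWO followed by SUPERSCRIPT TWO

# Without preprocessing the two characters collapse into "22".
naive = unicodedata.normalize("NFKD", text)

# With preprocessing the digits stay apart: "2 2".
spaced = unicodedata.normalize("NFKD", separate_superscripts(text))
```

Text without a digit before the superscript, such as "x" followed by
SUPERSCRIPT TWO, is left unchanged by the helper.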



