ascii to latin1

Serge Orlov Serge.Orlov at gmail.com
Mon May 8 21:07:15 EDT 2006


Luis P. Mendes wrote:
> Hi,
>
> I'm developing a django based intranet web server that has a search page.
>
> Data contained in the database is mixed.  Some of the words are
> accented, some are not but they should be.  This is because the
> collection of data began a long time ago when ascii was the only way to go.
>
> The problem is that users have to search more than once for some words,
> because the searched word may or may not be accented.  If we consider
> that some expressions can have several letters that can be accented, the
> search effort is too much.
>
> I've searched the net for some kind of solution but couldn't find one.
> I've only found solutions for the opposite.
>
> example:
> if the word searched is 'televisão', I want that a search by either
> 'televisao', 'televisão' or even 'télévisao' (this last one doesn't
> exist in Portuguese) is successful.
>
> So, instead of only one search, there will be several used.
>
> Is there anything already coded, or will I have to try to do it all by
> myself?

You need to convert from latin1 to ascii, not from ascii to latin1. The
function below does that. Then you need to build the database index not
on the latin1 text but on the ascii text. After that, convert the user
input to ascii and search.
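
As a quick illustration of why this approach works (a sketch only; the
helper name describe is made up for this example): NFD normalization
splits an accented letter into its base letter followed by one or more
combining marks, and combining marks have a Unicode category that starts
with 'M'.

```python
import unicodedata

def describe(ch):
    # Hypothetical helper: return (character name, category) for every
    # code point in the NFD decomposition of ch.
    return [(unicodedata.name(cp), unicodedata.category(cp))
            for cp in unicodedata.normalize("NFD", ch)]

# U+00E3 is LATIN SMALL LETTER A WITH TILDE
for name, cat in describe(u"\u00e3"):
    print("%s %s" % (name, cat))
```

This prints "LATIN SMALL LETTER A Ll" and then "COMBINING TILDE Mn", so
dropping every code point in an 'M' category leaves the plain letter.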

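import unicodedata

def search_key(s):
    # Decompose accented characters into base letter + combining marks
    # (NFD), then drop the combining marks (categories starting with 'M').
    de_str = unicodedata.normalize("NFD", s)
    return ''.join(cp for cp in de_str
                   if not unicodedata.category(cp).startswith('M'))

print(search_key(u"televisão"))
print(search_key(u"télévisao"))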

===== Result:
televisao
televisao
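
The indexing and searching steps could be sketched roughly like this.
This is only an illustration under assumptions: a plain in-memory dict
stands in for the database index, and build_index and search are
hypothetical names, not anything provided by django. In a real database
the accent-free key would more likely live in an extra indexed column.

```python
import unicodedata
from collections import defaultdict

def search_key(s):
    # Same idea as the function above: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    return u"".join(cp for cp in decomposed
                    if not unicodedata.category(cp).startswith("M"))

def build_index(words):
    # Hypothetical stand-in for the database index: maps each
    # accent-free key to the original spellings that produce it.
    index = defaultdict(set)
    for word in words:
        index[search_key(word)].add(word)
    return index

def search(index, query):
    # Convert the user input with the same function before the lookup.
    return sorted(index.get(search_key(query), ()))

index = build_index([u"televisão", u"televisao", u"rádio"])
print(search(index, u"télévisao"))
```

Because the stored words and the query both go through search_key, a
single lookup finds the accented and the unaccented spelling alike.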



