Levenshtein word comparison - performance issue

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Fri Feb 13 05:42:19 EST 2009


On Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva <s.selvamsiva at gmail.com>  
wrote:

> I need some help.
> I am trying to find the top n (e.g. 5) most similar words for a given word,
> from a dictionary of 50,000 words.
> I used the python-levenshtein module; sample code is as follows.
>
> def foo(self, searchword):
>     disdict = {}
>     for word in self.dictionary_words:
>         distance = Levenshtein.ratio(searchword, word)
>         disdict[word] = distance
>     # sort the disdict dictionary by values in descending order
>     similarwords = sorted(disdict, key=disdict.__getitem__, reverse=True)
>     return similarwords[:5]

You may replace the last steps (sort + slice top 5) with heapq.nlargest - at  
least you won't waste time sorting the 49,995 irrelevant words...
In any case, you should measure the time taken by the first part (computing  
the Levenshtein ratios); it is likely the most demanding step. I think there  
is a C extension for this, which should be much faster than pure Python  
calculations.
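A minimal sketch of the heapq.nlargest approach: it keeps only the n best  
candidates in a small heap instead of sorting the whole 50,000-word list.  
Here difflib.SequenceMatcher.ratio (stdlib) stands in for Levenshtein.ratio  
so the snippet runs without the C extension installed; swap in  
Levenshtein.ratio for real use. The word list is made up for illustration.

```python
import difflib
import heapq

def top_similar(searchword, words, n=5):
    # Stand-in similarity; replace with Levenshtein.ratio if available.
    def ratio(word):
        return difflib.SequenceMatcher(None, searchword, word).ratio()
    # nlargest avoids a full sort: O(len(words) * log n) for the
    # selection instead of O(len(words) * log len(words)).
    return heapq.nlargest(n, words, key=ratio)

if __name__ == "__main__":
    words = ["apple", "apply", "ample", "maple", "banana", "grape"]
    print(top_similar("apple", words, n=3))
```

Note that the similarity computation itself still dominates, so it is worth  
timing that loop separately (e.g. with the timeit module) before optimizing  
the selection step.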

-- 
Gabriel Genellina



