Levenshtein word comparison - performance issue

Peter Otten __peter__ at web.de
Sat Feb 14 04:31:38 EST 2009


Gabriel Genellina wrote:

> On Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva <s.selvamsiva at gmail.com>
> wrote:
> 
>> I need some help.
>> I tried to find the top n (e.g. 5) most similar words for a given word,
>> from a dictionary of 50,000 words.
>> I used the python-Levenshtein module; sample code is as follows.
>>
>> import Levenshtein
>>
>> def foo(self, searchword):
>>     disdict = {}
>>     for word in self.dictionary_words:
>>         distance = Levenshtein.ratio(searchword, word)
>>         disdict[word] = distance
>>     # sort the words by ratio in descending order (best match first)
>>     similarwords = sorted(disdict, key=disdict.__getitem__, reverse=True)
>>     return similarwords[:5]
> 
> You may replace the last steps (sort + slice top 5) with heapq.nlargest - at
> least you won't waste time sorting 49,995 irrelevant words...
> Anyway, you should measure the time taken by the first part (Levenshtein);
> it may be the most demanding. I think there is a C extension for this,
> which should be much faster than pure Python calculations.
> 

[I didn't see the original post]
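For the record, the nlargest variant Gabriel describes could look roughly
like this (just a sketch; top_matches and the n=5 default are my names, and
it assumes the same python-Levenshtein module as the original code):

import heapq
import Levenshtein

def top_matches(searchword, dictionary_words, n=5):
    # keep only the n best ratios in a small heap instead of
    # sorting the whole 50,000-word dictionary
    return heapq.nlargest(n, dictionary_words,
                          key=lambda word: Levenshtein.ratio(searchword, word))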

You can use the distance instead of the ratio and put the words into bins of
the same length. Because the distance between two words is at least the
difference of their lengths, only bins whose length is close to that of the
search word can contain close matches. If you find enough words with a
distance <= 1 in the bin with the same length as the search word, you can
stop looking.
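A rough sketch of this idea (build_bins and find_similar are names I made up;
it assumes Levenshtein.distance from the same module):

import collections
import Levenshtein

def build_bins(dictionary_words):
    # group the dictionary by word length once, up front
    bins = collections.defaultdict(list)
    for word in dictionary_words:
        bins[len(word)].append(word)
    return bins

def find_similar(searchword, bins, n=5):
    # search bins in order of growing length difference; a word in a
    # bin whose length differs by more than delta has distance > delta
    candidates = []  # (distance, word) pairs
    max_delta = max(max(bins), len(searchword))
    for delta in range(max_delta + 1):
        for length in {len(searchword) - delta, len(searchword) + delta}:
            for word in bins.get(length, ()):
                candidates.append((Levenshtein.distance(searchword, word), word))
        candidates.sort()
        # stop as soon as the n best words found so far are all within
        # delta -- no unsearched bin can hold anything better
        if len(candidates) >= n and candidates[n - 1][0] <= delta:
            break
    return [word for distance, word in candidates[:n]]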

You might be able to generalize this approach to other properties that are
fast to calculate and that guarantee a minimum distance, e.g. set(word): a
character that occurs in only one of the two words forces at least one edit.
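A sketch of that variant (set_lower_bound and distance_within are made-up
names; the bound rests on the observation that a single edit can add or
remove at most one distinct character):

import Levenshtein

def set_lower_bound(a, b):
    # each character unique to one of the words costs at least one edit,
    # and a single edit changes the character set by at most one element
    sa, sb = set(a), set(b)
    return max(len(sa - sb), len(sb - sa))

def distance_within(searchword, word, cutoff):
    # skip the expensive exact computation when the cheap bound already
    # proves the word is too far away
    if set_lower_bound(searchword, word) > cutoff:
        return None
    distance = Levenshtein.distance(searchword, word)
    return distance if distance <= cutoff else None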

Peter


