Levenshtein word comparison - performance issue
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Fri Feb 13 05:42:19 EST 2009
On Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva <s.selvamsiva at gmail.com>
wrote:
> I need some help.
> I tried to find the top n (e.g. 5) most similar words for a given word,
> from a dictionary of 50,000 words.
> I used the python-levenshtein module; sample code follows.
>
> def foo(searchword):
>     disdict = {}
>     for word in self.dictionary_words:
>         distance = Levenshtein.ratio(searchword, word)
>         disdict[word] = distance
>     # sort the disdict dictionary by values in descending order
>     similarwords = sorted(disdict, key=disdict.__getitem__, reverse=True)
>
>     return similarwords[:5]
You may replace the last steps (sort + slice top 5) with heapq.nlargest - at
least you won't waste time sorting the 49,995 irrelevant words...
Anyway, you should measure the time taken by the first part (the Levenshtein
comparisons); it may well be the most demanding. I think there is a C
extension for this, which should be much faster than pure Python
calculations.
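To illustrate the heapq.nlargest suggestion, here is a minimal sketch. It uses difflib.SequenceMatcher.ratio from the standard library as a stand-in for Levenshtein.ratio (both return a 0.0-1.0 similarity score where higher means more similar); the function name top_similar and the sample word list are made up for this example:

```python
import heapq
from difflib import SequenceMatcher

def top_similar(searchword, dictionary_words, n=5):
    # Similarity score for one candidate word; SequenceMatcher.ratio
    # stands in here for Levenshtein.ratio.
    def score(word):
        return SequenceMatcher(None, searchword, word).ratio()
    # heapq.nlargest keeps only the n best candidates while scanning,
    # avoiding a full sort of the whole 50,000-word dictionary.
    return heapq.nlargest(n, dictionary_words, key=score)

words = ["apple", "ample", "maple", "apply", "orange", "banana"]
print(top_similar("appel", words, n=3))
```

This scans the dictionary once and maintains a small heap of size n, so the post-scoring cost is O(k log n) rather than the O(k log k) of sorting all k words.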
--
Gabriel Genellina