Levenshtein word comparison - performance issue

Basilisk96 basilisk96 at gmail.com
Fri Feb 13 19:54:32 EST 2009


On Feb 13, 5:42 am, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
> You may replace the last steps (sort + slice top 5) by heapq.nlargest - at  
> least you won't waste time sorting 49995 irrelevant words...
> Anyway you should measure the time taken by the first part (Levenshtein),  
> it may be the most demanding. I think there is a C extension for this,  
> should be much faster than pure Python calculations.
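[The sort-and-slice replacement Gabriel describes can be sketched as follows. This is a minimal illustration, not code from the original thread: the `levenshtein` and `closest_words` helpers are hypothetical names, and a plain pure-Python edit distance stands in for whatever distance function the original poster used. `heapq.nsmallest` picks the few closest words without sorting the whole candidate list.]

```python
import heapq

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, kept one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def closest_words(query, words, n=5):
    # heapq.nsmallest keeps only the n best candidates, so the other
    # ~49995 words are never sorted.
    return heapq.nsmallest(n, words, key=lambda w: levenshtein(query, w))

words = ["sitting", "kitten", "mitten", "written", "bitter", "banana"]
print(closest_words("kitten", words, 3))
# → ['kitten', 'mitten', 'written']
```

[`heapq.nsmallest` is documented to behave like `sorted(iterable, key=key)[:n]`, so ties keep their input order; it just does so in O(len(words) * log n) comparisons instead of a full sort.]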

subdist: http://pypi.python.org/pypi/subdist/0.2.1
It uses a modified "fuzzy" version of the Levenshtein algorithm, which
I found more useful than the strict version. The only quirk is that it
accepts nothing but unicode strings. Other than that, it's a keeper,
and it is extremely fast.
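[For readers wondering what a "fuzzy" variant means here: the usual substring edit distance scores the needle against its best-matching substring of the haystack, so a query can match inside a longer word. Below is a pure-Python illustration of that idea only; `substring_distance` is a hypothetical name, not subdist's actual API, and subdist's speed comes from its C implementation. The `u""` literals echo the unicode-only quirk mentioned above.]

```python
def substring_distance(needle, haystack):
    """Edit distance from needle to its best-matching substring of haystack.

    Unlike the strict Levenshtein distance, the first DP row is all
    zeros (a match may start anywhere in haystack) and the answer is
    the minimum of the last row (a match may end anywhere).
    """
    prev = [0] * (len(haystack) + 1)  # free starting position
    for i, cn in enumerate(needle, 1):
        curr = [i]
        for j, ch in enumerate(haystack, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cn != ch)))  # substitution
        prev = curr
    return min(prev)  # free ending position

print(substring_distance(u"world", u"helloworld"))   # → 0 (exact substring)
print(substring_distance(u"wrld", u"hello world"))   # → 1 (one insertion away)
```

[The strict version would score u"world" against u"helloworld" as 5, since the five leading characters count as insertions; the fuzzy variant ignores them, which is why it is handy for matching a short query inside longer words.]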

Cheers,
-Basilisk96


