Fuzzy string matching?

Magnus L. Hetland mlh at idt.ntnu.no
Sat Aug 28 13:31:23 EDT 1999


Moshe Zadka <moshez at server.python.net> writes:

> On 28 Aug 1999, Magnus L. Hetland wrote:
[...]
> > Could perhaps be a bit more readable, though... (Maybe I'll work on
> > that...)
> 
> No need, my good man. We have a saying in Hebrew "The work of the rightous
> gets done by others"

OK - thanks :)

> 
> Added a comment, made a few variables more meaningful, precalculated
> ranges, changed some ; to , and here is the changed, and a bit tested,
> version:

Awright... I guess it's bit more readable now... However, there are a
few things I think I would do differently - surly only a matter of
taste... (and not meant as a sign of ingratitude :)

> 
> def distance(a,b):
>     n, m = map(len, (a,b))

Is the use of map really clearer than not using it here, with only two
variables? (It may be faster since you don't have to look up len
twice, but...)

>     if n>m:  # Make sure n<=m
>         a,b,n,m = b,a,m,n

I guess I wrote this statement in the original - but I think perhaps I
would split the strings and lengths into two different assignments to
make it more readable, really...

>     range_m, range_n= map(range, (m, n))

Hm. Why? Why is range_m better than range(m)?

>     curr = range(0, n+1)
>     for i in range_m:
>         prev, curr = curr, [i+1]
>         for j in range_n:
>             add, dele = prev[j+1]+1, curr[j]+1
>             change = prev[j] + (a[j]!=b[i] and 1 or 0)

Hm. This is perhaps more compact - and the and/or thing *is* an idiom,
but I don't think it's more readable than a plain if-statement...

>             curr.append(min(add, dele, change))
>     return curr[n]
> 

Here's another take - mainly an expression of my tastes in the matter
(for lack of anything more meaningful to do in the evening... ;)


def distance(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
    current = range(0, n+1)
    for i in range(m):
        previous, current = current, [i+1]
        for j in range(n):
            add, delete = previous[j+1]+1, current[j]+1
            change = previous[j]
            if a[j] != b[i]:
                change = change + 1
            current.append(min(add, delete, change))
    return current[n]


Oh well... There's nothing like fiddling with a little piece of code.
Much more fun than writing large programs methinks :)

--

  Magnus              Making no sound / Yet smouldering with passion
  Lie          The firefly is still sadder / Than the moaning insect
  Hetland                                       : Minamoto Shigeyuki




More information about the Python-list mailing list