Fuzzy Lookups

ajones ajones1 at gmail.com
Mon Jan 30 20:28:47 EST 2006


BBands wrote:
> I have some CDs and have been archiving them on a PC. I wrote a Python
> script that spans the archive and returns a list of its contents:
> [[genre, artist, album, song]...]. I wanted to add a search function to
> locate all the versions of a particular song. This is harder than you
> might think. For example the Cajun "national anthem" is Jolie Blond,
> but it can be spelled several different ways jolie, joli, blon, blond,
> etc... In addition the various online services that provide song info
> are riddled with typos, so an ordinary string match just doesn't get
> you there. What is needed is a fuzzy string match and it turns out that
> there is a very good one, the Levenshtein distance, which is the number
> of inserts, deletions and substitutions needed to morph one string into
> another. In my application I match the desired song title against all
> song titles in my list and return the ones with the lowest Levenshtein
> distances. This is remarkably, one might even say stunningly,
> effective, easily finding all the version of Jolie Blon in the list.
>
> I am using the following snippet (found on the web, proper attribution
> unsure), which calculates the Levenshtein distance.
>
> def distance(a,b):
>     c = {}
>     n = len(a); m = len(b)
>
>     for i in range(0,n+1):
>         c[i,0] = i
>     for j in range(0,m+1):
>         c[0,j] = j
>
>     for i in range(1,n+1):
>         for j in range(1,m+1):
>             x = c[i-1,j]+1
>             y = c[i,j-1]+1
>             if a[i-1] == b[j-1]:
>                 z = c[i-1,j-1]
>             else:
>                 z = c[i-1,j-1]+1
>             c[i,j] = min(x,y,z)
>     return c[n,m]
>
> As mentioned above this works quite well and I am happy with it, but I
> wonder if there is a more Pythonic way of doing this type of lookup?
>
>     jab

Here is my stab at it, didn't fully test it so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed[1].

def distance(a, b):
    """
    Computes the levenshtein distance between two strings
    """
    m, n = (len(a),a), (len(b),b)
    if(m[0] < n[0]):                #ensure that the 'm' tuple holds
the longest string
        m, n = n, m
    dist = m[0]                     #assume distance = length of
longest string (worst case)
    for i in range(0, n[0]):       # reduce the distance for each char
match in shorter string
        if m[1][i] == n[1][i]:
            dist = dist - 1
    return dist

[1] I have no if this is true or not. I can see arguments for both.




More information about the Python-list mailing list