Fuzzy Lookups

Gregory Piñero gregpinero at gmail.com
Mon Jan 30 23:10:13 EST 2006


Ok, ok, I got it!  The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this.  If anyone's interested in seeing it, I'll post a blog
entry on it.  I'm betting it uses a similiar algorithm your functions.

-Greg


On 30 Jan 2006 17:28:47 -0800, ajones <ajones1 at gmail.com> wrote:
>
> BBands wrote:
> > I have some CDs and have been archiving them on a PC. I wrote a Python
> > script that spans the archive and returns a list of its contents:
> > [[genre, artist, album, song]...]. I wanted to add a search function to
> > locate all the versions of a particular song. This is harder than you
> > might think. For example the Cajun "national anthem" is Jolie Blond,
> > but it can be spelled several different ways jolie, joli, blon, blond,
> > etc... In addition the various online services that provide song info
> > are riddled with typos, so an ordinary string match just doesn't get
> > you there. What is needed is a fuzzy string match and it turns out that
> > there is a very good one, the Levenshtein distance, which is the number
> > of inserts, deletions and substitutions needed to morph one string into
> > another. In my application I match the desired song title against all
> > song titles in my list and return the ones with the lowest Levenshtein
> > distances. This is remarkably, one might even say stunningly,
> > effective, easily finding all the version of Jolie Blon in the list.
> >
> > I am using the following snippet (found on the web, proper attribution
> > unsure), which calculates the Levenshtein distance.
> >
> > def distance(a,b):
> >     c = {}
> >     n = len(a); m = len(b)
> >
> >     for i in range(0,n+1):
> >         c[i,0] = i
> >     for j in range(0,m+1):
> >         c[0,j] = j
> >
> >     for i in range(1,n+1):
> >         for j in range(1,m+1):
> >             x = c[i-1,j]+1
> >             y = c[i,j-1]+1
> >             if a[i-1] == b[j-1]:
> >                 z = c[i-1,j-1]
> >             else:
> >                 z = c[i-1,j-1]+1
> >             c[i,j] = min(x,y,z)
> >     return c[n,m]
> >
> > As mentioned above this works quite well and I am happy with it, but I
> > wonder if there is a more Pythonic way of doing this type of lookup?
> >
> >     jab
>
> Here is my stab at it, didn't fully test it so it may not work
> correctly. The tuples could be avoided by using len(x), but the extra
> lookups may cost in execution speed[1].
>
> def distance(a, b):
>     """
>     Computes the levenshtein distance between two strings
>     """
>     m, n = (len(a),a), (len(b),b)
>     if(m[0] < n[0]):                #ensure that the 'm' tuple holds
> the longest string
>         m, n = n, m
>     dist = m[0]                     #assume distance = length of
> longest string (worst case)
>     for i in range(0, n[0]):       # reduce the distance for each char
> match in shorter string
>         if m[1][i] == n[1][i]:
>             dist = dist - 1
>     return dist
>
> [1] I have no if this is true or not. I can see arguments for both.
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>


--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)



More information about the Python-list mailing list