Fuzzy Lookups
Gregory Piñero
gregpinero at gmail.com
Tue Jan 31 16:24:46 EST 2006
I wonder which algorithm does a better job of measuring the similarity between two strings: difflib's matching or Levenshtein distance?
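One rough way to compare them is to run both on the same pairs of strings. The sketch below is only an illustration: levenshtein() is a textbook dynamic-programming edit distance, and levenshtein_similarity() and difflib_similarity() are hypothetical helper names. Dividing by the longer length is just one possible way to put the distance on the same 0..1 scale as SequenceMatcher.ratio().

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalisation: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def difflib_similarity(a, b):
    """The ratio difflib computes: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b).ratio()
```

The two measures can disagree. For "abcd" and "bcde", difflib_similarity() gives 0.75 (the shared block "bcd" counts in full), while levenshtein_similarity() gives 0.5 (two edits out of four characters). Which one is "better" probably depends on whether shifted but intact runs of characters should count as similar.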
On 1/31/06, Kent Johnson <kent at kentsjohnson.com> wrote:
> Gregory Piñero wrote:
> > Ok, ok, I got it! The Pythonic way is to use an existing library ;-)
> >
> > import difflib
> > CloseMatches = difflib.get_close_matches(AFileName, AllFiles, 20, .7)
> >
> > I wrote a script to delete duplicate mp3's by filename a few years
> > back with this. If anyone's interested in seeing it, I'll post a blog
> > entry on it. I'm betting it uses a similar algorithm to your functions.
>
> A quick trip to difflib.py produces this description of the matching
> algorithm:
>
> The basic algorithm predates, and is a little fancier than, an
> algorithm published in the late 1980's by Ratcliff and Obershelp
> under the hyperbolic name "gestalt pattern matching". The basic idea
> is to find the longest contiguous matching subsequence that contains
> no "junk" elements (R-O doesn't address junk). The same idea is then
> applied recursively to the pieces of the sequences to the left and to
> the right of the matching subsequence.
>
> So no, it doesn't seem to be using Levenshtein distance.
>
> Kent
> --
> http://mail.python.org/mailman/listinfo/python-list
>
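For anyone following along, here is a minimal sketch of the difflib call quoted above. The third and fourth arguments of get_close_matches are n (the maximum number of matches to return) and cutoff (the minimum similarity ratio, between 0 and 1). The helper name find_similar_names and the file names are made up for illustration; this is not the original duplicate-finding script.

```python
import difflib


def find_similar_names(name, all_names, max_results=20, cutoff=0.7):
    """Return names from all_names that resemble name, best match first."""
    # Drop the name itself so an exact self-match is not reported.
    candidates = [other for other in all_names if other != name]
    return difflib.get_close_matches(name, candidates,
                                     n=max_results, cutoff=cutoff)


# Hypothetical file names, for illustration only.
files = ["Artist - Song.mp3", "Artist - Song (1).mp3", "Other - Tune.mp3"]
similar = find_similar_names("Artist - Song.mp3", files)
```

Raising cutoff toward 1.0 should reduce false positives at the cost of missing looser matches, so the 0.7 used here is a starting point rather than a recommendation.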
--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)