Fuzzy string comparison

John Machin sjmachin at lexicon.net
Tue Dec 26 16:08:34 EST 2006


Wojciech Mula wrote:
> Steve Bergman wrote:
> > I'm looking for a module to do fuzzy comparison of strings. [...]
>
> Check module difflib, it returns difference between two sequences.

and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.

What I would do for a quick start on this exercise (as described) is:

To compare two strings, take copies, and:
1. strip out all spaces (including \xA0 i.e.   if the data has
been anywhere near the web) from each string; also all "-" (in general
strip frequently occurring meaningless punctuation)
2. remove leading zeroes from each string
3. d = levenshtein_distance(string_a, string_b) # string_a etc is the
reduced string, not the original
4. error_metric = float(d) / max(len(string_a), len(string_b))

The error_metric will be 0.0 if the strings are the same (after
removing spaces, leading zeroes, etc) and 1.0 if they are completely
different (no characters in common).

... and you don't want anything "kind of like soundex". That's a bit
like saying you'd like to travel in an aeroplane "kind of like the
Wright brothers' " ;-)

Cheers,
John




More information about the Python-list mailing list