Fuzzy string comparison

John Machin sjmachin at lexicon.net
Tue Dec 26 17:23:22 EST 2006


Carsten Haese wrote:
> On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote:
> > Wojciech Mula wrote:
> > > Steve Bergman wrote:
> > > > I'm looking for a module to do fuzzy comparison of strings. [...]
> > >
> > > Check module difflib, it returns difference between two sequences.
> >
> > and it's intended for comparing text files, and is relatively slow.
> >
> > Google "python levenshtein". You'll probably find this a better fit for
> > typoed keys in a database.
> > [...]
>
> Using the Levenshtein distance in combination with stripping "noise"
> characters is a good start, but the OP might want to take it a step
> further. One of the OP's requirements is to recognize visually similar
> strings, but 241O (Letter O at the end) and 241X have the same
> Levenshtein distance from 2410 (digit zero at the end) while the former
> is visually much closer to 2410 than the latter.
>
> It seems to me that this could be achieved by starting with a standard
> Levenshtein implementation such as http://hetland.org/python/distance.py
> and altering the line "change = change + 1" to something like "change =
> change + visual_distance(a[j-1], b[i-1])". visual_distance() would be a
> function that embodies the OP's idea of which character replacements are
> tolerable by returning a number between 0 (the two characters are
> visually identical) and 1 (the two characters are completely different).

Ya ya ya, I could have told him a whole lot more -- please consider
that what I did tell him was IMHO over-generous in response to an OT
question asking for assistance with performing a Google search.

... and given his keys are described as "numbers", a better example
might be 241O or 241o false-matching with 2416.

... and it might be a good idea if he ran the simplistic approach first
and saw what near-misses he actually came up with before complicating
it and slowing down what is already an O(N**2*L**2) exercise in the
traditional/novice implementation where N is the number of keys and L
is their average length.

The OP needs to think about 123456789 compared with 123426789; are they
the same account or not? What other information does he have?

HTH,
John




More information about the Python-list mailing list