Fuzzy string comparison

Carsten Haese carsten at uniqsys.com
Tue Dec 26 16:46:41 EST 2006


On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote:
> Wojciech Mula wrote:
> > Steve Bergman wrote:
> > > I'm looking for a module to do fuzzy comparison of strings. [...]
> >
> > Check module difflib, it returns difference between two sequences.
> 
> and it's intended for comparing text files, and is relatively slow.
> 
> Google "python levenshtein". You'll probably find this a better fit for
> typoed keys in a database.
> [...]

Using the Levenshtein distance in combination with stripping "noise"
characters is a good start, but the OP might want to take it a step
further. One of the OP's requirements is to recognize visually similar
strings, but 241O (Letter O at the end) and 241X have the same
Levenshtein distance from 2410 (digit zero at the end) while the former
is visually much closer to 2410 than the latter.

It seems to me that this could be achieved by starting with a standard
Levenshtein implementation such as http://hetland.org/python/distance.py
and altering the line "change = change + 1" to something like "change =
change + visual_distance(a[j-1], b[i-1])". visual_distance() would be a
function that embodies the OP's idea of which character replacements are
tolerable by returning a number between 0 (the two characters are
visually identical) and 1 (the two characters are completely different).

Hope this helps,

-Carsten





More information about the Python-list mailing list