Fuzzy string matching?

Al Christians achrist at easystreet.com
Thu Aug 26 22:17:33 EDT 1999


I've gotten good results with ad hoc algorithms built on a longest common
contiguous substring routine.  _Introduction to Algorithms_ by Cormen,
Leiserson, and Rivest gives a dynamic programming algorithm for the
longest common subsequence (a common substring that need not be
contiguous), which might be a better indicator of a match, but modifying
it to test only for contiguous substrings improves efficiency
considerably, particularly space efficiency.  The lengths of the two or
three longest common contiguous substrings give a pretty good indication
of the degree of match in the applications I've tried (name and address
cleanup).  How to combine these lengths into a scalar measure of match
is the really ad hoc part of it.
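
As a rough illustration of the approach described above, here is a
minimal Python sketch.  It is not the code behind the results mentioned
in this post: the function names, the min_len cutoff, the use of the
characters \x00 and \x01 as separators (assumed never to occur in the
input), and the scoring formula are all assumptions made for the sake of
the example.  The scoring step in particular is just one arbitrary way
to fold the lengths into a single number.

```python
def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of the longest common
    contiguous substring, using dynamic programming with two rows."""
    if len(b) > len(a):
        # Keep the DP rows as short as the shorter string.
        length, start_b, start_a = longest_common_substring(b, a)
        return length, start_a, start_b
    best_len = best_a = best_b = 0
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_a = i - best_len
                    best_b = j - best_len
        prev = curr
    return best_len, best_a, best_b


def top_common_lengths(a, b, k=3, min_len=2):
    """Lengths of up to k longest non-overlapping common substrings.

    Assumption: "\x00" and "\x01" never appear in the inputs. Each match
    is cut out and replaced by a separator so later matches cannot span
    the gap or reuse the same characters.
    """
    lengths = []
    for _ in range(k):
        n, i, j = longest_common_substring(a, b)
        if n < min_len:
            break
        lengths.append(n)
        a = a[:i] + "\x00" + a[i + n:]
        b = b[:j] + "\x01" + b[j + n:]
    return lengths


def match_score(a, b, k=3):
    """Hypothetical scalar score in [0, 1]: matched characters divided by
    the length of the shorter string. Purely an ad hoc choice."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return sum(top_common_lengths(a, b, k)) / shorter
```

Under these assumptions, match_score("john smith", "smith, john") comes
out to 0.9: "smith" (5 characters) and "john" (4 characters) are found
out of the 10 characters in the shorter string.  Because each match is
simply cut out of both strings, the order in which the pieces appear does
not matter, which seems to suit reordered names; whether that is
desirable depends on the application.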


Al



