Aproximative string matching

Steven D'Aprano steve at REMOVEMEcyber.com.au
Sun Nov 20 23:15:18 EST 2005


elbertlev at hotmail.com wrote:

> This algorithm is called soundex. Here is one implementation example.
> 
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213
> 
> here is another:
> http://effbot.org/librarybook/soundex.htm

Soundex is *one* particular algorithm for approximate 
string matching. It is optimised for matching 
Anglo-American names (like Smith/Smythe), and is 
considered to be quite old and obsolete for all but the 
most trivial applications -- or so I'm told.

Soundex will not match arbitrary changes -- it will 
match both cat and cet, but it won't match cat and mat.

A more sophisticated approximate string matching 
algorithm will use the Levenshtein distance. You can 
find a Useless implementation here:

http://www.uselesspython.com/download.php?script_id=108


Given a function levenshtein(s1, s2) that returns the 
distance between two strings, you could use it for 
approximate matching like this:

def approx_matching(strlist, target, dist=1):
     """Matches approximately strings in strlist to
     a target string.

     Returns a list of strings, where each string
     matched is no further than an edit distance of
     dist from the target.
     """
     found = []
     for s in strlist:
         if levenshtein(s, target) <= dist:
             found.append(s)
     return s



-- 
Steven.




More information about the Python-list mailing list