Aproximative string matching
Steven D'Aprano
steve at REMOVEMEcyber.com.au
Sun Nov 20 23:15:18 EST 2005
elbertlev at hotmail.com wrote:
> This algorithm is called soundex. Here is one implementation example.
>
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213
>
> here is another:
> http://effbot.org/librarybook/soundex.htm
Soundex is *one* particular algorithm for approximate
string matching. It is optimised for matching
Anglo-American names (like Smith/Smythe), and is
considered to be quite old and obsolete for all but the
most trivial applications -- or so I'm told.
Soundex will not match arbitrary changes -- it will
match both cat and cet, but it won't match cat and mat.
A more sophisticated approximate string matching
algorithm will use the Levenshtein distance. You can
find a Useless implementation here:
http://www.uselesspython.com/download.php?script_id=108
Given a function levenshtein(s1, s2) that returns the
distance between two strings, you could use it for
approximate matching like this:
def approx_matching(strlist, target, dist=1):
"""Matches approximately strings in strlist to
a target string.
Returns a list of strings, where each string
matched is no further than an edit distance of
dist from the target.
"""
found = []
for s in strlist:
if levenshtein(s, target) <= dist:
found.append(s)
return s
--
Steven.
More information about the Python-list
mailing list