Aproximative string matching
Diez B. Roggisch
deets at nospam.web.de
Mon Nov 21 13:47:45 EST 2005
Steven D'Aprano wrote:
> elbertlev at hotmail.com wrote:
>
>> This algorithm is called soundex. Here is one implementation example.
>>
>> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213
>>
>> here is another:
>> http://effbot.org/librarybook/soundex.htm
>
>
> Soundex is *one* particular algorithm for approximate string matching.
> It is optimised for matching Anglo-American names (like Smith/Smythe),
> and is considered to be quite old and obsolete for all but the most
> trivial applications -- or so I'm told.
>
> Soundex will not match arbitrary changes -- it will match both cat and
> cet, but it won't match cat and mat.
>
> A more sophisticated approximate string matching algorithm will use the
> Levenshtein distance. You can find a Useless implementation here:
>
> http://www.uselesspython.com/download.php?script_id=108
>
>
> Given a function levenshtein(s1, s2) that returns the distance between
> two strings, you could use it for approximate matching like this:
>
> def approx_matching(strlist, target, dist=1):
> """Matches approximately strings in strlist to
> a target string.
>
> Returns a list of strings, where each string
> matched is no further than an edit distance of
> dist from the target.
> """
> found = []
> for s in strlist:
> if levenshtein(s, target) <= dist:
> found.append(s)
> return s
I compute a relative metric based on th every same implementation like this:
def relative(a, b):
"""
Computes a relative distance between two strings. Its in the range
(0-1] where 1 means total equality.
@type a: string
@param a: arg one
@type b: string
@param b: arg two
@rtype: float
@return: the distance
"""
d = levensthein(a,b)
longer = float(max((len(a), len(b))))
shorter = float(min((len(a), len(b))))
r = ((longer - d) / longer) * (shorter / longer)
return r
The idea is that otherwise e.g. "cat" and "hippopothamus" have a
l-distance of only 3, which one would consider good at the first look.
Regards,
Diez
More information about the Python-list
mailing list