Aproximative string matching

Diez B. Roggisch deets at nospam.web.de
Mon Nov 21 13:47:45 EST 2005


Steven D'Aprano wrote:
> elbertlev at hotmail.com wrote:
> 
>> This algorithm is called soundex. Here is one implementation example.
>>
>> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213
>>
>> here is another:
>> http://effbot.org/librarybook/soundex.htm
> 
> 
> Soundex is *one* particular algorithm for approximate string matching. 
> It is optimised for matching Anglo-American names (like Smith/Smythe), 
> and is considered to be quite old and obsolete for all but the most 
> trivial applications -- or so I'm told.
> 
> Soundex will not match arbitrary changes -- it will match both cat and 
> cet, but it won't match cat and mat.
> 
> A more sophisticated approximate string matching algorithm will use the 
> Levenshtein distance. You can find a Useless implementation here:
> 
> http://www.uselesspython.com/download.php?script_id=108
> 
> 
> Given a function levenshtein(s1, s2) that returns the distance between 
> two strings, you could use it for approximate matching like this:
> 
> def approx_matching(strlist, target, dist=1):
>     """Matches approximately strings in strlist to
>     a target string.
> 
>     Returns a list of strings, where each string
>     matched is no further than an edit distance of
>     dist from the target.
>     """
>     found = []
>     for s in strlist:
>         if levenshtein(s, target) <= dist:
>             found.append(s)
>     return s

I compute a relative metric based on th every same implementation like this:

def relative(a, b):
     """
     Computes a relative distance between two strings. Its in the range
     (0-1] where 1 means total equality.
     @type a: string
     @param a: arg one
     @type b: string
     @param b: arg two
     @rtype: float
     @return: the distance
     """
     d = levensthein(a,b)
     longer = float(max((len(a), len(b))))
     shorter = float(min((len(a), len(b))))
     r = ((longer - d) / longer) * (shorter / longer)
     return r


The idea is that otherwise e.g. "cat" and "hippopothamus" have a 
l-distance of only 3, which one would consider good at the first look.


Regards,

Diez



More information about the Python-list mailing list