Fuzzy string matching

Magnus L. Hetland mlh at idt.ntnu.no
Fri Aug 27 09:51:03 EDT 1999


Explanation of the distance algorithm...

(Got a request for this, and thought I might as well post it...)

The algorithm:

def distance(a,b):
    c = {}
    n = len(a); m = len(b)

    for i in range(0,n+1):
        c[i,0] = i
    for j in range(0,m+1):
        c[0,j] = j
        
    for i in range(1,n+1):
        for j in range(1,m+1):
            x = c[i-1,j]+1
            y = c[i,j-1]+1
            if a[i-1] == b[j-1]:
                z = c[i-1,j-1]
            else:
                z = c[i-1,j-1]+1
            c[i,j] = min(x,y,z)
    return c[n,m]

It calculates the following: Given two strings, a and b, and three
operations, adding, subtracting and exchanging single characters, what
is the minimal number of steps needed to translate a into b?

The method is based on the following idea:

  We want to find the distance between a[:x] and b[:y]. To do this, we
  first calculate

  1) the distance between a[:x-1] and b[:y], adding the cost of a
     subtract-operation, used to get from a[:x] to a[:z-1];

  2) the distance between a[:x] and b[:y-1], adding the cost of an
     addition-operation, used to get from b[:y-1] to b[:y];

  3) the distance between a[:x-1] and b[:y-1], adding the cost of a
     *possible* exchange of the letter b[y] (with a[x]).

The cost of the subtraction and addition operations are 1, while the
exchange operation has a cost of 1 if a[x] and b[y] are different, and
0 otherwise.

After calculating these costs, we choose the least one of them (since
we want to use the best solution.)

Instead of doing this recursively (i.e. calculating ourselves "back"
from the final value), we build a cost-matrix c containing the optimal
costs, so we can reuse them when calculating the later values. The
costs c[i,0] (from string of length n to empty string) are all i, and
correspondingly all c[0,j] (from empty string to string of length j)
are j.

Finally, the cost of translating between the full strings a and b
(c[n,m]) is returned.

I guess that ought to cover it...

--

  Magnus              Making no sound / Yet smouldering with passion
  Lie          The firefly is still sadder / Than the moaning insect
  Hetland                                       : Minamoto Shigeyuki




More information about the Python-list mailing list