diff lists

Alex Martelli aleaxit at yahoo.com
Fri Mar 30 07:30:05 EST 2001


"Steven D. Majewski" <sdm7g at Virginia.EDU> wrote in message
news:mailman.985914639.5108.python-list at python.org...
    [snip]
> On 29 Mar 2001, Michael Hudson wrote:
> > Tangentially, does anyone know of any good algorithms for "edit
> > distance" between two sequences?  E.g. if I have
> > "abcdef"
> > and want to get to
> > "abQUACKcde"
> > I want to get the answer back "insert 'QUACK' at position 3 and delete
> > a character at position 11".
> > "Good" means "pretty quick", here.
    [snip]
> http://www.ling.ohio-state.edu/~cbrew/795M/string-distance.html
>
> However, the algorithm for the distance does not give location, but
> maybe this info can be extracted from the intermediate matrix.

Thanks Steve for the good pointers -- the latter in particular
(by Chris Brew) is expressed in pretty-clear Pythonish pseudocode
(although incomplete and suffering a typo or two) so it was not
hard to transliterate -- here are both versions, distance only
and full-path.  I've CC'd Chris (as credit for using his page:-)

[Chris, the incompleteness is on the second half of what I
called edit_path, and the typo is in your line which goes
     d(i+1,0) = (sofar + delCost,DEL,d(0,j)) (**)
where it should be
     d(i+1,0) = (sofar + delCost,DEL,d(i,o)) (**)
but, mainly, thanks for that page!-)].


def atomDistance(char1, char2, substCost=1.0):
    if char1==char2: return 0
    return substCost

def edit_dist(string1, string2, delCost=1.0, insCost=1.0,
atomDistance=atomDistance):
    m = len(string1)
    n = len(string2)
    d = [ [0.0]*(n+1) for i in range(m+1) ]
    for i in range(m):
        d[i+1][0] = d[i][0] + delCost
    for j in range(n):
        d[0][j+1] = d[0][j] + insCost
    for i in range(m):
        for j in range(n):
            subst = atomDistance(string1[i], string2[j])
            d[i+1][j+1] = min(d[i][j]+subst,
                              d[i+1][j]+insCost,
                              d[i][j+1]+delCost)
    return d[m][n]

def edit_path(string1, string2, delCost=1.0, insCost=1.0,
atomDistance=atomDistance):
    m = len(string1)
    n = len(string2)
    d = [ [(0.0,'Start',None)]*(n+1) for i in range(m+1) ]
    for i in range(m):
        cost_so_far, move, tb = d[i][0]
        d[i+1][0] = cost_so_far+delCost, 'Del(%s)'%string1[i], d[i][0]
    for j in range(n):
        cost_so_far, move, tb = d[0][j]
        d[0][j+1] = cost_so_far+insCost, 'Ins(%s)'%string2[j], d[0][j]
    for i in range(m):
        for j in range(n):

            subst = atomDistance(string1[i], string2[j])
            if string1[i]==string2[j]:
                submove = 'Ok(%s)'%string1[i]
            else:
                submove = 'Sub(%s->%s)'%(string1[i],string2[j])
            cost_so_far, move, tb = d[i][j]
            cost_with_sub = cost_so_far + subst

            cost_so_far, move, tb = d[i+1][j]
            cost_with_ins = cost_so_far + insCost
            insmove = 'Ins(%s)'%string2[j]

            cost_so_far, move, tb = d[i][j+1]
            cost_with_del = cost_so_far + delCost
            delmove = 'Del(%s)'%string1[i]

            if cost_with_sub < cost_with_ins:
                if cost_with_sub < cost_with_del:
                    d[i+1][j+1] = cost_with_sub, submove, d[i][j]
                else:
                    d[i+1][j+1] = cost_with_del, delmove, d[i][j+1]
            else:
                if cost_with_ins < cost_with_del:
                    d[i+1][j+1] = cost_with_ins, insmove, d[i+1][j]
                else:
                    d[i+1][j+1] = cost_with_del, delmove, d[i][j+1]
    return d[m][n]

def unravel(chain):
    cost, move, previous = chain
    if previous is None:
        return [move]
    else:
        return unravel(previous) + [move]

def showedit(s1, s2):
    print "%s -> %s: %d" % (s1, s2, edit_dist(s1,s2))
    path = edit_path(s1, s2)
    print path
    print unravel(path)

if __name__=='__main__':
    import sys
    if len(sys.argv)==3:
        showedit(sys.argv[1],sys.argv[2])
    else:
        showedit("practical","rational")


Here is sample interactive use for Michel's question:

D:\pybridge>python
Python 2.0 (#8, Oct 16 2000, 17:27:58) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
>>> import ed
>>> edit = ed.unravel(ed.sdistr('abcdef','abQUACKcde'))
>>> for move in edit:
...   print move
...
Start
Ok(a)
Ok(b)
Ins(Q)
Ins(U)
Ins(A)
Ins(C)
Ins(K)
Ok(c)
Ok(d)
Ok(e)
Del(f)
>>>


There are of course lots of little tweaks one could do to speed
this up in practice (the linked-list form of the 'chain' is neat
but probably costly, and it needs to be unraveled anyway -- and
unravel() itself might be non-recursive -- etc, etc) but I do
believe this version is quite readable and didactically valid.


Alex






More information about the Python-list mailing list