difflib

Tue Apr 24 07:39:38 EDT 2001

Mikkel Rasmussen <footech at get2net.dk> wrote in message
news:5mSE6.44$677.2732 at news.get2net.dk...
>
> Tim Peters <tim.one at home.com> wrote in message
> news:mailman.987979892.8018.python-list at python.org...
> > [Mikkel Rasmussen]
> > > Has anybody got any references for the algorithm used in difflib.
> >
> > Nope.  It's an algorithm I dreamed up in the early 80's, while at Cray
> > Research, writing a file-differencing tool.  I've carried it with me
ever
> > since, reimplementing it in at least a dozen languages along the way.
The
> > basic idea popped up when staring at some incomprehensible "diff"
output,
> and
> > wondering "hmmmm -- what if diff restricted itself to *contiguous*
> > subsequences?  and then what if it didn't synch up on junk either?".
> That's
> > about it.  You can find more discussion in the comments in module
ndiff.py
> > (difflib's SequenceMatcher class started its life in ndiff.py years ago;
> it
> > simply got split out into its own module for 2.1).
> >
> > When Eric Raymond mentioned the Ratcliff-Obershelp algorithm on
> Python-Dev,
> > it was news to me, and a Google search quickly turned up a C
> implementation
> > of that.  It was obviously the same basic algorithm, except without the
> "skip
> > junk" gimmick, and at least the implementations I found were much less
> > efficient than SequenceMatcher (this is one of those happy instances
where
> my
> > Python code runs *faster* than previous C, Fortran and Pascal
> > implementations:  I had the *idea* of chaining together hash lists to
> speed
> > the search a long time ago, but never had the patience to *implement*
that
> > before recoding in Python -- it's easy in Python).
> >
>
> I have now tested the difflib algorithm and my own algorithm. The difflib
is
> about twice as fast as my own (which is not optimised in any way). They
> return near identical results. It will be interesting to look closer at
the
> difflib algorithm.
>
> Mikkel Rasmussen
>
>

I have now combined some of the difflib rudimentary functions with my own
sequence matcher (fuzzy match) algorithm. My own algorithm now runs ca. as
fast as the get_close_matches function, but it returns better results for a
large number of possibilities.

Thanks for the info.

Mikkel Rasmussen