difflib

Mikkel Rasmussen footech at get2net.dk
Mon Apr 23 03:01:36 EDT 2001


Tim Peters <tim.one at home.com> wrote in message
news:mailman.987979892.8018.python-list at python.org...
> [Mikkel Rasmussen]
> > Has anybody got any references for the algorithm used in difflib.
>
> Nope.  It's an algorithm I dreamed up in the early 80's, while at Cray
> Research, writing a file-differencing tool.  I've carried it with me ever
> since, reimplementing it in at least a dozen languages along the way.  The
> basic idea popped up when staring at some incomprehensible "diff" output,
and
> wondering "hmmmm -- what if diff restricted itself to *contiguous*
> subsequences?  and then what if it didn't synch up on junk either?".
That's
> about it.  You can find more discussion in the comments in module ndiff.py
> (difflib's SequenceMatcher class started its life in ndiff.py years ago;
it
> simply got split out into its own module for 2.1).
>
> When Eric Raymond mentioned the Ratcliff-Obershelp algorithm on
Python-Dev,
> it was news to me, and a Google search quickly turned up a C
implementation
> of that.  It was obviously the same basic algorithm, except without the
"skip
> junk" gimmick, and at least the implementations I found were much less
> efficient than SequenceMatcher (this is one of those happy instances where
my
> Python code runs *faster* than previous C, Fortran and Pascal
> implementations:  I had the *idea* of chaining together hash lists to
speed
> the search a long time ago, but never had the patience to *implement* that
> before recoding in Python -- it's easy in Python).
>
> > ...
> > Are there any explanations available elsewhere?
>
> You can find a little more by doing a google search.
>
>

I have written my own sequence matcher as a part of my master thesis in
computational linguistics (I have made a spelling checker for dyslexics), so
I have a fair amount of information on various fuzzy match algorithms, but I
couldn't find anything useful via Google on the difflib algorithm. The
origin of the algorithm wasn't exactly clear from the documentation :-)

I'm going to test your algorithm later today to see if it outperforms my own
both in quality and speed. Quality is the most important. Speed is just
nice.

Mikkel Rasmussen






More information about the Python-list mailing list