A simple newbie question about ndiff

Tim Peters tim.one at comcast.net
Mon Apr 22 00:47:22 EDT 2002


[Neville Franks]
> One thing I've noticed now is for larger files (73,000 lines)
> ndiff can take several minutes on my 1G PC, but that is another issue
> again. I assume the fact that it is being interpreted has quite a bit
> to do with this.

Not so.  I've been reimplementing ndiff since the early 80's in a variety of
languages, and the Python version is by far the fastest, due to using fancy
algorithms that were just too painful to code in lower-level languages.
ndiff is expensive because it tries very hard to produce an "intuitive"
diff.  If you don't need that kind of diff, there are any number of diff
variants that run much faster.
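
To make the tradeoff concrete, here's a minimal sketch using the standard difflib module (the file names are made up for the example): ndiff does per-line similarity scoring to produce its "intuitive" output, while taking SequenceMatcher's block-level opcodes directly skips that work and is typically much cheaper on large inputs.

    import difflib

    # Illustrative inputs -- the file names are made up for the example;
    # any two largish text files will do.
    with open("old.txt") as f:
        a = f.readlines()
    with open("new.txt") as f:
        b = f.readlines()

    # The "intuitive" diff: ndiff scores line-vs-line similarity so it can
    # mark intra-line changes; that scoring is the expensive part when big
    # blocks of lines have nothing in common.
    print("".join(difflib.ndiff(a, b)))

    # A cheaper route: use SequenceMatcher's block-level opcodes directly
    # and skip the per-line similarity analysis entirely.
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        print("%7s a[%d:%d] b[%d:%d]" % (tag, i1, i2, j1, j2))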

ndiff can be incredibly expensive when it's faced with large blocks of
mismatching lines (meaning the blocks have no lines at all in common).  Then
it does a full cross-product search, for each line in one block computing a
similarity score between it and every line in the other block.  The
best-matching pair is used as a synch point, and then the whole business is
done twice again for the mismatching sub-blocks on each side of the synch
pair.  To keep memory use finite, it doesn't even try to remember the
similarity computations across iterations of this, and does a ton of
redundant work instead.  In the end, that can take time cubic in the number
of mismatching lines.
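
Here's a rough sketch of that recursive synch-point search, using difflib's SequenceMatcher for the similarity score; it's illustrative only, not difflib's actual code (real ndiff also applies a similarity cutoff before it will accept a synch pair):

    from difflib import SequenceMatcher

    def sync_blocks(a_lines, b_lines):
        # Pair up two blocks of mismatching lines: score every line in one
        # block against every line in the other (the cross-product search),
        # take the best-scoring pair as the synch point, then recurse on the
        # sub-blocks on each side of it.
        if not a_lines or not b_lines:
            return []
        best_score, best_i, best_j = -1.0, 0, 0
        for i, a in enumerate(a_lines):
            for j, b in enumerate(b_lines):
                score = SequenceMatcher(None, a, b).ratio()
                if score > best_score:
                    best_score, best_i, best_j = score, i, j
        # Scores for pairs that fall inside the sub-blocks were already
        # computed in the loop above, but nothing is remembered, so the
        # recursive calls compute them all over again -- that redundancy,
        # on top of the cross product itself, is where the cubic worst
        # case comes from.
        return (sync_blocks(a_lines[:best_i], b_lines[:best_j])
                + [(a_lines[best_i], b_lines[best_j])]
                + sync_blocks(a_lines[best_i + 1:], b_lines[best_j + 1:]))

Feed that two blocks of a few thousand lines with nothing in common and the cost becomes obvious long before 73,000 lines.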

So, like most tools, it works best when it isn't really needed <wink>.
