A simple newbie question about ndiff

Neville Franks nospam-readonly at getsoft.com
Mon Apr 22 02:00:02 EDT 2002


"Tim Peters" <tim.one at comcast.net> wrote in message
news:mailman.1019450919.19653.python-list at python.org...
> [Neville Franks]
> > One thing I've noticed now is for larger files (73,000 lines)
> > ndiff can take several minutes on my 1G PC, but that is another issue
> > again. I assume the fact that it is being interpreted has quite a bit
> > to do with this.
>
> Not so.  I've been reimplementing ndiff since the early 80's in a variety
> of languages, and the Python version is by far the fastest, due to using
> fancy algorithms that were just too painful to code in lower-level
> languages.
> ndiff is expensive because it tries very hard to produce an "intuitive"
> diff.  If you don't need that kind of diff, there are any number of diff
> variants that run much faster.
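
A quick aside on those cheaper variants: difflib's SequenceMatcher will
hand back plain line-level opcodes without any of ndiff's per-line
similarity work, which is presumably the sort of thing meant here. A
rough sketch, not production code:

import difflib

def plain_diff(a_lines, b_lines):
    # Line-level opcodes only: no per-line similarity scoring and no
    # attempt at an "intuitive" pairing of changed lines, hence much
    # cheaper than ndiff on blocks of mismatching lines.
    sm = difflib.SequenceMatcher(None, a_lines, b_lines)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == 'equal':
            continue
        if tag in ('replace', 'delete'):
            for line in a_lines[i1:i2]:
                print('- ' + line.rstrip('\n'))
        if tag in ('replace', 'insert'):
            for line in b_lines[j1:j2]:
                print('+ ' + line.rstrip('\n'))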

Actually, my entire reason for getting into Python right now was to look at
your ndiff code, see the sort of results it delivers, and get a feel for its
space/time usage. I only discovered ndiff over the weekend, which is a
little surprising as I've been working on diffs for quite a few years now as
part of my programmer's editor, ED for Windows.

I've been working on a major rewrite of ED4W over the past 3+ years and did
look at replacing my diff code with something that would hopefully be a bit
better (results/time/memory), but didn't find anything that fitted the bill.
I did have a look at fcomp, which produced good results quickly for small
files but managed to consume all available memory and time on larger files.
Hence the interest in ndiff.

It will be very interesting to compare the results of ED's Diff and ndiff
with some of the more taxing files I get from time to time.

> ndiff can be incredibly expensive when it's faced with large blocks of
> mismatching lines (meaning the blocks have no lines at all in common).
> Then it does a full cross-product search, for each line in one block
> computing a similarity score between it and every line in the other
> block.  The best-matching pair is used as a synch point, and then the
> whole business is done twice again for the mismatching sub-blocks on each
> side of the synch pair.  To keep memory use finite, it doesn't even try
> to remember the similarity computations across iterations of this, and
> does a ton of redundant work instead.  In the end, that can take time
> cubic in the number of mismatching lines.
>
> So, like most tools, it works best when it isn't really needed <wink>.

Yes indeed.
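
For anyone else following along, here is a rough sketch of the synch-point
search described above, using SequenceMatcher to score line similarity. It
is only an illustration of the idea (the cutoff value is just a plausible
threshold), not the actual ndiff source:

import difflib

def fancy_match(a, b, cutoff=0.75):
    # Sketch of the synch-point search over two blocks of lines that
    # have nothing in common at the line level.
    if not a or not b:
        return
    best_ratio, best_i, best_j = cutoff, None, None
    # The cross-product search: score every line in one block against
    # every line in the other, keeping the most similar pair.
    for i, line_a in enumerate(a):
        cruncher = difflib.SequenceMatcher(None, line_a)
        for j, line_b in enumerate(b):
            cruncher.set_seq2(line_b)
            r = cruncher.ratio()
            if r > best_ratio:
                best_ratio, best_i, best_j = r, i, j
    if best_i is None:
        return  # no pair similar enough to act as a synch point
    # Recurse on the sub-blocks either side of the synch pair.  Scores
    # are not remembered across calls, which is where the redundant
    # (and in the worst case cubic) work comes from.
    fancy_match(a[:best_i], b[:best_j], cutoff)
    print("synch pair: %r <-> %r (ratio %.2f)"
          % (a[best_i], b[best_j], best_ratio))
    fancy_match(a[best_i + 1:], b[best_j + 1:], cutoff)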

I wonder how it would perform in C++. From what you've said, it doesn't
sound like there would be much of a performance gain.

FYI, in the 73,000-line files I diffed there was only one line different
between the two. ED's diff took approx. 16 seconds; ndiff, I think, took
around 2.5 minutes.

--
Neville Franks, Author of ED for Windows - http://www.getsoft.com




