A simple newbie question about ndiff

Tim Peters tim.one at comcast.net
Sun Apr 28 21:49:25 EDT 2002


[Neville Franks]
> Well, believe it or not, there are quite a few people using ED who
> have source files in excess of 70K [lines]. Not something I would
> advocate, but ...
>
> On a more serious note, I do have folks who compare largish data
> (ASCII text) files, typically dumps of some sort. Apart from being
> large, these files can contain large numbers of repeated lines, which
> throws my diff out and yields less than optimal results.

I just checked in a new version of difflib.py's SequenceMatcher class that
does a dynamic analysis of which elements are so frequently repeated as to
constitute noise.  The innermost matching loop can skip those at first, and
that can yield an enormous speedup.  It looks much more effective than the
current IS_LINE_JUNK gimmick.  For example, I took a 100K-line C source
file, made a copy, changed a line in the middle, and it took less than 15
seconds to do the get_opcodes() bit, including the time to read the files
and break them into lines.  I still doubt it would be a heck of a lot faster
in C++, because this mostly exercises the speed of dicts, and Python's
string dicts are extremely efficient.  I can easily imagine it running much
slower in C++, though!
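
If you want to try the same experiment on your own files, here's a minimal
sketch (the file names are made up -- substitute any two large, mostly
identical text files):

    import time
    from difflib import SequenceMatcher

    # Hypothetical file names -- use your own large inputs.
    a = open("big_v1.c").readlines()
    b = open("big_v2.c").readlines()

    start = time.time()
    # The frequent-element analysis happens inside SequenceMatcher when
    # it indexes the second sequence, so nothing extra needs to be
    # called.  The older approach was to pass a per-line junk test
    # instead, e.g. SequenceMatcher(difflib.IS_LINE_JUNK, a, b).
    m = SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in m.get_opcodes():
        if tag != "equal":      # print only the changed stretches
            print(tag, i1, i2, j1, j2)
    print("elapsed: %.2f seconds" % (time.time() - start))

That's the whole interface: get_opcodes() returns (tag, i1, i2, j1, j2)
tuples, where tag is one of 'replace', 'delete', 'insert', or 'equal',
meaning a[i1:i2] should be replaced by (or already equals) b[j1:j2].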

> I've organized a copy of ED for you to play with and would be
> interested to hear what you think of its Diff capabilities. ED
> doesn't have Python support just yet, but it is planned. It does
> support some 34+ languages, though. A free trial is available at
> www.getsoft.com for anyone else who may be interested.

Thanks!  I got around to grabbing a copy last night but haven't had time to
play with it yet.  I can testify that it installed without a hitch <wink>.
