[issue6931] dreadful performance in difflib: ndiff and HtmlDiff

Tim Peters report at bugs.python.org
Sat Jul 19 04:39:25 CEST 2014


Tim Peters added the comment:

I'm sympathetic, but I don't see a good solution here short of backward-incompatible changes.

ndiff was built to generate "the highest quality diff possible", for text written and edited by humans, where "quality" is measured by human judgment (not an abstract mathematical measure).  That was its goal, and I think it does well at that.  It wasn't intended to unwind mechanical changes made to machine-generated gibberish (to human eyes) text.

Not that such a thing has no value, but it was never ndiff's _intent_ to handle such stuff.  So the best solution here would be to use a different differencing engine.

As already noted, unified_diff skips any attempt to find "merely similar" lines, and that search is the great source of expense here:  ndiff takes time essentially cubic in the number of lines, given inputs with a great many similar (but not identical) lines.  It was noted that kdiff3 does fine on these inputs very quickly, but, from the kdiff3 FAQ:  "If similar lines appear next to each other, this actually is coincidence but this fortunately is often the case."  It just so happens that in the inputs attached to this report, the first lines in each file pair are similar, and the second lines likewise, etc etc.  This allows kdiff3 to do a wonderful job without any time at all spent _trying_ to find similar lines.
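
Purely for illustration (not from the original report; the sizes and the input generator below are arbitrary), here is the kind of input that triggers the bad case, timed against both engines:

    import difflib
    import random
    import string
    import time

    def perturbed_lines(base, marker, n):
        # Every line is `base` with one character overwritten, so any two
        # lines are highly similar, but no line in one file is identical
        # to any line in the other.
        return [base[:i] + marker + base[i + 1:] for i in range(n)]

    n = 100
    rng = random.Random(0)
    base = "".join(rng.choice(string.ascii_lowercase) for _ in range(n))
    a = perturbed_lines(base, "#", n)
    b = perturbed_lines(base, "@", n)

    t = time.perf_counter()
    list(difflib.unified_diff(a, b, lineterm=""))
    print("unified_diff: %.3f seconds" % (time.perf_counter() - t))

    t = time.perf_counter()
    list(difflib.ndiff(a, b))       # far slower on this input
    print("ndiff:        %.3f seconds" % (time.perf_counter() - t))

The gap widens quickly as n grows, while unified_diff stays cheap because it never scores near-miss line pairs.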

But there's nothing coincidental here in what ndiff does:  it finds "_the_ most similar pair", and that's a quadratic-time operation (which turns into cubic time when repeated over & over).
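
Roughly what one such pass looks like (an illustrative sketch only; the real Differ code additionally applies cheap quick-ratio prefilters, a junk filter, and a similarity cutoff):

    import difflib

    def most_similar_pair(a_lines, b_lines):
        # A single "find the most similar pair" pass: every line of a is
        # scored against every line of b, so one pass is already quadratic
        # in the number of lines.  ndiff repeats passes like this on the
        # sub-blocks around each synch point it finds, which is what can
        # push the total toward cubic.
        cruncher = difflib.SequenceMatcher(None)
        best, best_ratio = None, 0.0
        for j, bline in enumerate(b_lines):
            cruncher.set_seq2(bline)        # seq2's statistics are cached
            for i, aline in enumerate(a_lines):
                cruncher.set_seq1(aline)    # changing seq1 is cheap
                r = cruncher.ratio()
                if r > best_ratio:
                    best, best_ratio = (i, j), r
        return best, best_ratio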

I didn't write any of the HTML-generating functions, nor have I ever used them, so I'm not in a good position to judge what they "should do".  If similar lines aren't interesting in this context, then switching to unified_diff would save gobs of time.  If similar lines are interesting, then new code would be required to get at _some_ notion of that less expensively than ndiff does it (e.g., the similarity scores for all pairs could be computed once and saved away, reducing worst-case cubic time to worst-case quadratic time, but at the cost of needing additional memory quadratic in the number of lines).
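
A sketch of that caching idea (illustrative only, not proposed code): the same double loop as above, but every score is kept, so later queries over sub-blocks become plain lookups.

    import difflib

    def score_all_pairs(a_lines, b_lines):
        # Score every (a, b) line pair exactly once.  The dict needs memory
        # quadratic in the number of lines, but no ratio() computation is
        # ever repeated when sub-blocks are revisited.
        cruncher = difflib.SequenceMatcher(None)
        scores = {}
        for j, bline in enumerate(b_lines):
            cruncher.set_seq2(bline)
            for i, aline in enumerate(a_lines):
                cruncher.set_seq1(aline)
                scores[i, j] = cruncher.ratio()
        return scores

    def best_in_block(scores, alo, ahi, blo, bhi, cutoff=0.75):
        # Re-examining a sub-block is now dictionary lookups only;
        # 0.75 is the "similar enough" cutoff ndiff itself uses.
        best, best_score = None, cutoff
        for i in range(alo, ahi):
            for j in range(blo, bhi):
                if scores[i, j] > best_score:
                    best, best_score = (i, j), scores[i, j]
        return best, best_score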

I'm -1 on 11740.patch:  there is no reason, in general, to give up looking for the most similar pair - and give up on intraline differencing - simply because the total number of lines in the two blocks exceeds 99.  It would reduce diff quality for all uses of the Differ class, not just for the specific uses of Differ made indirectly (via ndiff) by the make_table function - and it would be backward-incompatible anyway.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue6931>
_______________________________________

