Minimal diffs in difflib?

Thu Jan 3 13:59:12 EST 2002

[Magnus Lie Hetland]
> When asking about using difflib to implement version control, I was
> pointed to the get_opcodes method of SequenceMatcher by Tim Peters,
> and that was indeed helpful. And after looking a bit at difflib, I
> think the "intuitive" deltas computed are very nice for showing users
> the difference between one revision and the previous one.

Indeed, I dreamt up the SequenceMatcher algorithm in the early 80's *for* a
source-control system, precisely because "minimal diffs" were too often
unintuitive:  patches are read more often by people than by source control
systems, and saving a few bytes wasn't worth it even back then.

> However, I think it would be nice to be able to compute minimal deltas
> too, to increase the compression of the repository, and so I started
> wondering...

Quantify first.  Except in pathological cases, SequenceMatcher diffs are
usually size-competitive with minimal diffs, they simply don't synch up on
ridiculously accidental matches.  In non-pathological cases, a synch tool
generally has *many* potential matches to choose from, and most equally good
in the end (wrt total diff size); SequenceMatcher strives to pick the pair
with maximal matching local context, which very often corresponds to how the
source was actually edited.

> Would it be interesting to include the Levenshtein algorithm in difflib
> as well?

Not to me <wink>.

> (Or possibly add a separate module or something?)

difflib may turn into a pkg someday, if enough stuff gets added to it.  In
the meantime, other diff methods "belong" there.

> The basic algorithm is very small/simple[1], so adding it
> wouldn't be much of a burden to the standard lib, would it? (And PHP
> has it, so why shouldn't we? <wink>)

An industrial-strength version isn't small or simple -- study your Unix diff
source code for why.