What is the most efficient way to compare similar contents in two lists?

Chris Angelico rosuav at gmail.com
Mon Jun 13 14:11:14 EDT 2011


On Tue, Jun 14, 2011 at 3:58 AM, Dan Stromberg <drsalists at gmail.com> wrote:
>
> This is a beautiful solution, and yet I feel compelled to mention that it
> disregards duplicates within a given list.  If you need duplicate
> detection/differencing, it's better to sort each list and then use an
> algorithm similar to the merge step of mergesort.
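The merge-step idea Dan describes might look roughly like this (a sketch, not code from the thread; the helper name is mine and it assumes the elements are sortable):

```python
def sorted_diff(a, b):
    """Duplicate-aware difference of two lists, merge-step style.

    Returns (only_in_a, only_in_b), each preserving multiplicity.
    """
    a, b = sorted(a), sorted(b)
    i = j = 0
    only_a, only_b = [], []
    # Walk both sorted lists in lockstep, like the merge of mergesort.
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            only_a.append(a[i])
            i += 1
        else:
            only_b.append(b[j])
            j += 1
    # Whatever remains in either list has no counterpart in the other.
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return only_a, only_b
```

Unlike a set comparison, a second copy of the same value in one list is reported if the other list holds only one copy.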

The original example used the 'in' operator, which is effectively a
set operation anyway. As written, it would count all duplicates in the
source headers but only count one in the target. I'm guessing that the
context mandates no duplicates (say, if they're dictionary keys or
something - let's assume float("nan") is not a key).
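That duplicate-counting asymmetry is easy to see in a toy version of the membership test (the header names here are invented for illustration):

```python
source = ["Host", "Accept", "Accept", "User-Agent"]
target = ["Host", "Accept", "Cookie"]

# 'in' against a list scans it each time; converting the target to a
# set makes each membership test O(1) on average.
target_set = set(target)
matches = [h for h in source if h in target_set]
# Both "Accept" entries in source are counted, but any duplicates in
# target would have collapsed into the set.
```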

> Using sets as above is O(n), while the sorting version is O(nlogn) usually.
> O(n) is better than O(nlogn).
>
> And of course, the version based on sorting assumes order doesn't matter.

If order and duplicates matter, then you want a completely different
diff. I wrote one a while back, but not in Python. The algorithm went
something like this:

* Start with pointers to the beginnings of both lists.
* See if the current string is identical in each list; if so, increment
both pointers and iterate.
* Once a difference is found, try to find a re-match by incrementing
pointer #1 until the two match. If the end of the first list is
reached instead, emit the current line of #2 as a deletion and point
pointer #1 back to just after where the scan started.
* On finding a re-match, emit all list #1 lines from the first
difference to the re-match as insertions.

(Since that was for comparing a source file with a user-supplied
modified source - a sort of diff/patch - a re-match was defined by N
consecutive matching lines, but in this, a re-match can simply be two
identical strings.)
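Those steps, with a re-match being a single equal string, could be sketched like this (function name and output format are my own, not the original implementation):

```python
def simple_diff(modified, original):
    """Pointer-walk diff: list #1 is `modified`, list #2 is `original`.

    Emits ('=', line) for common lines, ('+', line) for insertions
    from `modified`, and ('-', line) for deletions from `original`.
    """
    out = []
    i = j = 0  # i walks list #1 (modified), j walks list #2 (original)
    while i < len(modified) and j < len(original):
        if modified[i] == original[j]:
            out.append(('=', modified[i]))
            i += 1
            j += 1
            continue
        # Difference found: scan list #1 forward for a re-match
        # with list #2's current line.
        try:
            k = modified.index(original[j], i)
        except ValueError:
            # End of list #1 reached: #2's line was deleted, and
            # pointer #1 stays just after where the scan started.
            out.append(('-', original[j]))
            j += 1
        else:
            # Re-match: everything skipped in list #1 is an insertion.
            out.extend(('+', line) for line in modified[i:k])
            i = k
    # Leftovers on either side are pure insertions or deletions.
    out.extend(('+', line) for line in modified[i:])
    out.extend(('-', line) for line in original[j:])
    return out
```

A single-line re-match makes this greedy, so it can mispair repeated lines; that is why the original used N consecutive matching lines.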

But for the situation given, I'm assuming that a simpler solution suffices.

Chris Angelico



More information about the Python-list mailing list