exercise: partition a list by equivalence

Fri Feb 18 18:21:10 EST 2005

John Lenton wrote:
> On Thu, Feb 17, 2005 at 03:46:20PM -0800, Xah Lee wrote:
> > here's another interesting algorithmic exercise, again from part of
a
> > larger program in the previous series.
> >
> > Here's the original Perl documentation:
> >
> > =pod
> >
> > merge($pairings) takes a list of pairs, each pair indicates the
> > sameness
> > of the two indexes. Returns a partitioned list of same indexes.
> >
> > For example, if the input is
> > merge( [ [1,2], [2,4], [5,6] ] );
> >
> > that means 1 and 2 are the same. 2 and 4 are the same. Therefore
> > 1==2==4. The result returned is
> >
> > [[4,2,1],[6,5]];
> >
> > (ordering of the returned list and sublists are not specified.)
> >
> > =cut
>
> in Python:
>
>     def merge(pairings):
>         rev = {}
>         res = []
>         for pairing in pairings:
>             first, second = pairing
>             has_first = first in rev
>             has_second = second in rev
Not robust in the face of input like:
[[1,1]]
or
[[1,2], [1,2]]
or
[[1,2], [2,1]]
or
[[1,2], [2,3], [3,1]]

needs "if first == second: continue" here

>             if has_first and has_second:

needs "if rev[first] == rev[second]: continue" here

>                 rev[first].extend(rev[second])
>                 rev[second][:] = []
>                 rev[second] = rev[first]
>             elif has_first:
>                 rev[first].append(second)
>                 rev[second] = rev[first]
>             elif has_second:
>                 rev[second].append(first)
>                 rev[first] = rev[second]
>             else:
>                 copy = [first, second]
>                 res.append(copy)

My reaction to the "magic" by which res grows was "omigod that's the
most horrible thing I've seen for quite a while" but there was worse to
come :-)

>                 rev[first] = rev[second] = copy
>         return filter(None, res)
>
> and in Perl:
>
[snip]

>
> Both versions are (IMHO) pretty clear (when the docstring/pod is
> included), and O(n) because dict/hash lookup and list
> appending/pushing is O(1) in both languages. Both versions can
> probably be tweaked for speed quite a bit, but I don't *think*
there's
> a better-than-O(n) algorithm for this.

List appending is O(1) only in the amortised sense. Extending is not
O(1) in any sense. Neither is the list comparison that is necessary for
robustness (using your data structure, that is).

You don't need to think. This problem has been extensively picked over
by folk who are a lot smarter than us, starting from 30 years ago.
Google for "disjoint set" and "union-find". One gets the impression
that the best possible algorithm would be slightly *worse* than O(n).

>
> Note that the Python version assumes that the pairs' elements are
> hashable; your example used numbers, so I thought it was pretty safe
> assumption. The Perl version has no such restriction.

In the real world, the objects under comparison (images, fingerprints,
customer records, etc) may not themselves be hashable and comparison
for equality may be expensive but a unique identifier would be assigned
to each object and that ID would of course be cheaply hashable and
comparable.