Nasty gotcha/bug in heapq.nlargest/nsmallest

Thu May 15 02:44:42 EDT 2008

En Wed, 14 May 2008 23:47:56 -0300, George Sakkis <george.sakkis at gmail.com> escribió:

> I spent several hours debugging some bogus data results that turned
> out to be caused by the fact that heapq.nlargest doesn't respect rich
> comparisons:
>
>     import heapq
>     import random
>
>     class X(object):
>         def __init__(self, x): self.x=x
>         def __repr__(self): return 'X(%s)' % self.x
>         if True:
>             # this succeeds
>             def __cmp__(self, other): return cmp(self.x , other.x)
>         else:
>             # this fails
>             def __lt__(self, other): return self.x < other.x
>
>     s = [X(i) for i in range(10)]
>     random.shuffle(s)
>
>     s1 = heapq.nlargest(5, s)
>     s2 = sorted(s, reverse=True)[:5]
>     assert s1 == s2, (s,s1,s2)
>
>     s1 = heapq.nsmallest(5, s)
>     s2 = sorted(s)[:5]
>     assert s1 == s2, (s,s1,s2)
>
>
> According to the docs, nlargest is equivalent to: "sorted(iterable,
> key=key, reverse=True)[:n]" and similarly for nsmallest. So that must
> be at least a documentation bug, if not an implementation one.

Your class is not properly defined. sorted() only uses < to compare items, but this fact is NOT documented as far as I can tell. So you are abusing this implementation detail by not defining the other comparison operators, and the X class has a rather strange behavior:

py> X(1)<X(2)
True
py> X(1)<=X(2)
False  # !!!
py> X(1)>X(2)
False
py> X(1)==X(1)
False  # !!!

If you write all the six rich comparison methods (__lt__, __le__, etc) nlargest and nsmallest work fine.
If your objects are fully comparable (like numbers; that is, given A and B, one and only one of these holds: A<B, A==B, A>B) the easiest way is to write a helper function (like the old __cmp__) and implement the rich comparison methods using it. Those six methods could be defined in a mixin class.

-- 
Gabriel Genellina