[Python-Dev] Re: heaps

Tim Peters tim.one@comcast.net
Mon, 05 May 2003 22:35:28 -0400


[David Eppstein, on the bar-raising behavior of

     person > q[0]
]

> Good point.  If any permutation of the input sequence is equally likely,
> and you're selecting the best k out of n items, the expected number of
> times you have to hit the data structure in your heapq solution
> is roughly k ln n, so the total expected time is O(n + k log k log n),
> with a really small constant factor on the O(n) term.  The sorting
> solution I suggested has total time O(n log k), and even though sorting
> is built-in and fast it can't compete when k is small.   Random pivoting
> is O(n + k), but with a larger constant factor, so your heapq solution
> looks like a winner.

In real Python Life, it's the fastest way I know (depending ...).
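
For readers without the earlier messages in the thread, the heap-based
scheme under discussion amounts to something like the sketch below.  This
is a reconstruction, not the micro-optimized code that was timed; q[0] is
the "bar" mentioned above, and a new item touches the heap only when it
beats the bar:

import heapq

def k_largest(seq, k):
    # q is a min-heap holding the k largest items seen so far,
    # so q[0] is the smallest of the current best k -- the bar.
    q = []
    for item in seq:
        if len(q) < k:
            heapq.heappush(q, item)
        elif item > q[0]:
            # Only a bar-raising item touches the heap at all.
            heapq.heapreplace(q, item)
    q.sort()
    q.reverse()
    return q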

> For fairness, it might be interesting to try another run of your test
> in which the input sequence is sorted in increasing order rather
> than random.

Comparing the worst case of one against the best case of the other isn't my
idea of fairness <wink>, but sure.  On the best-1000 of a million floats
test, and sorting the floats first, the heap method ran about 30x slower
than on random data, and the sort method ran significantly faster than on
random data (a factor of 1.3x faster).  OTOH, if I undo my speed tricks and
call a function in the sort method (instead of doing it all micro-optimized
inline), that slows the sort method by a bit over a factor of 2.

> I.e., replace the random generation of seq by
>     seq = range(M)
> I'd try it myself, but I'm still running python 2.2 and haven't
> installed heapq.  I'd have to know more about your application to
> have an idea whether the sorted or randomly-permuted case is more
> representative.

Of course -- so would I <wink>.

Here's a surprise:  I coded a variant of the quicksort-like partitioning
method, at the bottom of this mail.  On the largest-1000 of a million
random-float case, times were remarkably steady across trials (i.e., using a
different set of a million random floats each time):

heapq                    0.96 seconds
sort (micro-optimized)   3.4  seconds
KBest (below)            2.6  seconds

The KBest code creates new lists with wild abandon.  I expect it does better
than the sort method anyway because it gets to exploit its own form of
"raise the bar" behavior as more elements come in.  For example, on the
first run, len(buf) exceeded 3000 only 14 times, and the final pivot value
each time is used by put() as an "ignore the input unless it's bigger than
that" cutoff:

pivoted w/ 0.247497558554
pivoted w/ 0.611006884768
pivoted w/ 0.633565558936
pivoted w/ 0.80516673256
pivoted w/ 0.814304890889
pivoted w/ 0.884660572175
pivoted w/ 0.89986744075
pivoted w/ 0.946575251872
pivoted w/ 0.980386533221
pivoted w/ 0.983743795382
pivoted w/ 0.992381911217
pivoted w/ 0.994243625292
pivoted w/ 0.99481443021
pivoted w/ 0.997044443344

The already-sorted case is also a bad case for this method, because then the
pivot is never big enough to trigger the early exit in put().


def split(seq, pivot):
    # Partition seq into the elements <, ==, and > pivot.
    lt, eq, gt = [], [], []
    lta, eqa, gta = lt.append, eq.append, gt.append
    for x in seq:
        c = cmp(x, pivot)
        if c < 0:
            lta(x)
        elif c:
            gta(x)
        else:
            eqa(x)
    return lt, eq, gt

# KBest(k, minusinf) remembers the largest k objects
# from a sequence of objects passed one at a time to
# put().  minusinf must be smaller than any object
# passed to put().  After feeding in all the objects,
# call get() to retrieve a list of the k largest (or
# as many as were passed to put(), if put() was called
# fewer than k times).

class KBest(object):
    __slots__ = 'k', 'buflim', 'buf', 'cutoff'

    def __init__(self, k, minusinf):
        self.k = k
        self.buflim = 3*k
        self.buf = []
        self.cutoff = minusinf

    def put(self, obj):
        if obj <= self.cutoff:
            return

        buf = self.buf
        buf.append(obj)
        if len(buf) <= self.buflim:
            return

        # Reduce len(buf) by at least one, by retaining
        # at least k, and at most len(buf)-1, of the
        # largest objects in buf.
        from random import choice
        sofar = []
        k = self.k
        while len(sofar) < k:
            pivot = choice(buf)
            # Note that buf is rebound to the "< pivot" part here;
            # if another pass is needed, the next pivot comes from it.
            buf, eq, gt = split(buf, pivot)
            sofar.extend(gt)
            if len(sofar) < k:
                sofar.extend(eq[:k - len(sofar)])

        self.buf = sofar
        self.cutoff = pivot

    def get(self):
        from random import choice
        buf = self.buf
        k = self.k
        if len(buf) <= k:
            return buf

        # Retain only the k largest.
        sofar = []
        needed = k
        while needed:
            pivot = choice(buf)
            lt, eq, gt = split(buf, pivot)
            if len(gt) <= needed:
                sofar.extend(gt)
                needed -= len(gt)
                if needed:
                    takefromeq = min(len(eq), needed)
                    sofar.extend(eq[:takefromeq])
                    needed -= takefromeq
                # If we still need more, they have to
                # come out of things < pivot.
                buf = lt
            else:
                # gt alone is too large.
                buf = gt

        assert len(sofar) == k
        self.buf = sofar
        return sofar
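
A minimal driver for the class above, roughly matching the
largest-1000-of-a-million-random-floats case described earlier (the
harness itself is only an illustration, not the code that was timed;
-1.0 works as minusinf because random() returns values in [0, 1)):

if __name__ == '__main__':
    import random

    M, K = 10**6, 1000
    seq = [random.random() for i in range(M)]

    kb = KBest(K, -1.0)   # -1.0 is below anything random() can return
    for x in seq:
        kb.put(x)
    best = kb.get()

    # Sanity check against sorting everything (Python 2, like the
    # class above, which relies on cmp()).
    expected = seq[:]
    expected.sort()
    got = best[:]
    got.sort()
    assert got == expected[-K:]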