What about an EXPLICIT naming scheme for built-ins?

Tim Peters tim.peters at gmail.com
Sun Sep 5 17:52:35 EDT 2004


[Alex Martelli]
>> Note that Raymond Hettinger added efficient nlargest() and
>> nsmallest() functions to heapq for 2.4.  Because heapq implements
>> a min-heap, only nlargest() is really "natural" for this module, but
>> nsmallest() does what it can ...

[Tim Peters]
> Hmmm, am I reading this wrong...?  getting the min N times seems
> easy, wouldn't that be nsmallest...?

You're reading it right, but it's a bit paradoxical.  Suppose you want
the N extremes from an iterable of length M, and N is much smaller
than M, and you have a min-heap to work with.

If the extremes you want are the N largest, a min-heap of size N (not
M!) suffices to get them.  See the 2.4 code for details, but it's a
simple matter of keeping the N largest seen so far.  A min-heap is
perfect for that, because the smallest of the largest N seen so far is
the smallest element of the current heap, and "the smallest" is what a
min-heap delivers most efficiently.  So the main loop looks like

    sol = result[0]         # sol --> smallest of the nlargest
    for elem in it:
        if elem <= sol:
            continue        # elem can't displace anything in the heap
        _heapreplace(result, elem)   # kick out the current smallest
        sol = result[0]

where "result" contains (just) N elements, and the iterable doesn't
need to be materialized into a giant list first.  Even better, under
many input conditions, "elem <= sol" will usually be true, so the
inner loop is very fast.  Its worst case is when the iterable is in
sorted increasing order (and then "elem <= sol" is never true, and we
have to do _heapreplace each time).
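
For the curious, here's a minimal self-contained sketch of the same
idea (not the actual 2.4 source -- the name top_n is made up for
illustration):

    import heapq
    from itertools import islice

    def top_n(n, iterable):
        it = iter(iterable)
        result = list(islice(it, n))    # the first n elements, any order
        if not result:
            return result
        heapq.heapify(result)           # min-heap of the n largest seen so far
        sol = result[0]                 # sol --> smallest of the n largest
        for elem in it:
            if elem <= sol:
                continue
            heapq.heapreplace(result, elem)
            sol = result[0]
        result.sort(reverse=True)       # largest first, like nlargest()
        return result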

If instead you want the N smallest, the obvious way is a heap of
(much larger) size M: materialize the input iterable into a giant
list (of size M), heapify() that, and then heappop() N times.  That's
less efficient on all counts.
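
In code, that obvious approach is just (again a sketch, with a
made-up name):

    import heapq

    def nsmallest_via_big_heap(n, iterable):
        data = list(iterable)           # materialize all M elements
        heapq.heapify(data)             # O(M)
        return [heapq.heappop(data) for i in range(min(n, len(data)))]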

So, in fact, if N is much less than M, heapq.nsmallest() actually uses
a different approach -- one that doesn't use a heap at all!
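
One heap-free way to do that -- I won't swear it's exactly what 2.4
ships, but it's the same flavor -- is to keep a small sorted list of
the N smallest seen so far, using bisect:

    import bisect
    from itertools import islice

    def nsmallest_sketch(n, iterable):
        it = iter(iterable)
        result = sorted(islice(it, n))
        if not result:
            return result
        los = result[-1]                # los --> largest of the n smallest
        for elem in it:
            if los <= elem:
                continue                # elem can't make it into the result
            bisect.insort(result, elem)
            result.pop()                # drop the new largest
            los = result[-1]
        return result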

>> ... what it can.  Neither works as a generator.  There are elegant
>> ways to implement mergesort as a cascade of (O(N)!) recursive
>> generators, but they're horrendously slow in comparison.

> Ouch, I _DID_ know I had to traipse more carefully through 2.4's
> sources... I had missed these!!!  Thanks, they're going into the
> appropriate recipes (in the Searching and Sorting chapter -- whose
> introduction you'll no doubt have to vastly rewrite, btw...) ASAP.

I look forward to it <wink>.  Just to be clear, there is no code in
2.4 to do mergesort via recursive generators.  nlargest() and
nsmallest() are there, and they're practical.
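
For concreteness, both take (n, iterable); nlargest() returns its
result largest-first, nsmallest() smallest-first:

    >>> import heapq
    >>> heapq.nlargest(3, [5, 1, 9, 4, 7])
    [9, 7, 5]
    >>> heapq.nsmallest(3, [5, 1, 9, 4, 7])
    [1, 4, 5]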

> heapq _should_ have reversed= like list.sort, to make a maxheap
> almost as easy as a minheap (and key= too of course), but that's
> another fight^H^H^H^H^H debate...

At that point I'd regret not making these things classes.  They were
meant to be simple.  Maybe fancier stuff can go in the new-in-2.4
collections module.  Again thanks to Raymond, that module has a new
C-coded deque type with constant-time insert and remove at both ends,
regardless of access pattern (no "bad cases") -- and, e.g.,
Queue.Queue uses collections.deque as its container type now.
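
For instance:

    >>> from collections import deque
    >>> d = deque([1, 2, 3])
    >>> d.append(4)         # O(1) at the right end
    >>> d.appendleft(0)     # O(1) at the left end
    >>> d.popleft(), d.pop()
    (0, 4)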

... [timing] ...

> And what about the heapsorts...?  It probably depends on what you
> mean by 'usually', but...:
>
> with s.py being:
> import heapq, random
> 
> def top3_heapq(s):
>    return heapq.nsmallest(3, s)
> 
> def top3_sort(s):
>    return sorted(s)[:3]
> 
> x1 = range(1000)
> random.shuffle(x1)
> 
> on my old-ish iBook with Python 2.4 alpha 3 I see...:
> 
> kallisti:~/cb/cha/sys/cba alex$ python ~/cb/timeit.py -s'import s' 's.top3_heapq(s.x1)'
> 1000 loops, best of 3: 656 usec per loop
> kallisti:~/cb/cha/sys/cba alex$ python ~/cb/timeit.py -s'import s' 's.top3_sort(s.x1)'
> 100 loops, best of 3: 4e+03 usec per loop

The heapq methods are practical (as you just demonstrated).  The
recursive-generator mergesorts really aren't.  Besides just the
overhead of Python, they've got the overhead of creating ~= len(list)
generators, suspending and resuming them incessantly.  By the time the
list is big enough to overcome all those time overheads, chances are
you'll start running out of RAM.
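
If you want to see the flavor of the thing anyway, here's a rough
sketch (nothing like this ships in 2.4, and it wants a Python with
the next() builtin); note how a list of length M ends up driving on
the order of M suspended generators at once:

    def merge(a, b):
        # Merge two sorted iterators into one sorted stream.
        a, b = iter(a), iter(b)
        sentinel = object()
        x, y = next(a, sentinel), next(b, sentinel)
        while x is not sentinel and y is not sentinel:
            if x <= y:
                yield x
                x = next(a, sentinel)
            else:
                yield y
                y = next(b, sentinel)
        while x is not sentinel:
            yield x
            x = next(a, sentinel)
        while y is not sentinel:
            yield y
            y = next(b, sentinel)

    def msort(seq):
        # Each call spawns two recursive msort() generators plus a
        # merge() generator, all suspended until values are pulled.
        if len(seq) <= 1:
            for x in seq:
                yield x
        else:
            mid = len(seq) // 2
            for x in merge(msort(seq[:mid]), msort(seq[mid:])):
                yield x

    # list(msort([3, 1, 4, 1, 5, 9, 2, 6])) == [1, 1, 2, 3, 4, 5, 6, 9]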


