What about an EXPLICIT naming scheme for built-ins?

Tim Peters tim.peters at gmail.com
Sun Sep 5 17:52:35 EDT 2004


[Alex Martelli]
>> Note that Raymond Hettinger added efficient nlargest() and
>> nsmallest() functions to heapq for 2.4.  Because heapq implements
>> a min-heap, only nlargest() is really "natural" for this module, but
>> nsmallest() does what it can ...

[Tim Peters]
> Hmmm, am I reading this wrong...?  getting the min N times seems
> easy, wouldn't that be nsmallest...?

You're reading it right, but it's a bit paradoxical.  Suppose you want
the N extremes from an iterable of length M, and N is much smaller
than M, and you have a min-heap to work with.

If the extremes you want are the N largest, a min-heap of size N (not
M!) suffices to get them.  See the 2.4 code for details, but it's a
simple matter of keeping the N largest seen so far.  A min-heap is
perfect for that, because the smallest of the largest N seen so far is
the smallest element of the current heap, and "the smallest" is what a
min-heap delivers most efficiently.  So the main loop looks like

    sol = result[0]         # sol --> smallest of the nlargest
    for elem in it:
        if elem <= sol:
            continue        # elem can't displace anything in the heap
        _heapreplace(result, elem)   # kick out the current smallest
        sol = result[0]

where "result" contains (just) N elements, and the iterable doesn't
need to be materialized into a giant list first.  Even better, under
many input conditions, "elem <= sol" will usually be true, so the
inner loop is very fast.  Its worst case is when the iterable is in
sorted increasing order (and then "elem <= sol" is never true, and we
have to do _heapreplace each time).
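
For the curious, here's a minimal self-contained sketch of the same
idea (not the actual 2.4 source -- the name top_n is made up for
illustration):

    import heapq
    from itertools import islice

    def top_n(n, iterable):
        it = iter(iterable)
        result = list(islice(it, n))    # the first n elements, any order
        if not result:
            return result
        heapq.heapify(result)           # min-heap of the n largest seen so far
        sol = result[0]                 # sol --> smallest of the n largest
        for elem in it:
            if elem <= sol:
                continue
            heapq.heapreplace(result, elem)
            sol = result[0]
        result.sort(reverse=True)       # largest first, like nlargest()
        return result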

If instead you want the N smallest, the obvious way is a heap of
(much larger) size M: materialize the input iterable into a giant
list (of size M), heapify() that, and then heappop() N times.  That's
less efficient on all counts.
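
In code, that obvious approach is just (again a sketch, with a
made-up name):

    import heapq

    def nsmallest_via_big_heap(n, iterable):
        data = list(iterable)           # materialize all M elements
        heapq.heapify(data)             # O(M)
        return [heapq.heappop(data) for i in range(min(n, len(data)))]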

So, in fact, if N is much less than M, heapq.nsmallest() actually uses
a different approach -- one that doesn't use a heap at all!
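
One heap-free way to do that -- I won't swear it's exactly what 2.4
ships, but it's the same flavor -- is to keep a small sorted list of
the N smallest seen so far, using bisect:

    import bisect
    from itertools import islice

    def nsmallest_sketch(n, iterable):
        it = iter(iterable)
        result = sorted(islice(it, n))
        if not result:
            return result
        los = result[-1]                # los --> largest of the n smallest
        for elem in it:
            if los <= elem:
                continue                # elem can't make it into the result
            bisect.insort(result, elem)
            result.pop()                # drop the new largest
            los = result[-1]
        return result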

>> ... what it can.  Neither works as a generator.  There are elegant
>> ways to implement mergesort as a cascade of (O(N)!) recursive
>> generators, but they're horrendously slow in comparison.

> Ouch, I _DID_ know I had to traipse more carefully through 2.4's
> sources... I had missed these!!!  Thanks, they're going into the
> appropriate recipes (in the Searching and Sorting chapter -- whose
> introduction you'll no doubt have to vastly rewrite, btw...) ASAP.

I look forward to it <wink>.  Just to be clear, there is no code in
2.4 to do mergesort via recursive generators.  nlargest() and
nsmallest() are there, and they're practical.
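
For concreteness, both take (n, iterable); nlargest() returns its
result largest-first, nsmallest() smallest-first:

    >>> import heapq
    >>> heapq.nlargest(3, [5, 1, 9, 4, 7])
    [9, 7, 5]
    >>> heapq.nsmallest(3, [5, 1, 9, 4, 7])
    [1, 4, 5]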

> heapq _should_ have reversed= like list.sort, to make a maxheap
> almost as easy as a minheap (and key= too of course), but that's
> another fight^H^H^H^H^H debate...

At that point I'd regret not making these things classes.  They were
meant to be simple.  Maybe fancier stuff can go in the new-in-2.4
collections module.  Again thanks to Raymond, that module has a new
C-coded deque type with constant-time insert and remove at both ends,
regardless of access pattern (no "bad cases") -- and, e.g.,
Queue.Queue uses collections.deque as its container type now.
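
For instance:

    >>> from collections import deque
    >>> d = deque([1, 2, 3])
    >>> d.append(4)         # O(1) at the right end
    >>> d.appendleft(0)     # O(1) at the left end
    >>> d.popleft(), d.pop()
    (0, 4)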

... [timing] ...

> And what about the heapsorts...?  It probably depends on what you
> mean by 'usually', but...:
>
> with s.py being:
> import heapq, random
> 
> def top3_heapq(s):
>    return heapq.nsmallest(3, s)
> 
> def top3_sort(s):
>    return sorted(s)[:3]
> 
> x1 = range(1000)
> random.shuffle(x1)
> 
> on my old-ish iBook with Python 2.4 alpha 3 I see...:
> 
> kallisti:~/cb/cha/sys/cba alex$ python ~/cb/timeit.py -s'import s' 's.top3_heapq(s.x1)'
> 1000 loops, best of 3: 656 usec per loop
> kallisti:~/cb/cha/sys/cba alex$ python ~/cb/timeit.py -s'import s' 's.top3_sort(s.x1)'
> 100 loops, best of 3: 4e+03 usec per loop

The heapq methods are practical (as you just demonstrated).  The
recursive-generator mergesorts really aren't.  Besides just the
overhead of Python, they've got the overhead of creating ~= len(list)
generators, suspending and resuming them incessantly.  By the time the
list is big enough to overcome all those time overheads, chances are
you'll start running out of RAM.
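
If you want to see the flavor of the thing anyway, here's a rough
sketch (nothing like this ships in 2.4, and it wants a Python with
the next() builtin); note how a list of length M ends up driving on
the order of M suspended generators at once:

    def merge(a, b):
        # Merge two sorted iterators into one sorted stream.
        a, b = iter(a), iter(b)
        sentinel = object()
        x, y = next(a, sentinel), next(b, sentinel)
        while x is not sentinel and y is not sentinel:
            if x <= y:
                yield x
                x = next(a, sentinel)
            else:
                yield y
                y = next(b, sentinel)
        while x is not sentinel:
            yield x
            x = next(a, sentinel)
        while y is not sentinel:
            yield y
            y = next(b, sentinel)

    def msort(seq):
        # Each call spawns two recursive msort() generators plus a
        # merge() generator, all suspended until values are pulled.
        if len(seq) <= 1:
            for x in seq:
                yield x
        else:
            mid = len(seq) // 2
            for x in merge(msort(seq[:mid]), msort(seq[mid:])):
                yield x

    # list(msort([3, 1, 4, 1, 5, 9, 2, 6])) == [1, 1, 2, 3, 4, 5, 6, 9]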


