Keeping track of the N largest values

Peter Otten __peter__ at web.de
Sun Dec 26 05:36:40 EST 2010


Roy Smith wrote:

> In article <Xns9E59A44D7CC49duncanbooth at 127.0.0.1>,
>  Duncan Booth <duncan.booth at invalid.invalid> wrote:
> 
>> Roy Smith <roy at panix.com> wrote:
>> 
>> > 
>> > I'm processing a stream of N numbers and want to keep track of the K
>> > largest.  There's too many numbers in the stream (i.e. N is too large)
>> > to keep in memory at once.  K is small (100 would be typical).
>> > ...
>> > Is there a better way to do this, either from a theoretical running
>> > time point of view, or just a nicer way to code this in Python?
>> 
>> How about:
>> 
>>   from heapq import nlargest
>>   top = nlargest(K, input())
>> 
>> It uses a heap so avoids completely resorting each time.
> 
> Hmmm, that looks like it would solve the problem as stated, but there's
> another twist.  In addition to finding the K largest values, I *also*
> need to do some other processing on all the values (which I didn't show
> in the original example, incorrectly thinking it not germane to my
> question).  The problem with nlargest() is that it doesn't give me a
> hook to do that.

If Paul's solution doesn't suffice -- the heapq module has the building 
blocks for a custom solution, e. g.:

import heapq
from functools import partial

class NLargest(object):
    def __init__(self, n):
        self.n = n
        self.heap = []
    def tally(self, item):
        heap = self.heap
        if len(heap) >= self.n:
            heapq.heappushpop(heap, item)
            self.tally = partial(heapq.heappushpop, heap)
        else:
            heapq.heappush(heap, item)
    def __str__(self):
        return str(sorted(self.heap, reverse=True))

if __name__ == "__main__":
    import random
    items = range(100)
    random.shuffle(items)
    accu = NLargest(10)
    for item in items:
        accu.tally(item)
        print item, accu

 
> PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
> proportional to n, and not to the iterator length.  That's the only
> reasonable conclusion, but the docs don't actually come out and say it.

I would hope so.



More information about the Python-list mailing list