space-efficient top-N algorithm

Alex Martelli aleax at aleax.it
Sun Feb 9 09:35:31 EST 2003


Christos TZOTZIOY Georgiou wrote:

> On Sun, 09 Feb 2003 18:56:06, rumours say that David Garamond
> <lists at zara.6.isreserved.com> might have written:
> 
>>Is there a more space-efficient algorithm to produce a list of top N
>>recurring items from a large collection? I've thought about sorting the
>>large log files externally, but reluctant to do this because of the
>>large size of the files and the need to frequently produce reports while
>>new logs arrive.
> 
> A solution using heapq (for ordering while reading) + sets (for
> uniqueness) might be appropriate.  You just have to del heap[N:] once in
> a while to avoid excess memory consumption.

Either I misread David's problem, or you did.  What I'm reading
is that David wants to COUNT how many times each item occurs (and
keep the N items that occur most often).  How would "ordering while
reading" etc help him?


Alex





More information about the Python-list mailing list