space-efficient top-N algorithm
Alex Martelli
aleax at aleax.it
Sun Feb 9 09:35:31 EST 2003
Christos TZOTZIOY Georgiou wrote:
> On Sun, 09 Feb 2003 18:56:06, rumours say that David Garamond
> <lists at zara.6.isreserved.com> might have written:
>
>>Is there a more space-efficient algorithm to produce a list of top N
>>recurring items from a large collection? I've thought about sorting the
>>large log files externally, but reluctant to do this because of the
>>large size of the files and the need to frequently produce reports while
>>new logs arrive.
>
> A solution using heapq (for ordering while reading) + sets (for
> uniqueness) might be appropriate. You just have to del heap[N:] once in
> a while to avoid excess memory consumption.
Either I misread David's problem, or you did. What I'm reading
is that David wants to COUNT how many times each item occurs (and
keep the N items that occur most often). How would "ordering while
reading" etc help him?
Alex
More information about the Python-list
mailing list