space-efficient top-N algorithm

Carlos Ribeiro cribeiro at mail.inet.com.br
Sun Feb 9 21:13:40 EST 2003


On Sunday 09 February 2003 12:35, Alex Martelli wrote:
> Christos TZOTZIOY Georgiou wrote:
> > On Sun, 09 Feb 2003 18:56:06, rumours say that David Garamond
> >
> > <lists at zara.6.isreserved.com> might have written:
> >>Is there a more space-efficient algorithm to produce a list of top N
> >>recurring items from a large collection? I've thought about sorting the
> >>large log files externally, but reluctant to do this because of the
> >>large size of the files and the need to frequently produce reports while
> >>new logs arrive.
> >
> > A solution using heapq (for ordering while reading) + sets (for
> > uniqueness) might be appropriate.  You just have to del heap[N:] once in
> > a while to avoid excess memory consumption.
>
> Either I misread David's problem, or you did.  What I'm reading
> is that David wants to COUNT how many times each item occurs (and
> keep the N items that occur most often).  How would "ordering while
> reading" etc help him?

I thought about a similar solution - using a heap to help with the process - 
but ended up deleting it from my own reply. Part of the problem is that the 
heap would end up being as big as the original structure, which means that no 
space would be saved.
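For comparison, the straightforward deterministic approach being discussed - 
count every distinct item, then select the winners - can be sketched in a few 
lines (the function name is mine; the point is that the counts dict grows 
with the number of distinct URLs, which is exactly the space cost at issue):

```python
import heapq

def top_n(items, n):
    """Deterministic top-N: count every distinct item, then take
    the n most frequent.  The counts dict holds one entry per
    distinct item, so memory grows with the number of distinct
    items - this is the space cost under discussion."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    # nlargest keeps only n entries while scanning the dict
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])
```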

It *is* possible, however, to come up with a clever implementation that 
*statistically* selects the top N URLs, using two heaps as auxiliary 
structures. 

a) use two heaps, one for 'short term memory' and the other for 'long term 
memory'. 

b) read the URLs into the 'short term' memory heap, incrementing each URL's 
counter, in such a way that the heap stays ordered with the most frequent 
element at its top. The short term heap holds a limited number of entries, 
and the least frequent entries are discarded when the heap fills up.

c) every time you increment the URL counter for a given record, the long term 
heap must be updated accordingly: 

- if the URL is present in the long term heap, update its value (this is not 
needed if the object references are shared); 
- if the URL is not present in the long term heap, check whether its counter 
value is greater than that of the least frequent element in the long term 
heap. If it is, then include the URL in the long term heap (the least 
frequent element may need to be discarded from the long term heap to limit 
its growth).

d) whenever you encounter a URL that is not present in the 'short term' heap, 
check whether it was previously present in the long term heap. If so, 
retrieve it from the long term heap and keep a copy in the short term heap.

The data structures used must be accessible in two ways: as a heap, keeping 
the data ordered; and as a dict, to allow retrieval by URL.
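To make steps a) through d) concrete, here is a rough sketch of one possible 
interpretation. The class name and the size parameters are my own inventions, 
and for clarity both 'memories' are plain dicts whose eviction scans for the 
minimum with min() instead of maintaining a real heap; a production version 
would pair each dict with an actual heap, as described above.

```python
import heapq

class TopNEstimator:
    """Statistical top-N sketch (one interpretation of the two-heap
    scheme).  Both 'memories' are bounded dicts mapping URL -> count;
    eviction scans for the minimum for clarity, but the combined
    dict-plus-ordering access pattern is the same."""

    def __init__(self, short_size, long_size):
        self.short_size = short_size
        self.long_size = long_size
        self.short = {}   # short term memory: recently seen URLs
        self.long = {}    # long term memory: best counts seen so far

    def _evict_least(self, table):
        # discard the least frequent entry to bound the table's size
        victim = min(table, key=table.get)
        del table[victim]

    def add(self, url):
        if url not in self.short:
            if len(self.short) >= self.short_size:
                self._evict_least(self.short)
            # step d: restore any count remembered in long term memory
            self.short[url] = self.long.get(url, 0)
        self.short[url] += 1
        count = self.short[url]
        # step c: keep long term memory in sync with the new count
        if len(self.long) < self.long_size or url in self.long:
            self.long[url] = count
        elif count > min(self.long.values()):
            self.long[url] = count
            self._evict_least(self.long)

    def top(self, n):
        # report the n most frequent URLs seen so far (approximate)
        return heapq.nlargest(n, self.long.items(), key=lambda kv: kv[1])
```

With heaps large enough that nothing is evicted, the counts are exact; once 
eviction starts, the result becomes the statistical approximation described 
below.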

This algorithm is not deterministic; it cannot guarantee that you get the 'top 
N' elements. Its accuracy depends on the sizes of the heaps, and also on the 
distribution of the data. But it may be worth trying in some situations, 
especially if a deterministic approach turns out not to be viable.


Carlos Ribeiro
cribeiro at mail.inet.com.br
