space-efficient top-N algorithm

Sun Feb 9 17:24:48 EST 2003

David Garamond wrote:
> I am processing referrer log files with Python. 
<<SNIP>>
> Is there a more space-efficient algorithm to produce a list of top N 
> recurring items from a large collection? I've thought about sorting the 
> large log files externally, but reluctant to do this because of the 
> large size of the files and the need to frequently produce reports while 
> new logs arrive.
> 

I don't know what you are running on, but have you considered buying more memory?
I DO mean to be helpful. Memory IS cheap especially for PCs and also for workstations if 
you don't buy it from the WS manufacturer.

Failing that, If you want further processing on just the top N items, and are using Unix, 
you could use the pipeline of sorts and uniq commands posted before to find the top N, 
(fast - space efficient C coded utilities), then process the log again ignoring all but 
the top N records using Python (for flexibility).

Paddy.