space-efficient top-N algorithm

Rene Pijlman reageer.in at de.nieuwsgroep
Sun Feb 9 07:32:15 EST 2003


David Garamond:
>I am processing referrer log files with Python. The log files are in 
>ASCII text and line-based, similar to a webserver log but produced by 
>our banner management (web bug, more precisely) software. Each line 
>contains, among other things, a URL, a timestamp, and an IP address. 
>One of the desired reports is the list of top 50 recurring URLs within 
>a certain period (e.g. monthly or weekly).

I've written a similar algorithm to process the log of a search
engine. It works just fine.

>What I currently do is just read the log files line by line, parse each 
>line into its elements, and store the URL string as the key of a 
>dictionary and the number of occurrences as the value. In the end I just 
>go back to the dictionary and print the keys in descending order of the 
>values.

Yep. Mine is a little more complicated, since I also want to
exclude duplicate searches from the same IP address within a
short period of time.
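
The basic version is roughly this (an untested sketch; the file
name and field position are made up, so adjust the parsing to
your log format):

    import heapq

    counts = {}
    for line in open('referrer.log'):
        url = line.split()[2][:80]   # hypothetical field; cut to 80 chars
        counts[url] = counts.get(url, 0) + 1

    # top 50 by count, without sorting the whole table
    for url, n in heapq.nlargest(50, counts.items(), key=lambda kv: kv[1]):
        print(n, url)

heapq.nlargest only keeps 50 candidates around, which is cheaper
than sorting the whole dictionary.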

>However, the number of URLs is large and some of the URLs are long 
>(>100 characters). My process grows to more than 100 MB in size. I 
>already cut the URLs to a max of 80 characters before entering them 
>into the dictionary, but it doesn't help much.

You could consider hashing the URL to a digest, using the md5 or
sha module for example. But then you would need to make a second
pass over the log file to translate the top-50 digests to their
URLs.
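
A sketch of the two-pass digest scheme (in today's Python the md5
and sha modules are folded into hashlib; extract_url is a
hypothetical stand-in for your line parser):

    import hashlib, heapq

    def extract_url(line):
        return line.split()[2]        # hypothetical field position

    # pass 1: count 16-byte digests instead of 80-character strings
    counts = {}
    for line in open('referrer.log'):
        d = hashlib.md5(extract_url(line).encode()).digest()
        counts[d] = counts.get(d, 0) + 1

    top = dict(heapq.nlargest(50, counts.items(), key=lambda kv: kv[1]))

    # pass 2: translate the winning digests back to their URLs
    names = {}
    for line in open('referrer.log'):
        url = extract_url(line)
        d = hashlib.md5(url.encode()).digest()
        if d in top and d not in names:
            names[d] = url

    for d, n in sorted(top.items(), key=lambda kv: -kv[1]):
        print(n, names[d])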

>Is there a more space-efficient algorithm to produce a list of top N 
>recurring items from a large collection? 

Another option would be to store your table in an external
database. This effectively trades memory for disk space and
runtime performance. Caching the entries for frequently occurring
URLs in memory is essential, I think.
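
As a sketch of the database route, with sqlite3 from the modern
standard library standing in for whatever external database you
use (table and file names are made up):

    import sqlite3

    db = sqlite3.connect('hits.db')
    db.execute('CREATE TABLE IF NOT EXISTS'
               ' hits (url TEXT PRIMARY KEY, n INTEGER)')
    for line in open('referrer.log'):
        url = extract_url(line)       # hypothetical parser, as above
        db.execute('INSERT OR IGNORE INTO hits VALUES (?, 0)', (url,))
        db.execute('UPDATE hits SET n = n + 1 WHERE url = ?', (url,))
    db.commit()

    for url, n in db.execute(
            'SELECT url, n FROM hits ORDER BY n DESC LIMIT 50'):
        print(n, url)

An in-memory dictionary acting as a cache for the hot URLs would
cut most of that UPDATE traffic.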

Perhaps these ideas can be combined:
- store digest->count in a dictionary in memory
- store the digest->URL mapping in an external database

This can be quite efficient when you add one bit to the data in
memory, to indicate that the URL for this digest has already
been written to the database: digest->(count,urlSaved).

That way, no database queries are needed while processing the log
file. The algorithm only inserts a record into the database when
a new URL is encountered, and you only query the database to
construct the report.
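
Putting the combined scheme together, roughly (again untested,
with hashlib and sqlite3 standing in and extract_url hypothetical):

    import hashlib, heapq, sqlite3

    db = sqlite3.connect('digests.db')
    db.execute('CREATE TABLE IF NOT EXISTS'
               ' urls (digest BLOB PRIMARY KEY, url TEXT)')

    table = {}                        # digest -> [count, urlSaved]
    for line in open('referrer.log'):
        url = extract_url(line)
        d = hashlib.md5(url.encode()).digest()
        entry = table.setdefault(d, [0, False])
        entry[0] += 1
        if not entry[1]:
            # first sighting: write digest->URL to the database once
            db.execute('INSERT OR IGNORE INTO urls VALUES (?, ?)',
                       (d, url))
            entry[1] = True
    db.commit()

    # the database is only queried to build the report
    for d, (n, saved) in heapq.nlargest(50, table.items(),
                                        key=lambda kv: kv[1][0]):
        (url,) = db.execute('SELECT url FROM urls WHERE digest = ?',
                            (d,)).fetchone()
        print(n, url)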

-- 
René Pijlman

What do you want to learn?  http://www.leren.nl



