space-efficient top-N algorithm

William Park opengeometry at yahoo.ca
Sun Feb 9 13:06:17 EST 2003


David Garamond <lists at zara.6.isreserved.com> wrote:
> I am processing referrer log files with Python. The log files are
> ASCII text and line-based, similar to a webserver log but produced by
> our banner management (web bug, more precisely) software. Each line
> contains, among other things, a URL, a timestamp, and an IP address.
> One of the desired reports is the list of the top 50 recurring URLs
> within a certain period (e.g. monthly or weekly).
> 
> What I currently do is read the log files line by line, parse each
> line into its elements, and store the URL string as a key of a
> dictionary with the number of occurrences as the value. In the end I
> just go back to the dictionary and print the keys in descending order
> of their values.
> 
> However, the number of URLs is large and some of the URLs are long
> (>100 characters). My process grows to more than 100MB in size. I
> already truncate the URLs to a maximum of 80 characters before
> entering them into the dictionary, but that doesn't help much.
> 
> Is there a more space-efficient algorithm to produce a list of the
> top N recurring items from a large collection? I've thought about
> sorting the large log files externally, but I am reluctant to do this
> because of the large size of the files and the need to produce
> reports frequently as new logs arrive.
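
For reference, a minimal Python sketch of the approach described above;
the file name and field positions are illustrative assumptions:

    counts = {}
    with open("banner.log") as log:   # file name is a placeholder
        for line in log:
            fields = line.split()
            if not fields:
                continue
            url = fields[0][:80]      # assume URL is the first field; truncate to 80 chars
            counts[url] = counts.get(url, 0) + 1

    # print the keys in descending order of their counts
    for url in sorted(counts, key=counts.get, reverse=True)[:50]:
        print(counts[url], url)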


What's wrong with something like

    awk '/Jan 1, 2003/,/Feb 1, 2003/ {print $1}' log \
	| sort | uniq -c | sort -rn | head -50

where I am assuming that the URL is the first field in your log and
that URLs don't contain spaces. The second sort needs -rn so that the
counts produced by uniq -c are ordered numerically, most frequent
first, before head -50 takes the top 50.
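
A rough Python equivalent of that pipeline, in case the shell round
trip is unwanted (the date strings and file name are placeholders; the
awk range pattern is emulated with a flag, and heapq.nlargest picks the
top 50 without sorting every URL):

    from heapq import nlargest

    counts = {}
    in_range = False
    with open("log") as f:
        for line in f:
            if not in_range and "Jan 1, 2003" in line:
                in_range = True              # range pattern switches on
            if in_range:
                fields = line.split()
                if fields:
                    counts[fields[0]] = counts.get(fields[0], 0) + 1
                if "Feb 1, 2003" in line:
                    in_range = False         # ...and off, inclusive of this line

    # nlargest returns the 50 highest counts without sorting every entry
    for url, n in nlargest(50, counts.items(), key=lambda kv: kv[1]):
        print(n, url)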

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
Linux solution for data management and processing. 
