space-efficient top-N algorithm

Mon Feb 10 10:20:01 EST 2003

> David Garamond <lists at zara.6.isreserved.com> wrote in message
news:<mailman.1044791875.2092.python-list at python.org>...
> > I am processing referrer log files with Python. The log files are
in
> > ASCII text and line-based, similar to a webserver log but produced
by
> > our banner management (web bug, more precisely) software. Each
line
> > contains, among other, a URL, a timestamp, and an IP address. One
of the
> > desired report is the list of top 50 recurring URLs within a
certain
> > period (e.g. monthly or weekly).

> > What I currently do is just read the log files line by line, parse
it
> > into its elements, and store the URL string as the key of a
dictionary
> > and the number of occurence as the value. In the end I would just
get
> > back to the dictionary and print the keys in descending order of
the values.

The distribution of counts should tend to be fairly consistent from
period to period.  A few urls with high counts, more with medium
counts, and lots with few counts, down to one.  With a bit of
experience, you should have a pretty good idea of how many counts are
required to be in top 50.  So I would consider scanning dict with
iteritems to extract candidates, aiming for, say, the top 100.  IE,
keep items with count above threshhold.  If get too few or start to
get too many, restart with adjusted threshhold. Then sort that reduced
list.

Terry J. Reedy