[Tutor] Simple Stats on Apache Logs

Thu Feb 11 14:21:52 CET 2010

On Thu, Feb 11, 2010 at 4:56 AM, Lao Mao <laomao1975 at googlemail.com> wrote:
> Hi,
>
> I have 3 servers which generate about 2G of webserver logfiles in a day.
> These are available on my machine over NFS.
>
> I would like to draw up some stats which show, for a given keyword, how
> many times it appears in the logs, per hour, over the previous week.
>
> So the behavior might be:
>
> $ ./webstats --keyword downloader
>
> Which would read from the logs (which it has access to) and produce
> something like:
>
> Monday:
> 0000: 12
> 0100: 17
>
> etc
>
> I'm not sure how best to get started.  My initial idea would be to filter
> the logs first, pulling out the lines with matching keywords, then check the
> timestamp - maybe incrementing a dictionary if the log line was within a
> certain time?

I would use itertools.groupby() to group lines by hour, then look for
the keyword in each group and increment a count. groupby() only groups
consecutive items, which suits log files because their lines are
already in time order. The technique of stacking generators as a
processing pipeline might be useful. See David Beazley's
"Generator Tricks for Systems Programmers":
http://www.dabeaz.com/generators-uk/index.html
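
A minimal sketch of that idea, not a finished tool. It assumes the logs
use the usual Apache timestamp form, e.g. [11/Feb/2010:14:21:52 +0100],
and that day and month names are parsed and printed in an English
locale. The names read_lines, hour_of, count_keyword_by_hour and
format_report are made up for illustration.

```python
import itertools
import re
from collections import Counter
from datetime import datetime

# Assumption: Apache common/combined log format, where the timestamp
# appears as [11/Feb/2010:14:21:52 +0100]. Only day and hour are captured.
TIMESTAMP = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2})')


def read_lines(paths):
    """Generator stage: yield every line from each log file in turn."""
    for path in paths:
        with open(path, errors='replace') as f:
            yield from f


def hour_of(line):
    """Return the hour a line belongs to as a datetime, or None."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    # %b is locale-dependent; this assumes English month abbreviations.
    return datetime.strptime(match.group(1), '%d/%b/%Y:%H')


def count_keyword_by_hour(lines, keyword):
    """Group consecutive lines by hour and count those containing keyword."""
    counts = Counter()
    for hour, group in itertools.groupby(lines, key=hour_of):
        if hour is None:
            continue  # skip lines without a recognisable timestamp
        # += rather than = so an hour seen again (e.g. in a second
        # server's file) is added to, not overwritten.
        counts[hour] += sum(1 for line in group if keyword in line)
    return counts


def format_report(counts):
    """Render the counts as a day heading followed by 'HH00: n' lines."""
    out = []
    current_day = None
    for hour in sorted(counts):
        if hour.date() != current_day:
            current_day = hour.date()
            out.append(hour.strftime('%A') + ':')
        out.append('%s: %d' % (hour.strftime('%H00'), counts[hour]))
    return '\n'.join(out)
```

Usage would be something like
print(format_report(count_keyword_by_hour(read_lines(paths), 'downloader')))
where paths is a list of log file names. Restricting the report to the
previous week would be one more filter stage in the pipeline.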

Loghetti might also be useful as a starting point or code reference:
http://code.google.com/p/loghetti/

Kent