[Tutor] Simple Stats on Apache Logs

spir denis.spir at free.fr
Thu Feb 11 12:18:03 CET 2010


On Thu, 11 Feb 2010 09:56:51 +0000
Lao Mao <laomao1975 at googlemail.com> wrote:

> Hi,
> 
> I have 3 servers which generate about 2G of webserver logfiles in a day.
> These are available on my machine over NFS.
> 
> I would like to draw up some stats which shows, for a given keyword, how
> many times it appears in the logs, per hour, over the previous week.
> 
> So the behavior might be:
> 
> $ ./webstats --keyword downloader
> 
> Which would read from the logs (which it has access to) and produce
> something like:
> 
> Monday:
> 0000: 12
> 0100: 17
> 
> etc
> 
> I'm not sure how best to get started.  My initial idea would be to filter
> the logs first, pulling out the lines with matching keywords, then check the
> timestamp - maybe incrementing a dictionary if the logfile was within a
> certain time?
> 
> I'm not looking for people to write it for me, but I'd appreciate some
> guidance as to the approach and algorithm.  Also what the simplest
> presentation model would be.  Or even if it would make sense to stick it in
> a database!  I'll post back my progress.

As your logfiles are rather big, I would iterate line by line using "for line in file". Check each line to determine whether (1) the hour has changed, (2) it holds the given keyword. For the presentation, it depends on your expectations! Also, are the keywords constant (i.e. predefined at coding time)? You may think of a Python (nested) dict:

keywordStats = {}
...
keywordStats[one_keyword] = {
   "Monday": [12, 17, ...],
   ...
}

(hours in 24-hour format (actually 0..23) are implicit keys, as indices into the lists)
This leaves open the opportunity to read the info back into the program later...
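To make this concrete, here is a minimal sketch of the counting loop. It assumes Apache-style log lines with a timestamp like [11/Feb/2010:09:56:51 +0000]; the regex and the day-keyed layout are illustrative choices, not the only way to do it — adapt them to your actual log format.

```python
import re
from collections import defaultdict

# Matches the day, month, year and hour of an Apache-style timestamp,
# e.g. "[11/Feb/2010:09:56:51 +0000]" -> ('11', 'Feb', '2010', '09').
TIMESTAMP = re.compile(r'\[(\d{2})/(\w{3})/(\d{4}):(\d{2})')

def count_keyword(lines, keyword):
    # counts[day] is a list of 24 integers, one per hour (0..23),
    # so the hour is the implicit key, as discussed above.
    counts = defaultdict(lambda: [0] * 24)
    for line in lines:
        if keyword not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            day, month, year, hour = match.groups()
            counts["%s/%s/%s" % (day, month, year)][int(hour)] += 1
    return counts
```

You would call it with an open file object (count_keyword(open(path), "downloader")), so the 2G of logs are streamed line by line rather than loaded into memory.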


Denis
________________________________

life is strange

http://spir.wikidot.com/

