[Tutor] Logfile Manipulation

ALAN GAULD alan.gauld at btinternet.com
Mon Nov 9 10:10:57 CET 2009



> An apache logfile entry looks like this:
>
>89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
> /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
> HTTP/1.1" 200 50 "-" "-"
>
>I want to extract 24 hrs of data based on timestamps like this:
>
> [04/Nov/2009:04:02:10 +0000]


OK, it looks like you could use a regex to extract the first
thing you find between square brackets, then convert that to a time.
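
Something along these lines might do it. This is only a rough sketch,
assuming Python 3 (for the %z directive) and an English/C locale (for
the %b month name). The names TIMESTAMP_RE and parse_timestamp are
mine, purely for illustration:

```python
import re
from datetime import datetime

# Captures the contents of the first non-empty [...] group on the line.
TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")


def parse_timestamp(line):
    """Return a timezone-aware datetime for the first bracketed field.

    Returns None if there is no bracketed field or it is not a
    timestamp in the expected Apache format.
    """
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None


# The log entry from the question, joined onto one line.
sample = (
    '89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET '
    "/service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812 "
    'HTTP/1.1" 200 50 "-" "-"'
)
stamp = parse_timestamp(sample)
```

Note that search() stops at the first match, so the arg[] bits later
in the request are never looked at.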

> I also need to do some filtering (eg I actually don't want anything
> with service.php), 

That's easy enough to detect.

> and I also have to do some substitutions - that's
> trivial other than not knowing the optimum place to do it?  

Assuming they are trivial then...

> I do multiple passes?  Or should I try to do all the work at once,

I'd opt for doing it all in one pass. With such large files you really 
want to minimise the amount of time spent reading the file. 
Plus with such large files you will need/want to process them 
line by line anyway rather than reading the whole thing into memory.
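
A single pass could look roughly like the sketch below. It is a
generator, so only one line is held in memory at a time. The function
name filter_lines, its parameters, and the sample substitution are all
made up for illustration, and the plain substring test for service.php
assumes that string never shows up anywhere you want to keep (a
referrer field, say):

```python
import re
from datetime import datetime, timedelta, timezone

TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")


def filter_lines(lines, start, hours=24, skip="service.php", substitutions=()):
    """Yield lines stamped within [start, start + hours), in one pass.

    Lines containing `skip` are dropped, and each (old, new) pair in
    `substitutions` is applied to the lines that are kept.
    `start` must be a timezone-aware datetime.
    """
    end = start + timedelta(hours=hours)
    for line in lines:
        if skip in line:
            continue  # filter out unwanted requests
        match = TIMESTAMP_RE.search(line)
        if match is None:
            continue  # no bracketed field at all
        try:
            stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # bracketed field is not a timestamp
        if not (start <= stamp < end):
            continue  # outside the 24 hour window
        for old, new in substitutions:
            line = line.replace(old, new)
        yield line


# Tiny in-memory demonstration; a real run would pass an open file.
sample = [
    '1.2.3.4 - - [04/Nov/2009:04:02:10 +0000] "GET /service.php?s=nav HTTP/1.1" 200 50 "-" "-"\n',
    '1.2.3.4 - - [04/Nov/2009:05:00:00 +0000] "GET /index.html HTTP/1.1" 200 50 "-" "-"\n',
    '1.2.3.4 - - [06/Nov/2009:05:00:00 +0000] "GET /index.html HTTP/1.1" 200 50 "-" "-"\n',
]
start = datetime(2009, 11, 4, tzinfo=timezone.utc)
kept = list(filter_lines(sample, start, substitutions=[("index.html", "home.html")]))
```

Since an open file is itself an iterable of lines, you can hand one
straight to filter_lines and write the results out as they arrive.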

> Also what about reading from compressed files?  
> The data comes in as 6 gzipped logfiles 

Python has a module for that (gzip) but I've never used it.
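
Going by the standard library docs, reading gzipped text line by line
should be something like the sketch below. It assumes Python 3, where
gzip.open() accepts a text mode ("rt"). The helper name
read_gzipped_lines and the file name in the demo are hypothetical:

```python
import gzip
import os
import tempfile


def read_gzipped_lines(paths, encoding="utf-8", errors="replace"):
    """Yield text lines from each gzipped file in turn, one at a time."""
    for path in paths:
        # "rt" decompresses and decodes on the fly, so the whole file
        # is never loaded into memory.
        with gzip.open(path, "rt", encoding=encoding, errors=errors) as handle:
            for line in handle:
                yield line


# Demo: write a tiny gzipped file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "access.log.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("first line\nsecond line\n")
    lines = list(read_gzipped_lines([path]))
```

Passing all six file names in date order would give you one continuous
stream of lines to process.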

BTW, a quick Google search reveals that there are several packages
for handling Apache log files. That is probably worth investigating
before you start writing lots of code...

Examples:

Scratchy - The Apache Log Parser and HTML Report Generator for Python
Scratchy is an Apache Web Server log parser and HTML report generator written in Python. Scratchy was created by Phil Schwartz ...
scratchy.sourceforge.net/

Loghetti: an apache log file filter in Python - O'Reilly ONLamp Blog
18 Mar 2008 ... Loghetti: an apache log file filter in Python ... This causes loghetti to parse the query string, and return lines where the query parameter ...
www.oreillynet.com/.../blog/.../loghetti_an_apache_log_file_fi.html

HTH


Alan G.