Parsing apache log files

Josiah Carlson jcarlson at nospam.uci.edu
Fri Feb 20 12:20:00 EST 2004


> thanks, although reading that re makes my brain hurt! :), and I don't
> think it handles the case where the dashes are something else (the dash
> is a place holder for some data that wasn't there on this request,
> bytelength, referrer, something) but I'll look into it, thanks for the
> example. 

It depends on which dash you were talking about.  The dash immediately 
after the response code is the number of bytes sent, and is handled by 
the regular expression.

Unless you use identd checks, the first '-' will always be there.  The 
second '-' is the identity of the client given through HTTP auth, 
which may or may not be important to you.

Modifying the regular expression:
import re
rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) (-|\w*) (-|\w*) '
                  r'\[([^\[\]:]+):(\d+:\d+:\d+) -(\d\d\d\d)\] '
                  r'("[^"]*") (\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

a = rexp.match(line)  # line is one line read from the access log
if a is not None:
    a.group(1)  #IP address
    a.group(2)  #identd response (if any)
    a.group(3)  #http auth user
    a.group(4)  #day/month/year
    a.group(5)  #time of day
    a.group(6)  #timezone
    a.group(7)  #request
    a.group(8)  #code 200 for success, 404 for not found, etc.
    a.group(9)  #bytes transferred, or '-' if none were sent
    a.group(10) #referrer
    a.group(11) #browser
else:
    pass  #this line did not match.
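
For a quick check of how the groups come out, here is a minimal sketch 
that wraps the same idea in a function.  The names LOG_RE, FIELDS and 
parse_line, and both sample lines, are made up for illustration.  It 
also assumes you want the timezone sign captured (so '+' offsets match 
too), which differs from the expression above, and that a '-' byte 
count should be read as 0.

```python
import re

# Same layout as the expression above, except the timezone group keeps
# its sign so that both '-0700' and '+0100' match (an assumption).
LOG_RE = re.compile(
    r'(\d+\.\d+\.\d+\.\d+) (-|\w*) (-|\w*) '
    r'\[([^\[\]:]+):(\d+:\d+:\d+) ([-+]\d{4})\] '
    r'("[^"]*") (\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

# Hypothetical field names, one per group, in group order.
FIELDS = ('ip', 'identd', 'user', 'date', 'time', 'tz',
          'request', 'status', 'bytes', 'referrer', 'agent')


def parse_line(line):
    """Return a dict of fields for one log line, or None if no match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = dict(zip(FIELDS, m.groups()))
    rec['status'] = int(rec['status'])
    # A '-' here means no bytes were sent; treat it as 0 (an assumption).
    rec['bytes'] = 0 if rec['bytes'] == '-' else int(rec['bytes'])
    # Strip the surrounding double quotes from the quoted fields.
    for key in ('request', 'referrer', 'agent'):
        rec[key] = rec[key][1:-1]
    return rec


# Made-up sample lines in the combined log format.
full = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')
dashes = ('10.0.0.5 - - [20/Feb/2004:12:20:00 +0100] '
          '"GET /missing HTTP/1.1" 304 - "-" "ExampleAgent/1.0"')

for line in (full, dashes):
    print(parse_line(line))
```

Lines that do not fit the pattern come back as None, so you can count 
or log them instead of silently dropping them.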


There you go.
  - Josiah


