Parsing Apache log files

Jim Richardson warlock at eskimo.com
Fri Feb 20 00:33:44 EST 2004


I am pulling apart some big Apache logs (800-1000MB) for analysis and
stuffing the results into a MySQL database. Most of it goes OK, despite
my meager coding abilities. But every so often I run across "borken"
bits of data, like user agent strings that include "'/\ and such.
Although Apache escapes them when writing the log, they break my
somewhat clunky splits.

A typical (good) line looks like this:


111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] "GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - "http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

which I can split fine by splitting on the " first, then splitting each
bit on the appropriate thing, mostly spaces. But occasionally I get
something like


11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] "GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 "http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20  [ru\"]" 

Note the [ru\" at the end.
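
To make the failure concrete, here is a small sketch of the quote-first
split. The helper name naive_split is made up for illustration, and the
sample lines are shortened versions of the two above. The escaped quote
produces one extra piece, so every field after it ends up in the wrong
slot.

```python
def naive_split(line):
    """Split a log line on double quotes, as in the quote-first approach."""
    return line.strip().split('"')

# Shortened stand-ins for the good and bad lines quoted above.
good = ('111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] '
        '"GET /a.htm HTTP/1.1" 304 - "http://www.foobarp.org/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
bad = ('11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] '
       '"GET /b.htm HTTP/1.1" 200 541 "http://11.11.111.11/" '
       '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20  [ru\\"]"')

print(len(naive_split(good)))  # 7 pieces: six quotes
print(len(naive_split(bad)))   # 8 pieces: the escaped quote adds one
```

Counting the pieces is also a cheap way to spot the odd lines before
trying to split them any further.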


I am looking for a way to pull out the IP, day, time, requested URL,
referrer, bytes, status, and user agent. What I have, though a bit
crufty, works 99.99% of the time, but then something like this shows up.

I have a couple of approaches. One is to reject the bad entries, save
them to a file, then enter them manually. The problem is that with 10
million entries, and about 1 in 1000 being bad, that is on the order of
10,000 lines to fix by hand.
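
The reject-and-save approach could be sketched roughly as below. The
names keep_or_reject and toy_parse, and the file name rejects.log in the
comment, are hypothetical. The parser is assumed to return None for a
line it cannot handle; the toy one here only counts quotes and stands in
for whatever real splitting code is in use.

```python
import io

def keep_or_reject(lines, parse, rejects):
    """Yield parsed entries; write lines parse() cannot handle to rejects."""
    for line in lines:
        entry = parse(line)
        if entry is None:
            rejects.write(line.rstrip('\n') + '\n')
        else:
            yield entry

def toy_parse(line):
    # Stand-in parser for the demo: accept only lines with exactly 6 quotes.
    return line.split('"') if line.count('"') == 6 else None

# In real use this would be an open file, e.g. open('rejects.log', 'a').
rejects = io.StringIO()
sample = ['a "b" c "d" "e"\n', 'a "b" c "d" "e\\"f"\n']
good_entries = list(keep_or_reject(sample, toy_parse, rejects))
print(len(good_entries), repr(rejects.getvalue()))
```

Because it is a generator, the good entries can be fed to the database
as they come, without holding millions of lines in memory.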


Although as I write this, I think maybe I can use the \ to warn me and
behave accordingly? Hm, I'll have to try that.
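
One possible way to act on the backslash is a regular expression whose
quoted fields accept either an ordinary character or a backslash
followed by any character. This is only a sketch: it assumes every line
follows the layout of the two samples above, and the names LOG_RE and
parse_line and the group names are made up here.

```python
import re

# A quoted field: any run of non-quote, non-backslash characters or
# backslash-escaped characters, so \" does not end the field.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<day>[^:\]]+):(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>(?:[^"\\]|\\.)*)" '
    r'"(?P<agent>(?:[^"\\]|\\.)*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # The request looks like 'GET /path HTTP/1.1'; keep only the URL part.
    parts = fields.pop('request').split()
    fields['url'] = parts[1] if len(parts) > 1 else ''
    return fields

bad = ('11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] '
       '"GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 '
       '"http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" '
       '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20  [ru\\"]" ')
print(parse_line(bad)['agent'])
```

The agent field comes back with the \" still in it, so it would need
unescaping (and then proper quoting for MySQL) before insertion. Lines
that do not match return None and can be set aside for a closer look.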

In the meantime, is there some obvious method or module that I have
missed?
 

-- 
Jim Richardson     http://www.eskimo.com/~warlock
 Windows is the answer, but only if the question was
 'what is the intellectual equivalent of being a galley slave?' 
	--Larry Smith, in comp.os.linux.misc


