[Tutor] log parser speed optimization
Alex
python-tutor-list@tagancha.org
Fri May 16 11:17:02 2003
Hello,
I looked at the Tutor archives, but did not find the answer to my
question in the time I had, so I hope I can get some pointers here.
I've been lurking on the list for a looong time, but still can't
remember all the excellent suggestions made by our more experienced
pythonistas :)
I have an itch to put all the log files from our FTP server that have
accumulated over the years into a MySQL database, so they can be
queried and searched for patterns. I decided to write two scripts to do
that. The first one parses the raw log files and writes the parsed data
into an output file. I wrote it and it works as needed, but I wonder if
I could speed it up. It takes 75 to 100 seconds to parse 419951
records of the type
'xx.xxx.xxx.xxx - xxxx [01/May/2003:07:08:13 -0500] "GET /dir1/subdir1/subdir2/subdir3/file1" 200 5069672\n'
into something like
'2003-05-01 07:08:13,xxxx,xx.xxx.xxx.xxx,GET,200,/dir1/subdir1/subdir2/subdir3/file1,5069672'
which gives me a speed of 4200 to 5600 records per second. However,
100 seconds is still a relatively long time. I have attached the script
to this message. If you would be so kind as to point me to the
bottlenecks in it, I would be very grateful.
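So far I have only timed the whole run with time(). If it helps, I
suppose I could also profile it with the standard profile module to see
where the time actually goes, roughly along these lines (just a sketch;
parse_all() would be a small wrapper around my read/parse/write loop
and is not in the attached script):

    import profile, pstats

    # parse_all() is a hypothetical wrapper around the read/parse/write loop
    profile.run('parse_all("ftpd.log", "pureftpd-parsed.log")', 'parse.prof')
    pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(10)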
Thank you,
Alex.
-------------- attachment: parseftplog.py --------------
#!/usr/bin/python
"""This script will parse pure-ftpd logs and write out parsed data for
later analysis.
Date: Fri May 16 10:08:46 EST 2003
Version: 0.05
ChangeLog:
Between ver. 0.04 - 0.05:
Fri May 16 10:08:46 EST 2003
Changed container objects into tuples and dictionaries instead of
lists where I could.
Used 'strip().split()' when parsing raw logfile lines.
Changed date and time parsing algorithm and output to produce the
following format (yyyy-mm-dd hh:mm:ss) suitable for MySQL inserts.
Removed an intermediate list. Now it's a 3 step process:
- read a line from input file
- parse the line and datetime separately
- write the line out to the output file
Between ver. 0.03 - 0.04:
Fri May 2 16:35:40 EST 2003
Added parsing of the date/time into separate list items for more
convenience during report generation
Between ver. 0.02 - 0.03:
Fri May 2 11:14:01 EST 2003
Programmed sorting and removal of duplicates. Data is parsed
cleanly and the output is written into a CSV file for later
analysis by report-generating scripts. No bugs are seen.
Between ver. 0.01 - 0.02:
Thu May 1 17:12:33 EST 2003
Corrected parsing scheme. Some '"' characters were still hanging
around and the data weren't separated cleanly. No sorting or
removal of duplicates yet.
Version 0.01:
Wed May 1 10:42:27 EST 2003
Whipped up rough python pseudocode that kind of works.
"""
def usage(prog="purelog"):
    print """
%(prog)s: Parser of pure-ftpd log files
%(prog)s [-h] [-t] [-o <output_file>] <filename(s)>
    -h                  print this message
    -o <output_file>    file to put the resulting parsed logs into
    -t                  run a test with 3 example logs
    <filename(s)>       name[s] of the log file[s] to parse
""" % {'prog': prog}
def parsedate(datetime):
    """Convert '01/May/2003:07:08:13' into '2003-05-01 07:08:13'."""
    modict = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
              'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
              'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
    # '01/May/2003:07:08:13' -> ['01', 'May', '2003', '07', '08', '13']
    tempdt = ":".join(datetime.split('/')).split(':')
    tempdt[1] = modict[tempdt[1]]
    # reorder into (yyyy, mm, dd, hh, mm, ss) for the MySQL-friendly format
    tupdt = (tempdt[2], tempdt[1], tempdt[0], tempdt[3], tempdt[4], tempdt[5])
    parseddt = '%s-%s-%s %s:%s:%s' % tupdt
    return parseddt
def parseline(logline):
    """Turn one raw pure-ftpd log line into a comma-separated record."""
    inlist = logline.strip().split()
    # field 3 is '[dd/Mon/yyyy:hh:mm:ss'; drop the leading '[' before parsing
    datetime = parsedate(inlist.pop(3)[1:])
    user = inlist.pop(2)
    # discard the '-' placeholder and the timezone offset, e.g. '-0500]'
    inlist.pop(1)
    inlist.pop(1)
    ip = inlist.pop(0)
    action = inlist.pop(0)[1:]          # strip the leading '"' from '"GET'
    fsize = inlist.pop(-1)
    result = inlist.pop(-1)
    # whatever is left is the file name (it may contain spaces);
    # [:-1] strips the trailing '"'
    fname = " ".join(inlist)[:-1]
    return "%s,%s,%s,%s,%s,%s,%s" % (datetime, user, ip, action, result, fname, fsize)
def testrun():
    line1 = 'xx.xxx.xxx.xxx - xxxx [01/May/2003:07:08:13 -0500] "GET /dir1/subdir1/subdir2/subdir3/file1" 200 5069672\n'
    line2 = 'xxx.xxx.xx.xx - xxxx [12/Sep/2002:12:42:17 -0600] "GET /dir2/subdir1/subdir2/Libraries/10 PCRs for probe generation to get rid of second set of majors.tif" 200 84066\n'
    line3 = 'xxx.xxx.xx.xx - xxxxxx [29/Jul/2002:16:12:37 -0600] "PUT /dir3/subdir1/sub dir 2/file1" 200 158300\n'
    pseudofile = [line1, line2, line3]
    for pseudoline in pseudofile:
        outline = parseline(pseudoline) + '\n'
        print outline
if __name__ == '__main__':
    import sys, getopt
    from time import time

    starttime = time()
    # collect command-line options into a dict for easy lookup
    o, a = getopt.getopt(sys.argv[1:], 'hto:')
    opts = {}
    for k, v in o:
        opts[k] = v
    if opts.has_key('-h'):
        usage(); sys.exit()
    if opts.has_key('-t'):
        testrun(); sys.exit('Test run completed successfully!')
    if opts.has_key('-o'):
        outfilename = opts['-o']
        outmode = 'w'
        outfile = file(outfilename, outmode)
    else:
        print 'Output file name is "pureftpd-parsed.log"'
        outfile = file('pureftpd-parsed.log', 'w')
    if len(a) < 1:
        usage(); sys.exit('log file[s] name/pattern missing')
    logfiles = a
    for i in logfiles:
        print 'Log file to process:', i
        # gzipped logs are handled transparently via the gzip module
        if i.split('.')[-1] == 'gz':
            import gzip
            logfile = gzip.open(i, 'rt')
        else:
            logfile = file(i, 'rt')
        while 1:
            line = logfile.readline()
            if not line:
                break
            else:
                outline = parseline(line) + '\n'
                outfile.write(outline)
        logfile.close()
    # close the output file only after all input logs have been processed
    outfile.close()
    endtime = time()
    runtime = endtime - starttime
    print 'Your run proceeded for %.2f seconds' % (runtime)
    sys.exit('All raw log files have been parsed. Goodbye!')
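One direction that might address the speed question above (untested
against the real logs, so only a sketch, not a drop-in replacement):
build the month table once at module level, index the split fields
directly instead of popping them one at a time, and write each file's
output with a single writelines() call. parseline_fast() and
parse_file() below are illustrative names, not part of the attached
script.

    MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
              'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
              'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

    def parseline_fast(logline):
        # Same field layout as parseline() above, but no per-call dict
        # construction and no repeated pop() calls.
        fields = logline.split()
        # fields[3] looks like '[01/May/2003:07:08:13'
        day, mon, rest = fields[3][1:].split('/')
        year, hh, mm, ss = rest.split(':')
        datetime = '%s-%s-%s %s:%s:%s' % (year, MONTHS[mon], day, hh, mm, ss)
        ip, user = fields[0], fields[2]
        action = fields[5][1:]                  # strip the leading '"'
        result, fsize = fields[-2], fields[-1]
        # everything between the action and the result code is the file name
        fname = ' '.join(fields[6:-2])[:-1]     # strip the trailing '"'
        return '%s,%s,%s,%s,%s,%s,%s' % (datetime, user, ip, action,
                                         result, fname, fsize)

    def parse_file(logfile, outfile):
        # one writelines() call per input file instead of one write() per line
        outlines = [parseline_fast(line) + '\n' for line in logfile.readlines()]
        outfile.writelines(outlines)

Whether any of this actually wins time would need to be checked with a
profile run; the dominant cost may just as easily be plain disk I/O or,
for the .gz logs, the decompression.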