[Tutor] log parser speed optimization

Alex python-tutor-list@tagancha.org
Fri May 16 11:17:02 2003


Hello,

I looked at the Tutor archives, but did not find the answer to my
question in the time I had, so I hope I can get some pointers here.
I've been lurking on the list for a looong time, but still can't
remember all the excellent suggestions made by our more experienced
pythonistas :)

I've had an itch to put all the log files that have accumulated on our
FTP server over the years into a MySQL database, so they can be queried
and searched for patterns. I decided to write two scripts to do that.
The first one parses the raw log files and writes the parsed data to an
output file. I wrote it and it works as needed, but I wonder if I could
speed it up. It takes 75 to 100 seconds to parse 419951 records of the
form
'xx.xxx.xxx.xxx - xxxx [01/May/2003:07:08:13 -0500] "GET /dir1/subdir1/subdir2/subdir3/file1" 200 5069672\n'
into something like
'2003-05-01 07:08:13,xxxx,xx.xxx.xxx.xxx,GET,200,/dir1/subdir1/subdir2/subdir3/file1,5069672'

which gives me a speed of 4200 to 5600 records per second. However,
100 seconds is still a relatively long time. I have attached the script
to this message. If you would be so kind as to point out the
bottlenecks in it, I would be very grateful.
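
A minimal, untested sketch for finding the hot spots with the standard
profile module (where 'sample.log' is just a placeholder file name):

    import profile, pstats

    # collect per-function timings while parsing one file; parseline()
    # must be defined or imported in __main__ when this runs
    profile.run("for l in file('sample.log'): parseline(l)", 'parse.prof')
    pstats.Stats('parse.prof').sort_stats('time').print_stats(10)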

Thank you,

Alex.

-------------- attachment: parseftplog.py --------------

#!/usr/bin/python

"""This script will parse pure-ftpd logs and write out parsed data for
later analysis.
Date:   Fri May 16 10:08:46 EST 2003
Version: 0.05
ChangeLog:
    Between ver. 0.04 - 0.05:
        Fri May 16 10:08:46 EST 2003
        Changed container objects into tuples and dictionaries instead of
        lists where I could.
        Used 'strip().split()' when parsing raw logfile lines.
        Changed date and time parsing algorithm and output to produce the
        following format (yyyy-mm-dd hh:mm:ss) suitable for MySQL inserts.
        Removed an intermediate list. Now it's a 3 step process:
            - read a line from input file
            - parse the line and datetime separately
            - write the line out to the output file
    Between ver. 0.03 - 0.04:
        Fri May  2 16:35:40 EST 2003
        Added parsing of the date/time into separate list items for more
        convenience during report generation
    Between ver. 0.02 - 0.03:
        Fri May  2 11:14:01 EST 2003
        Programmed sorting and removal of duplicates. Data is parsed
        cleanly and the output is written into a csv file for later
        analysis by report-generating scripts. No bugs are seen.
    Between ver. 0.01 - 0.02:
        Thu May  1 17:12:33 EST 2003
        Corrected parsing scheme. Some '"' characters were still hanging
        around and the data weren't separated cleanly. No sorting or
        removal of duplicates is in, yet.
    Version 0.01:
        Thu May  1 10:42:27 EST 2003
        Whipped up rough python pseudocode that kind of works.
"""

def usage(prog="purelog"):
    # print a help message; 'prog' is the name to show for this script
    print """
    %(prog)s: Parser of pure-ftpd log files

    %(prog)s [-h] [-t] [-o <output_file>] <filename(s)>

    -h                  print this message
    -o <output_file>    file to put the resulting parsed logs into
    -t                  run a test with 3 example logs
    <filename(s)>       name[s] of the log file[s] to parse

    """ % {'prog': prog}

def parsedate(datetime):
    # map month abbreviations to the two-digit numbers MySQL expects
    modict = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
              'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
              'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
    # '01/May/2003:07:08:13' -> ['01', 'May', '2003', '07', '08', '13']
    tempdt = ":".join(datetime.split('/')).split(':')
    tempdt[1] = modict[tempdt[1]]
    # reorder into yyyy-mm-dd hh:mm:ss
    tupdt = (tempdt[2], tempdt[1], tempdt[0], tempdt[3], tempdt[4], tempdt[5])
    return '%s-%s-%s %s:%s:%s' % tupdt
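
# Possible speedup (untested sketch): build the month table once at
# module level and split the timestamp directly, so nothing is
# rebuilt on every call.
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def parsedate_fast(datetime):
    # '01/May/2003:07:08:13' -> '2003-05-01 07:08:13'
    day, mon, rest = datetime.split('/')
    year, hh, mm, ss = rest.split(':')
    return '%s-%s-%s %s:%s:%s' % (year, MONTHS[mon], day, hh, mm, ss)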

def parseline(logline):
    inlist = logline.strip().split()
    # pop the fixed fields out so that only the file name (which may
    # contain spaces) is left for joining back together
    datetime = parsedate(inlist.pop(3)[1:])    # strip the leading '['
    user = inlist.pop(2)
    inlist.pop(1)                              # discard the '-'
    inlist.pop(1)                              # discard the timezone, e.g. '-0500]'
    ip = inlist.pop(0)
    action = inlist.pop(0)[1:]                 # strip the leading '"'
    fsize = inlist.pop(-1)
    result = inlist.pop(-1)
    fname = " ".join(inlist)[:-1]              # strip the trailing '"'
    return "%s,%s,%s,%s,%s,%s,%s" % (datetime, user, ip, action, result, fname, fsize)

def testrun():
    # parse three representative log lines and print the results;
    # parseline() strips the trailing newline, so plain print is enough
    line1 = 'xx.xxx.xxx.xxx - xxxx [01/May/2003:07:08:13 -0500] "GET /dir1/subdir1/subdir2/subdir3/file1" 200 5069672\n'
    line2 = 'xxx.xxx.xx.xx - xxxx [12/Sep/2002:12:42:17 -0600] "GET /dir2/subdir1/subdir2/Libraries/10 PCRs for probe generation to get rid of second set of majors.tif" 200 84066\n'
    line3 = 'xxx.xxx.xx.xx - xxxxxx [29/Jul/2002:16:12:37 -0600] "PUT /dir3/subdir1/sub dir 2/file1" 200 158300\n'
    for pseudoline in (line1, line2, line3):
        print parseline(pseudoline)

if __name__ == '__main__':
    import sys, getopt
    from time import time
    starttime = time()
    try:
        o, a = getopt.getopt(sys.argv[1:], 'hto:')
    except getopt.GetoptError, e:
        usage(); sys.exit(str(e))
    opts = dict(o)
    if '-h' in opts:
        usage(); sys.exit()
    if '-t' in opts:
        testrun(); sys.exit('Test run completed successfully!')
    if len(a) < 1:
        usage(); sys.exit('log file[s] name/pattern missing')
    if '-o' in opts:
        outfilename = opts['-o']
    else:
        outfilename = 'pureftpd-parsed.log'
        print 'Output file name is "%s"' % outfilename
    outfile = file(outfilename, 'w')
    for i in a:
        print 'Log file to process:', i
        if i.endswith('.gz'):
            import gzip
            logfile = gzip.open(i)      # default mode is 'rb'
        else:
            logfile = file(i)
        # iterating over the file object reads ahead in blocks, which is
        # noticeably faster than one readline() call per record
        for line in logfile:
            outfile.write(parseline(line) + '\n')
        logfile.close()
    outfile.close()
    runtime = time() - starttime
    print 'Your run proceeded for %.2f seconds' % runtime
    sys.exit('All raw log files have been parsed. Goodbye!')
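
# Another possible speedup (untested sketch): build all the output lines
# in memory and write them with a single writelines() call, trading
# memory for fewer write() calls.  Something like:
#
#     outlines = [parseline(line) + '\n' for line in logfile]
#     outfile.writelines(outlines)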

