Help with script with performance problems

Dennis Roberts googlegroups at spacerodent.org
Sun Nov 23 02:35:56 EST 2003


I have a script that parses a DNS query log and generates some statistics.
For a 750MB file, a Perl script using the same method (splits) parses
the file in 3 minutes.  My Python script takes 25 minutes.  That is
enough of a difference that unless I can figure out what I did
wrong, or find a better way of doing it, I might not be able to use Python
(since most of what I do is parsing various logs).  The main reason to
try Python is that I had to look at some early scripts I wrote in Perl and
had no idea what the hell I was thinking or what the script even did!
After some googling and reading Eric Raymond's essay on Python I jumped
in :)  Here is my script.  I am looking for constructive comments;
please don't bash my newbie code.

#!/usr/bin/python -u

import string
import sys

clients = {}
queries = {}
count = 0

print "Each dot is 100000 lines..."

f = sys.stdin

while 1:

    line = f.readline()

    if count % 100000 == 0:
        sys.stdout.write(".")

    if line:
        splitline = string.split(line)

        try:
            (month, day, time, stype, source, qtype, query, ctype,
             record) = splitline
        except ValueError:
            print "problem splitting line", count
            print line
            break

        try:
            words = string.split(source,'#')
            source = words[0]
        except:
            print "problem splitting source", count
            print line
            break

        if clients.has_key(source):
            clients[source] = clients[source] + 1
        else:
            clients[source] = 1

        if queries.has_key(query):
            queries[query] = queries[query] + 1
        else:
            queries[query] = 1

    else:
        print
        break

    count = count + 1

f.close()

print count, "lines processed"

for numclient, count in clients.items():
    if count > 100000:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.items():
    if count > 100000:
        print "%s,%s" % (numquery, count)
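For comparison, here is a minimal sketch of the same counting loop restructured in ways that might reduce the per-line overhead. It is untested against the real 750MB log, so treat it as a starting point rather than a fix. It iterates over the file object directly instead of calling readline() in a while loop, uses string methods instead of the string module, and replaces the has_key() checks with dict.get(). The function name count_queries is made up for illustration, and the nine-field layout is assumed from the tuple unpacking in the script above. Unlike the original, it skips malformed lines instead of stopping at the first one. Separately, the -u flag on the #! line makes stdin unbuffered in Python 2, which may be part of the slowdown, so it could be worth timing the script without it.

```python
def count_queries(lines):
    """Count hits per client address and per query name.

    `lines` is any iterable of log lines (for example sys.stdin).
    Assumes the nine whitespace-separated fields the original script
    unpacks: month day time stype source qtype query ctype record.
    """
    clients = {}
    queries = {}
    processed = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip lines that do not match the assumed layout
        source = fields[4].split('#')[0]  # keep the text before '#'
        query = fields[6]
        # dict.get supplies 0 for a key seen for the first time
        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
        processed += 1
    return clients, queries, processed
```

Calling count_queries(sys.stdin) returns the two dictionaries plus the number of lines counted, and the two reporting loops at the end of the script can then run over the results unchanged.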



