Help with script with performance problems

Sun Nov 23 16:04:36 EST 2003

Dennis Roberts wrote:

> I have a script to parse a dns querylog and generate some statistics.
> For a 750MB file a perl script using the same methods (splits) can
> parse the file in 3 minutes.  My python script takes 25 minutes.  It
> is enough of a difference that unless I can figure out what I did
> wrong or a better way of doing it I might not be able to use python
> (since most of what I do is parsing various logs).  The main reason to
> try python is I had to look at some early scripts I wrote in perl and
> had no idea what the hell I was thinking or what the script even did!
> After some googling and reading Eric Raymonds essay on python I jumped
> in:)  Here is my script.  I am looking for constructive comments -
> please don't bash my newbie code.

Below is my version of your script. It tries to use more idiomatic Python
and is about 20% faster on some bogus data, but that is nowhere near enough
to close the performance gap to the Perl script that you report.
However, it took 143 seconds to process 10**7 lines generated by

<makesample.py>
import itertools, sys
sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
thousand = itertools.cycle(range(1000))
hundred = itertools.cycle(range(100))

out = file(sys.argv[1], "w")
try:
    try:
        count = int(sys.argv[2])
    except IndexError:
        count = 10**7
    for i in range(count):
        print >> out, sample % (i, thousand.next(), hundred.next())
finally:
    out.close()
</makesample.py>

with Python 2.3.2 on my 2.6GHz P4. Going by your ratio of 3 to 25 minutes,
would that mean Perl would do it in 17 seconds? Anyway, I suspect the
performance problem is rather your computer :-) Python should be fast
enough for the purpose.
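
For what it's worth, the 17 seconds is nothing more than my 143 second run
scaled by the 3 minute / 25 minute ratio from the original post. A minimal
sketch of that arithmetic (the variable names are mine, and it assumes the
ratio carries over unchanged to my machine and my generated data, which is
far from certain):

```python
# Figures reported in the original post for the 750MB log.
perl_minutes = 3.0
python_minutes = 25.0

# My own measurement: 10**7 generated lines, Python 2.3.2, 2.6GHz P4.
my_python_seconds = 143.0

# Assumption: the Perl/Python ratio is the same on my data and hardware.
estimated_perl_seconds = my_python_seconds * perl_minutes / python_minutes
print("estimated Perl time: %.1f s" % estimated_perl_seconds)
```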

Peter

<parselog.py>
#!/usr/bin/python -u
#Warning, not seriously tested
import sys

#import time
#starttime = time.time()

clients = {}
queries = {}
lineNo = -1

threshold = 100
pointmod = 100000

f = file(sys.argv[1])
try:
    print "Each dot is %d lines..." % pointmod
    for lineNo, line in enumerate(f):
        if lineNo % pointmod == 0:
            sys.stdout.write(".")

        try:
            month, day, timestr, stype, source, qtype, query, ctype, record = line.split()
        except ValueError:
            raise Exception("problem splitting line %d\n%s" % (lineNo, line))

        source = source.split('#', 1)[0]

        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
finally:
    f.close()

print
print lineNo+1, "lines processed"

for numclient, count in clients.iteritems():
    if count > threshold:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.iteritems():
    if count > threshold:
        print "%s,%s" % (numquery, count)

#print "time:", time.time() - starttime
</parselog.py>
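
To make the per-line work in parselog.py concrete, here is a small sketch
that runs the same split-and-count steps on a single line in the format
makesample.py writes. The values filled into the three %d slots (all 0) are
picked for illustration only; a real query log will look different.

```python
# One line in the makesample.py format; the three zeros are illustrative.
sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
line = sample % (0, 0, 0)

clients = {}
queries = {}

# Exactly nine whitespace-separated fields are expected;
# any other count makes the unpacking raise ValueError.
month, day, timestr, stype, source, qtype, query, ctype, record = line.split()

# Keep only the part of the source field before the '#'.
source = source.split('#', 1)[0]

# dict.get with a default of 0 handles keys not seen before.
clients[source] = clients.get(source, 0) + 1
queries[query] = queries.get(query, 0) + 1

print(clients)
print(queries)
```

All of the time goes into these few statements repeated once per line, so
this is the part to measure when comparing against the Perl version.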