An IMHO interesting optimization question...

Wed Apr 24 19:15:14 EDT 2002

Hello,

I have an optimiziation question. Given the following code
which generates a word length statistic up to words of the length
<longWord>.

=-------------------------------------------------=
#!/usr/bin/python

def count( word ):
   "Increments the wordcounter"
   length = len( word )
   if length > longWord:
       length = longWord
   stats[length] += 1

def printout():
    "Formatted output of dictionary"
    for i in xrange( 1, longWord ):
        print "Length %d = #%d" % ( i, stats[i] )
    print "Length >%d = #%d" % ( longWord, stats[longWord] )

# import libraries and check for help request

import sys, re
if len( sys.argv ) != 1:
    print """
wordStatistic.py 1.0
--------------------
Reads from stdin and prints a word statistic.
Usage: cat <file> | wordStatistic.py
"""
    sys.exit( 0 )

longWord = 15
defWord = "[A-Za-z]+"

stats = {}
wordExp = re.compile( defWord )

# initialize dictionary, so we don't have to check for key-existance later
for i in xrange( 1, longWord+1 ):
    stats[i] = 0

# main loop. iterates over standard input as long as there's
# something coming in...
while 1:
    line = sys.stdin.readline()
    if line:
        for word in wordExp.findall( line ):
            count( word )
    else:
        break
=-------------------------------------------------=

For a given document, this program takes 20 seconds on my machine.
I now try to optimize it by eliminating the while loop. The 2nd version looks like:

=-------------------------------------------------=
# main loop. iterates over standard input as long as there's
# something coming in...
lines = sys.stdin.readlines()
for line in lines:
    for word in wordExp.findall( line ):
        count( word )
=-------------------------------------------------=

This takes 14 seconds on my machine which seems to be quite an achievement.

Now on to the next. I try to eliminate the for loop by substituting with a map statement,
which is supposed to be faster (because it's looping in C). The 3rd version looks like:

=-------------------------------------------------=
# main loop. iterates over standard input as long as there's
# something coming in...
lines = sys.stdin.readlines()
for line in lines:
    map( count, wordExp.findall( line ) )
=-------------------------------------------------=

Now this takes 17 seconds on my machine. Wow? Why that? Why does it take longer?

On to the final version, which transforms the outer loop into a map statement:

=-------------------------------------------------=
# main loop. iterates over standard input as long as there's
# something coming in...
map( lambda line: map( count, wordExp.findall( line ) ), sys.stdin.readlines() )
=-------------------------------------------------=

This is even worse - it takes 18 seconds on my computer. What am I missing? And do you
know a version which would be even faster than the 2nd version?

Yours,

:M:
--
|----------------------------------------------------------------------------|
| Dipl.-Inf. Michael 'Mickey' Lauer    mickey at tm.informatik.uni-frankfurt.de |
|   Raum 10b - ++49 69 798 28358        Fachbereich Informatik und Biologie  |
|----------------------------------------------------------------------------|