An IMHO interesting optimization question...
Michael 'Mickey' Lauer
mickey at tm.informatik.uni-frankfurt.de
Wed Apr 24 19:15:14 EDT 2002
Hello,
I have an optimiziation question. Given the following code
which generates a word length statistic up to words of the length
<longWord>.
=-------------------------------------------------=
#!/usr/bin/python
def count( word ):
"Increments the wordcounter"
length = len( word )
if length > longWord:
length = longWord
stats[length] += 1
def printout():
"Formatted output of dictionary"
for i in xrange( 1, longWord ):
print "Length %d = #%d" % ( i, stats[i] )
print "Length >%d = #%d" % ( longWord, stats[longWord] )
# import libraries and check for help request
import sys, re
if len( sys.argv ) != 1:
print """
wordStatistic.py 1.0
--------------------
Reads from stdin and prints a word statistic.
Usage: cat <file> | wordStatistic.py
"""
sys.exit( 0 )
longWord = 15
defWord = "[A-Za-z]+"
stats = {}
wordExp = re.compile( defWord )
# initialize dictionary, so we don't have to check for key-existance later
for i in xrange( 1, longWord+1 ):
stats[i] = 0
# main loop. iterates over standard input as long as there's
# something coming in...
while 1:
line = sys.stdin.readline()
if line:
for word in wordExp.findall( line ):
count( word )
else:
break
=-------------------------------------------------=
For a given document, this program takes 20 seconds on my machine.
I now try to optimize it by eliminating the while loop. The 2nd version looks like:
=-------------------------------------------------=
# main loop. iterates over standard input as long as there's
# something coming in...
lines = sys.stdin.readlines()
for line in lines:
for word in wordExp.findall( line ):
count( word )
=-------------------------------------------------=
This takes 14 seconds on my machine which seems to be quite an achievement.
Now on to the next. I try to eliminate the for loop by substituting with a map statement,
which is supposed to be faster (because it's looping in C). The 3rd version looks like:
=-------------------------------------------------=
# main loop. iterates over standard input as long as there's
# something coming in...
lines = sys.stdin.readlines()
for line in lines:
map( count, wordExp.findall( line ) )
=-------------------------------------------------=
Now this takes 17 seconds on my machine. Wow? Why that? Why does it take longer?
On to the final version, which transforms the outer loop into a map statement:
=-------------------------------------------------=
# main loop. iterates over standard input as long as there's
# something coming in...
map( lambda line: map( count, wordExp.findall( line ) ), sys.stdin.readlines() )
=-------------------------------------------------=
This is even worse - it takes 18 seconds on my computer. What am I missing? And do you
know a version which would be even faster than the 2nd version?
Yours,
:M:
--
|----------------------------------------------------------------------------|
| Dipl.-Inf. Michael 'Mickey' Lauer mickey at tm.informatik.uni-frankfurt.de |
| Raum 10b - ++49 69 798 28358 Fachbereich Informatik und Biologie |
|----------------------------------------------------------------------------|
More information about the Python-list
mailing list