An IMHO interesting optimization question...
Fernando Pérez
fperez528 at yahoo.com
Wed Apr 24 21:04:40 EDT 2002
Michael 'Mickey' Lauer wrote:
> Hello,
>
> I have an optimiziation question. Given the following code
> which generates a word length statistic up to words of the length
> <longWord>.
>
[snip rest]
The simple rewrite below gives the following results:
[MySQL-3.23.47]> time ~/test/wordstats_ori.py < manual.texi
5.000u 0.010s 0:05.02 99.8% 0+0k 0+0io 484pf+0w
[MySQL-3.23.47]> time ~/test/wordstats_ori.py < manual.texi
4.960u 0.030s 0:05.01 99.6% 0+0k 0+0io 484pf+0w
[MySQL-3.23.47]> time ~/test/wordstats.py < manual.texi
2.420u 0.260s 0:02.70 99.2% 0+0k 0+0io 484pf+0w
[MySQL-3.23.47]> time ~/test/wordstats.py < manual.texi
2.560u 0.130s 0:02.70 99.6% 0+0k 0+0io 484pf+0w
This is running on a 2.3Mb file (the MySQL manual in texinfo format).
So these simple changes (using an array of integers, inlining count and
forgoing the many unnecessary loops) give a factor of 2 improvement. Plus the
code is more readable IMHO.
A simple check shows that the expensive part is the regexp stuff. So you could
optimize the counting trivially in C using weave, but it wouldn't buy you
much: about 80% of the execution time is spent in the regexp findall.
So it looks like if you need even faster execution, you'll have to go beyond
python, but writing regexp code by hand is _exactly_ the kind of thing for
which I'll kick and scream to avoid C. Just have a cup of coffee while it
finishes :)
Cheers,
f.
ps. Don't do 'cat file | script', that's what 'script < file' is for ;)
#------------------------------------------------------------------
#!/usr/bin/env python
def printout():
"Formatted output of dictionary"
for i in range( 1, longWord ):
print "Length %d = #%d" % ( i, stats[i] )
print "Length >%d = #%d" % ( longWord, stats[longWord] )
# import libraries and check for help request
import sys, re
if len( sys.argv ) != 1:
print """
wordStatistic.py 1.0
--------------------
Reads from stdin and prints a word statistic.
Usage: wordStatistic.py < file
"""
sys.exit( 0 )
defWord = "[A-Za-z]+"
wordExp = re.compile( defWord )
# initialize array, so we don't have to check for key-existance later
longWord = 15
stats = [0]*(longWord+1)
# main loop. iterates over standard input as long as there's
# something coming in...
for word in wordExp.findall( sys.stdin.read() ):
try:
stats[len(word)] += 1
except IndexError:
stats[-1] += 1
#printout()
#------------------------------------------------------------------
More information about the Python-list
mailing list