An IMHO interesting optimization question...

Wed Apr 24 21:04:40 EDT 2002

Michael 'Mickey' Lauer wrote:

> Hello,
> 
> I have an optimiziation question. Given the following code
> which generates a word length statistic up to words of the length
> <longWord>.
> 

[snip rest]

The simple rewrite below gives the following results:

[MySQL-3.23.47]> time ~/test/wordstats_ori.py < manual.texi
5.000u 0.010s 0:05.02 99.8%     0+0k 0+0io 484pf+0w
[MySQL-3.23.47]> time ~/test/wordstats_ori.py < manual.texi
4.960u 0.030s 0:05.01 99.6%     0+0k 0+0io 484pf+0w
[MySQL-3.23.47]> time ~/test/wordstats.py < manual.texi
2.420u 0.260s 0:02.70 99.2%     0+0k 0+0io 484pf+0w
[MySQL-3.23.47]> time ~/test/wordstats.py < manual.texi
2.560u 0.130s 0:02.70 99.6%     0+0k 0+0io 484pf+0w

This is running on a 2.3Mb file (the MySQL manual in texinfo format).

So these simple changes (using an array of integers, inlining count and 
forgoing the many unnecessary loops) give a factor of 2 improvement. Plus the 
code is more readable IMHO.

A simple check shows that the expensive part is the regexp stuff. So you could 
optimize the counting trivially in C using weave, but it wouldn't buy you 
much: about 80% of the execution time is spent in the regexp findall.

So it looks like if you need even faster execution, you'll have to go beyond 
python, but writing regexp code by hand is _exactly_ the kind of thing for 
which I'll kick and scream to avoid C. Just have a cup of coffee while it 
finishes :)

Cheers,

f.

ps. Don't do 'cat file | script', that's what 'script < file' is for ;)

#------------------------------------------------------------------
#!/usr/bin/env python

def printout():
    "Formatted output of dictionary"
    for i in range( 1, longWord ):
        print "Length %d = #%d" % ( i, stats[i] )
    print "Length >%d = #%d" % ( longWord, stats[longWord] )

# import libraries and check for help request

import sys, re

if len( sys.argv ) != 1:
    print """
wordStatistic.py 1.0
--------------------
Reads from stdin and prints a word statistic.
Usage: wordStatistic.py < file
"""
    sys.exit( 0 )

defWord = "[A-Za-z]+"
wordExp = re.compile( defWord )

# initialize array, so we don't have to check for key-existance later
longWord = 15
stats = [0]*(longWord+1)

# main loop. iterates over standard input as long as there's
# something coming in...
for word in wordExp.findall( sys.stdin.read() ):
   try:
       stats[len(word)] += 1
   except IndexError:
       stats[-1] += 1

#printout()
#------------------------------------------------------------------