Optimizing a text statistics function

Wed Apr 21 11:34:21 EDT 2004

On Wed, Apr 21, 2004 at 04:51:56PM +0200, Nickolay Kolev wrote:
> It is really simple - it reads the file in memory, splits it on 
> whitespace, strips punctuation characters and transforms all remaining 
> elements to lowercase. It then looks through what has been left and 
> creates a list of tuples (count, word) which contain each unique word 
> and the number of times it appears in the text.
> 
> The code (~30 lines and easy to read :-) can be found at 
> http://www.uni-bonn.de/~nmkolev/python/textStats.py
> 
> I am now looking for a way to make the whole thing run faster.

Do you actually need it to be faster?  If the answer is "no, but
it would be nice", then you are already done *wink*.

A good profiling strategy is to wrap each part in a function
so you can see which lines consume the most CPU.  Just make sure
to wrap big pieces so the function call overhead doesn't get
added ten thousand times and distort the picture.
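Here is a rough sketch of what that could look like.  I haven't seen how
your script is actually laid out, so the function names (split_words,
normalize, count_words, text_stats, profile_report) are all made up, and
the parsing only approximates what you described.  The point is just that
each big step is its own function, so the profiler reports it separately.

```python
import cProfile
import io
import pstats
import string
from collections import Counter


def split_words(text):
    # Step 1: split the whole text on whitespace.
    return text.split()


def normalize(words):
    # Step 2: strip surrounding punctuation, lowercase, drop empty leftovers.
    stripped = (w.strip(string.punctuation).lower() for w in words)
    return [w for w in stripped if w]


def count_words(words):
    # Step 3: build (count, word) tuples, most frequent first.
    return sorted(((n, w) for w, n in Counter(words).items()), reverse=True)


def text_stats(text):
    # Hypothetical top-level function chaining the three big steps.
    return count_words(normalize(split_words(text)))


def profile_report(text):
    # Run text_stats under the profiler; return its result and the report text.
    profiler = cProfile.Profile()
    result = profiler.runcall(text_stats, text)
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats()
    return result, out.getvalue()


if __name__ == '__main__':
    sample = "It was the best of times, it was the worst of times. " * 5000
    result, report = profile_report(sample)
    print(report)
```

Each step is called once per run, so the call overhead is negligible and
the cumulative-time column tells you which step to look at first.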

You will get a bunch of suggestions on how to make the code faster,
so I'll skip those.  What you want to do is only do the expensive
parsing once and not every time you run your program.  Try pickle.
[untested code follows]

import pickle

def proc(txt_filename): # txt_filename like 'dickens.txt'
  ... existing code ...
  return wordCountList

def procWrap(txt_filename):
  cache_filename = txt_filename.replace('.txt', '.pickle')
  try:
    # pickles are binary data, so open the cache in binary mode
    with open(cache_filename, 'rb') as fob:
      wordCountList = pickle.load(fob)
  except IOError:
    # no cache yet: do the expensive parse once and save the result
    wordCountList = proc(txt_filename)
    with open(cache_filename, 'wb') as fob:
      pickle.dump(wordCountList, fob, -1) # -1 picks the highest pickle protocol
  return wordCountList

Use procWrap() instead of proc() to get the list.  You'll need
to delete the .pickle file every time you change proc() (or the
text file itself) so the pickle gets refreshed.  This way you never
have to care about how efficient the parsing loop is, because it
only has to run once per text file.
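If deleting the cache by hand gets tedious, one option is to compare file
modification times and rebuild when the text is newer than the pickle.
This is only a sketch: cache_is_fresh() is a helper I made up, and proc()
here is a stand-in for your real parsing code.  It only notices changes to
the text file; if you change proc() itself you still have to delete the
.pickle file.

```python
import os
import pickle
from collections import Counter


def proc(txt_filename):
    # Hypothetical stand-in for the expensive parsing step.
    with open(txt_filename) as f:
        words = f.read().lower().split()
    return sorted((n, w) for w, n in Counter(words).items())


def cache_is_fresh(txt_filename, cache_filename):
    # True only if the cache exists and is at least as new as the text file.
    try:
        return os.path.getmtime(cache_filename) >= os.path.getmtime(txt_filename)
    except OSError:
        return False


def procWrap(txt_filename):
    cache_filename = os.path.splitext(txt_filename)[0] + '.pickle'
    if cache_is_fresh(txt_filename, cache_filename):
        with open(cache_filename, 'rb') as fob:
            return pickle.load(fob)
    # Cache missing or stale: parse once and store the result.
    wordCountList = proc(txt_filename)
    with open(cache_filename, 'wb') as fob:
        pickle.dump(wordCountList, fob, -1)
    return wordCountList
```

Callers still just use procWrap('dickens.txt'); the freshness check costs
two stat calls, which is nothing next to reparsing the text.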

-jackdied