Optimizing a text statistics function
Peter Otten
__peter__ at web.de
Wed Apr 21 13:05:08 EDT 2004
Nickolay Kolev wrote:
> I am currently writing some simple functions in the process of learning
> Python. I have a task where the program has to read in a text file and
> display some statistics about the tokens in that file.
>
> The text I have been feeding it is Dickens' David Copperfield.
>
> It is really simple - it reads the file in memory, splits it on
> whitespace, strips punctuation characters and transforms all remaining
> elements to lowercase. It then looks through what has been left and
> creates a list of tuples (count, word) which contain each unique word
> and the number of times it appears in the text.
>
> The code (~30 lines and easy to read :-) can be found at
> http://www.uni-bonn.de/~nmkolev/python/textStats.py
>
> I am now looking for a way to make the whole thing run faster. I have
> already made many changes since the initial version, realising many
> mistakes. As I do not think of anything else, I thought I would ask the
> more knowledgeable.
>
> I find the two loops through the initial list a bit troubling. Could
> this be avoided?
>
> Any other remarks and suggestions will also be greatly appreciated.
I'd probably do this with regular expressions, either with r.split() where r
matches the spaces between the words or r.finditer() where r matches the
words.
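A minimal sketch of the finditer variant, written in modern Python (the function name word_freqs and the pattern [a-z]+ are my choices; \w+ would also match digits and underscores):

```python
import re

def word_freqs(text):
    # find the words directly instead of splitting on the gaps between them
    histogram = {}
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        histogram[word] = histogram.get(word, 0) + 1
    return histogram

print(word_freqs("Two.words, two words"))  # {'two': 2, 'words': 2}
```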
But since you were asking for speed, another approach might be faster. First
I normalize all non-word characters to spaces and all alphabetic characters
to lowercase. Note that this may yield different word frequencies: e.g.
"two.words" becomes two words with my approach but stays one word with yours.
from __future__ import division

import string, sys

def main(filename):
    # caveat: assumes one byte per character
    repl = string.whitespace + string.punctuation
    # build a 256-char translation table mapping whitespace and punctuation
    # to spaces; lower() on the table string also lowercases its alpha entries,
    # so one translate() call does both jobs
    tr = string.maketrans(repl, len(repl)*" ").lower()
    assert len(tr) == 256
    words = file(filename).read().translate(tr).split()

    histogram = {}
    wordCount = len(words)
    for word in words:
        histogram[word] = histogram.get(word, 0) + 1

    wordCountList = [(c, w) for w, c in histogram.items()]
    wordCountList.sort()
    wordCountList.reverse()

    print
    for freq, word in wordCountList[:10]:
        print '%15r %5d %5.2f%%' % (word, freq, freq / wordCount * 100)
    print

if __name__ == "__main__":
    main(sys.argv[1])
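For readers on current Python, here is a sketch of the same translate-then-count idea (top_words is my name for it): str.maketrans replaces the old string.maketrans, the table no longer covers all 256 bytes so lower() is applied to the text instead, and collections.Counter replaces the hand-rolled histogram.

```python
import collections
import string

def top_words(text, n=10):
    # map whitespace and punctuation to spaces; unlisted characters pass through
    repl = string.whitespace + string.punctuation
    tr = str.maketrans(repl, len(repl) * " ")
    words = text.lower().translate(tr).split()
    # Counter builds the histogram and most_common() sorts by frequency
    return collections.Counter(words).most_common(n), len(words)

pairs, total = top_words("Two.words, two words")
print(pairs, total)  # [('two', 2), ('words', 2)] 4
```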
Other remarks:
Speed is not that important most of the time.
Psyco offers speed for free.
You might consider switching to 4-space indent.
Peter