Optimizing a text statistics function

Nickolay Kolev nmkolev at uni-bonn.de
Wed Apr 21 10:51:56 EDT 2004


Hi all,

I am currently writing some simple functions in the process of learning 
Python. I have a task where the program has to read in a text file and 
display some statistics about the tokens in that file.

The text I have been feeding it is Dickens' David Copperfield.

It is really simple - it reads the file into memory, splits it on 
whitespace, strips punctuation characters and transforms all remaining 
elements to lowercase. It then looks through what is left and 
creates a list of (count, word) tuples, which contain each unique word 
and the number of times it appears in the text.
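I have not reproduced the exact code from the URL below, but the 
preprocessing steps described above look roughly like this (function and 
variable names are my own):

```python
import string

def tokenize(text):
    # Split on whitespace, strip surrounding punctuation, and lowercase --
    # the preprocessing steps described above.
    words = []
    for raw in text.split():
        w = raw.strip(string.punctuation).lower()
        if w:  # skip tokens that were punctuation only
            words.append(w)
    return words
```

str.strip() with string.punctuation only removes punctuation from the ends 
of each token, so hyphenated or apostrophized words ("don't", "well-known") 
survive intact.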

The code (~30 lines and easy to read :-) can be found at 
http://www.uni-bonn.de/~nmkolev/python/textStats.py

I am now looking for a way to make the whole thing run faster. I have 
already made many changes since the initial version, correcting many 
mistakes along the way. As I cannot think of anything else to try, I 
thought I would ask the more knowledgeable.

I find the two loops through the initial list a bit troubling. Could 
this be avoided?
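Assuming the two loops first collect the unique words and then count each 
one (e.g. with list.count(), which rescans the whole list per word), a 
single pass with a dictionary avoids both - each token is looked up and 
incremented exactly once. A sketch of that idea (names are my own):

```python
def count_words(words):
    # One pass: increment a per-word counter for each token, instead of
    # building a unique-word list and scanning the full list for each word.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Build the (count, word) tuples, most frequent first.
    return sorted(((c, w) for w, c in counts.items()), reverse=True)
```

For a text the size of David Copperfield this turns a quadratic counting 
step into a linear one, which should dominate any other micro-optimization.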

Any other remarks and suggestions will also be greatly appreciated.

Many thanks in advance,
Nicky
