Optimizing a text statistics function
Peter Otten
__peter__ at web.de
Wed Apr 21 13:05:08 EDT 2004
Nickolay Kolev wrote:
> I am currently writing some simple functions in the process of learning
> Python. I have a task where the program has to read in a text file and
> display some statistics about the tokens in that file.
>
> The text I have been feeding it is Dickens' David Copperfield.
>
> It is really simple - it reads the file in memory, splits it on
> whitespace, strips punctuation characters and transforms all remaining
> elements to lowercase. It then looks through what has been left and
> creates a list of tuples (count, word) which contain each unique word
> and the number of times it appears in the text.
>
> The code (~30 lines and easy to read :-) can be found at
> http://www.uni-bonn.de/~nmkolev/python/textStats.py
>
> I am now looking for a way to make the whole thing run faster. I have
> already made many changes since the initial version, realising many
> mistakes. As I do not think of anything else, I thought I would ask the
> more knowledgeable.
>
> I find the two loops through the initial list a bit troubling. Could
> this be avoided?
>
> Any other remarks and suggestions will also be greatly appreciated.
I'd probably do this with regular expressions, either with r.split() where r
matches the spaces between the words or r.finditer() where r matches the
words.
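A minimal sketch of the finditer variant, written in modern Python (the function name word_freqs and the pattern [a-z]+ are my choices; \w+ would also match digits and underscores):

```python
import re

def word_freqs(text):
    # find the words directly instead of splitting on the gaps between them
    histogram = {}
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        histogram[word] = histogram.get(word, 0) + 1
    return histogram

print(word_freqs("Two.words, two words"))  # {'two': 2, 'words': 2}
```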
But since you were asking for speed, another approach might be faster. First
I normalize all non-word characters to spaces and all alphabetic characters
to lowercase. Note that this may yield different word frequencies: e.g.
"two.words" becomes two words with my approach but stays one word with yours.
from __future__ import division

import string, sys

def main(filename):
    # caveat: assumes one byte per character
    repl = string.whitespace + string.punctuation
    # build a 256-char translation table mapping whitespace and punctuation
    # to spaces; lower() on the table string also lowercases its alpha entries,
    # so one translate() call does both jobs
    tr = string.maketrans(repl, len(repl)*" ").lower()
    assert len(tr) == 256
    words = file(filename).read().translate(tr).split()

    histogram = {}
    wordCount = len(words)
    for word in words:
        histogram[word] = histogram.get(word, 0) + 1

    wordCountList = [(c, w) for w, c in histogram.items()]
    wordCountList.sort()
    wordCountList.reverse()

    print
    for freq, word in wordCountList[:10]:
        print '%15r %5d %5.2f%%' % (word, freq, freq / wordCount * 100)
    print

if __name__ == "__main__":
    main(sys.argv[1])
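For readers on current Python, here is a sketch of the same translate-then-count idea (top_words is my name for it): str.maketrans replaces the old string.maketrans, the table no longer covers all 256 bytes so lower() is applied to the text instead, and collections.Counter replaces the hand-rolled histogram.

```python
import collections
import string

def top_words(text, n=10):
    # map whitespace and punctuation to spaces; unlisted characters pass through
    repl = string.whitespace + string.punctuation
    tr = str.maketrans(repl, len(repl) * " ")
    words = text.lower().translate(tr).split()
    # Counter builds the histogram and most_common() sorts by frequency
    return collections.Counter(words).most_common(n), len(words)

pairs, total = top_words("Two.words, two words")
print(pairs, total)  # [('two', 2), ('words', 2)] 4
```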
Other remarks:
Speed is not that important most of the time.
Psyco offers speed for free.
You might consider switching to 4-space indent.
Peter