Optimizing a text statistics function

Peter Otten __peter__ at web.de
Wed Apr 21 18:34:06 EDT 2004


Scott David Daniels wrote:

> Peter Otten wrote:
>> Nickolay Kolev wrote:
> Playing along, simply because it's fun.
> 
>> def main(filename):
>>     ...
> 
> #>     words = file(filename).read().translate(tr).split()
> #>     histogram = {}
> #>     wordCount = len(words)
> #>     for word in words:
> #>         histogram[word] = histogram.get(word, 0) + 1
> 
> Better not to do several huge string allocs above (I suspect).
> This method lets you work on files too large to read into memory:
> 
>        wordCount = 0
>        histogram = {}
> 
>        for line in file(filename):
>            words = line.translate(tr).split()
>            wordCount += len(words)
>            for word in words:
>                histogram[word] = histogram.get(word, 0) + 1
> 
>>     ...
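
[Editor's note: a Python 3 sketch of the streaming approach quoted above, for readers following along today. The original uses Python 2 idioms (`file()`, `dict.get`); this version uses `open()` and `collections.Counter` instead, and the punctuation table is an assumption standing in for the thread's `tr`.]

```python
from collections import Counter

def word_histogram(filename):
    # Hypothetical stand-in for the thread's `tr`: map common
    # punctuation to spaces so split() separates words cleanly.
    tr = str.maketrans(dict.fromkeys(".,;:!?\"'()[]", " "))
    word_count = 0
    histogram = Counter()
    with open(filename) as f:
        for line in f:  # only one line held in memory at a time
            words = line.translate(tr).split()
            word_count += len(words)
            histogram.update(words)  # same as histogram[w] += 1 per word
    return word_count, histogram
```

As in the quoted code, memory use is bounded by the longest line plus the histogram itself, not by the file size.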

In theory you are right; in practice, most text files are tiny compared to
the amount of RAM available on a fairly recent machine.

However, I readily admit that your variant plays nicely with *very* large
files as long as they have newlines, and, contrary to my expectation, it is
even a bit faster on my ad-hoc test case, a 670,000-byte, 8,700-line HTML
file.
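
[Editor's note: the timing claim above can be checked with a rough harness like the following. This is a hypothetical sketch in Python 3, not the benchmark Peter ran: it omits the `translate()` step, generates a synthetic file, and results will vary with file size and OS caching.]

```python
import time
from collections import Counter

def count_whole_file(path):
    # Read everything at once, then split (the original approach).
    with open(path) as f:
        return Counter(f.read().split())

def count_by_line(path):
    # Stream line by line (the quoted approach).
    hist = Counter()
    with open(path) as f:
        for line in f:
            hist.update(line.split())
    return hist

def timed(fn, path):
    # Return the result and elapsed wall-clock time in seconds.
    start = time.perf_counter()
    result = fn(path)
    return result, time.perf_counter() - start
```

Both functions produce identical histograms; only their memory profiles and timings differ, which is exactly the trade-off discussed above.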

Peter
