Optimizing a text statistics function
Peter Otten
__peter__ at web.de
Wed Apr 21 18:34:06 EDT 2004
Scott David Daniels wrote:
> Peter Otten wrote:
>> Nickolay Kolev wrote:
> Playing along, simply because it's fun.
>
>> def main(filename):
>> ...
>
> #>     words = file(filename).read().translate(tr).split()
> #>     histogram = {}
> #>     wordCount = len(words)
> #>     for word in words:
> #>         histogram[word] = histogram.get(word, 0) + 1
>
> Better not to do several huge string allocs above (I suspect).
> This method lets you to work on files too large to read into memory:
>
>     wordCount = 0
>     histogram = {}
>
>     for line in file(filename):
>         words = line.translate(tr).split()
>         wordCount += len(words)
>         for word in words:
>             histogram[word] = histogram.get(word, 0) + 1
>
>> ...
In theory you are right; in practice, most text files are tiny compared to
the amount of RAM available on a fairly recent machine.
However, I readily admit that your variant plays nicely with *very* large
files as long as they have newlines, and, contrary to my expectation, it is
even a bit faster on my ad-hoc test case, a 670,000 byte, 8,700 line HTML
file.
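For reference, here is a minimal sketch of the line-by-line variant in current Python. Note that `file()` is now `open()`, `str.translate` takes a mapping built with `str.maketrans` rather than a 256-byte table, and `collections.Counter` replaces the `dict.get` idiom; the punctuation set in the table is an illustrative assumption, not the one used in the original thread:

```python
import collections

def word_histogram(filename):
    # Map punctuation to spaces so e.g. "foo,bar" splits into two words.
    tr = str.maketrans(".,;:!?\"'()[]", " " * 12)
    word_count = 0
    histogram = collections.Counter()
    # Iterating the file object yields one line at a time, so memory use
    # stays bounded regardless of file size (as long as lines are short).
    with open(filename) as f:
        for line in f:
            words = line.translate(tr).split()
            word_count += len(words)
            histogram.update(words)
    return word_count, histogram
```

Counter.update with an iterable of words does the same per-word tallying as the explicit `histogram.get(word, 0) + 1` loop, just in C.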
Peter