Most Effective Way to Build Up a Histogram of Words?

Simon Brunning SBrunning at trisystems.co.uk
Thu Oct 12 09:40:49 EDT 2000


> From:	June Kim [SMTP:junaftnoon at nospamplzyahoo.com]
> What is the most effective way, in terms of execution speed,
> to build up a histogram of words from multiple huge text
> files?
 
June,
How huge? As a first cut, I'd try something like this (untested) -

file = open('yourfile.txt', 'r')
filedata = file.read()
file.close()
words = filedata.split()
histogram = {}
for word in words:
	histogram[word] = histogram.get(word, 0) + 1
for word in histogram.keys():
	print 'Word: %s - count: %d' % (word, histogram[word])

This should work unless the file is *really* huge, in which case you'll need
to read the file a chunk at a time. But if you can read the whole file in
one gulp, do so.
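If you do have to go chunk by chunk, the one wrinkle is that a chunk
boundary can split a word in half, so you need to carry any trailing
partial word over to the next chunk. A rough sketch (the function name,
chunk size, and whitespace-only word splitting are my assumptions, not
anything June specified):

```python
def histogram_chunked(path, chunksize=64 * 1024):
    # Build a word histogram without holding the whole file in memory.
    histogram = {}
    f = open(path, 'r')
    leftover = ''
    while 1:
        chunk = f.read(chunksize)
        if not chunk:
            break
        chunk = leftover + chunk
        words = chunk.split()
        # If the chunk doesn't end on whitespace, the last "word" may be
        # incomplete - hold it back and prepend it to the next chunk.
        if chunk[-1:].isspace() or not words:
            leftover = ''
        else:
            leftover = words.pop()
        for word in words:
            histogram[word] = histogram.get(word, 0) + 1
    if leftover:
        histogram[leftover] = histogram.get(leftover, 0) + 1
    f.close()
    return histogram
```

The carry-over handling is the part that's easy to get wrong; everything
else is the same dict.get() counting as above.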

Cheers,
Simon Brunning
TriSystems Ltd.
sbrunning at trisystems.co.uk








