Most Effective Way to Build Up a Histogram of Words?

Thu Oct 12 09:22:16 EDT 2000

June Kim wrote:
> 
> What is the most effective way, in terms of execution speed,
> to build up a histogram of words from multiple huge text
> files?
> 
Ignoring the obvious consideration that it depends what you mean
by a *word*, I assume you know how you can split the text lines in
an appropriate way.

If not, then this will doubtless become a long and interesting thread.

> (NOTE: By HISTOGRAM I meant a list of all words occurring in the texts,
> with their frequencies)
> 
Again, I assume you are more concerned with frequency counts than
a pretty graphic (which again can be done relatively easily in Python).

> Can anyone give me an insight?

The key would be to use a dictionary to hold the frequency counts.
If you have a candidate word in variable w, and have initialised the
dict variable with:

	dict = {}

the key statement would be

	dict[w] = dict.get(w,0)+1

which adds one to the current entry for the word, unless this is the
first occurrence, in which case it adds one to zero and stores that in
the dictionary.
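
For what it's worth, the counting idiom above might be wrapped up roughly as follows. This is only a sketch: the function name count_words and the variable name counts are my own inventions (counts is used rather than dict so the built-in type isn't shadowed), and it assumes the words have already been split out by whatever rule suits you.

```python
def count_words(words):
    """Return a dictionary mapping each word to how often it occurs."""
    counts = {}
    for w in words:
        # get() returns 0 the first time a word is seen, so no special
        # case is needed for new entries.
        counts[w] = counts.get(w, 0) + 1
    return counts


# Hypothetical sample input, split on whitespace purely for illustration.
histogram = count_words("the cat and the hat and the bat".split())
```

Here histogram["the"] comes out as 3 and histogram["cat"] as 1.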

As far as reading lots of files goes, take a look at the fileinput
module, which presents several input files as a single stream of
lines, so you don't have to open and close each one yourself.
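
Putting fileinput together with the dictionary idiom could look something like the sketch below. It assumes a reasonably recent Python 3, and that a "word" is simply whatever line.split() yields between runs of whitespace, which is almost certainly too crude for real text. The name histogram_from_files is made up for the example.

```python
import fileinput


def histogram_from_files(paths):
    """Count whitespace-separated words across all the named files."""
    counts = {}
    # fileinput chains the files together and yields their lines one at
    # a time, so no file has to be held in memory whole.
    with fileinput.input(files=paths) as lines:
        for line in lines:
            for w in line.split():
                counts[w] = counts.get(w, 0) + 1
    return counts
```

One thing to watch: if paths is empty, fileinput falls back to the names in sys.argv[1:], and to standard input if there are none, so pass it a non-empty list of file names.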

Does this help?

hoping-I'm-not-completely-off-track-ly y'rs  -  Steve
--
Helping people meet their information needs with training and technology.
703 967 0887      sholden at bellatlantic.net      http://www.holdenweb.com/