Most Effective Way to Build Up a Histogram of Words?

June Kim junaftnoon at nospamplzyahoo.com
Thu Oct 12 10:20:43 EDT 2000


Thank you, Mr. Brunning.

"Simon Brunning" <SBrunning at trisystems.co.uk> wrote in message
news:mailman.971358210.26293.python-list at python.org...
> > From: June Kim [SMTP:junaftnoon at nospamplzyahoo.com]
> > What is the most effective way, in terms of execution speed,
> > to build up a histogram of words from several huge text
> > files?
>
> June,
> How huge? As a first cut, I'd try something like this (untested) -

The files are a few MB each.

>
> file = open('yourfile.txt', 'r')
> filedata = file.read()            # slurp the whole file in one gulp
> words = filedata.split()
> histogram = {}
> for word in words:
>     histogram[word] = histogram.get(word, 0) + 1
> for word in histogram.keys():
>     print 'Word: %s - count %d' % (word, histogram[word])
>
> This should work unless the file is *really* huge, in which case you'll
> need to read the file in a chunk at a time. But if you can squeeze the
> file in one gulp, do so.
>
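
That makes sense. For the really-huge case, I suppose something like this
chunked variant would do? (Untested, and the 64K chunk size is arbitrary;
the fiddly bit seems to be carrying a partially read word across chunk
boundaries.)

file = open('yourfile.txt', 'r')
histogram = {}
leftover = ''                         # partial word carried between chunks
while 1:
    chunk = file.read(65536)          # read 64K at a time
    if not chunk:
        break
    words = (leftover + chunk).split()
    if words and chunk[-1:].strip():  # chunk ended mid-word; keep the tail
        leftover = words.pop()
    else:
        leftover = ''
    for word in words:
        histogram[word] = histogram.get(word, 0) + 1
if leftover:                          # don't lose the final word
    histogram[leftover] = histogram.get(leftover, 0) + 1
file.close()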

And how could I then sort the dictionary by frequency?
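
My guess is to turn the dictionary into a list of (count, word) pairs and
sort that, since a dictionary itself has no order. Something like this
(untested):

items = []
for word, count in histogram.items():
    items.append((count, word))       # count first, so sort() compares counts
items.sort()
items.reverse()                       # highest frequency first
for count, word in items:
    print 'Word: %s - count %d' % (word, count)

Would that be the idiomatic way?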

> Cheers,
> Simon Brunning
> TriSystems Ltd.
> sbrunning at trisystems.co.uk
>

Best Regards,
June




