Most Effective Way to Build Up a Histogram of Words?

Alex Martelli aleaxit at yahoo.com
Thu Oct 12 12:17:46 EDT 2000


"June Kim" <junaftnoon at nospamplzyahoo.com> wrote in message
news:8s4hc3$a5a$1 at news.nuri.net...
> Thank you Mr. Brunning.
>
> "Simon Brunning" <SBrunning at trisystems.co.uk> wrote in message
> news:mailman.971358210.26293.python-list at python.org...
> > > From: June Kim [SMTP:junaftnoon at nospamplzyahoo.com]
> > > What is the most effective way, in terms of the execution speed,
> > > to build up a histogram of words from a multiple of huge text
> > > files?
> >
> > June,
> > How huge? As a first cut, I'd try something like this (untested) -
>
> The files are of a few MBs.

Reading and splitting a file of a few MB should not be a problem
with the amount of RAM in a typical PC of today!-).

So, the following suggestion should be fine:

> > file = open('yourfile.txt', 'r')
> > filedata = file.read()
> > words = filedata.split()
> > histogram = {}
> > for word in words:
> >     # get() supplies 0 for words not seen yet
> >     histogram[word] = histogram.get(word, 0) + 1
> > for word in histogram.keys():
> >     print 'Word: %s - count %d' % (word, histogram[word])
> >
> > This should work unless the file is *really* huge, in which case
> > you'll need to read the file in a chunk at a time. But if you can
> > squeeze the file in one gulp, do so.

Good advice indeed!
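
If a file ever is too big to slurp whole, the same dictionary-counting
approach works one line at a time; here's a minimal sketch (the filename
is just a placeholder):

histogram = {}
f = open('huge.txt', 'r')            # placeholder filename
while 1:
    line = f.readline()              # one line at a time
    if not line:                     # empty string means end of file
        break
    for word in line.split():
        histogram[word] = histogram.get(word, 0) + 1
f.close()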

> and then how could I sort the dictionary according to the
> frequency order?

Assuming for example you want biggest-first...:

# negate each count so a plain ascending sort puts the biggest counts first
sorted = [ (-count, word) for word, count in histogram.items() ]
sorted.sort()

for negcount, word in sorted:
    print "%6d: %s" % (-negcount, word)


Alex





