Most Effective Way to Build Up a Histogram of Words?

Thu Oct 12 12:07:49 EDT 2000

"Steve Holden" <sholden at holdenweb.com> wrote in message
news:39E5D3CF.62C6CE8E at holdenweb.com...
> Simon Brunning wrote:
> >
> > > From: June Kim [SMTP:junaftnoon at nospamplzyahoo.com]
> > > What is the most effective way, in terms of execution speed,
> > > to build up a histogram of words from multiple huge text
> > > files?
> >
> > June,
> > How huge? As a first cut, I'd try something like this (untested) -
> >
> > file = open('yourfile.txt', 'r')
> > filedata = file.read()
> > words=filedata.split()
> > histogram = {}
> > for word in words:
> >         histogram[word] = histogram.get(word, 0) + 1
> > for word in histogram.keys():
> >         print 'Word: %s - count %s' % (word, str(histogram[word]))
> >
> > This should work unless the file is *really* huge, in which case you'll
> > need to read the file in a chunk at a time. But if you can squeeze the
> > file in one gulp, do so.
> >
> > Cheers,
> > Simon Brunning
> > TriSystems Ltd.
> > sbrunning at trisystems.co.uk
> >
> Try the following with some sample files to see whether you have enough
> memory.
>
> I removed a couple of syntax errors, and re-cast it to be 1.5.2 compatible
> (since I don't run 2.0c1 on my laptop yet).  Tested -- seems to work OK.
>
> If you want the most frequent words last, remove the reverse() call.
>
> regards
>  Steve
> ---------------------------------------------------------------------------
> import string
>
> file = open('histo.py', "r")
> filedata = file.read()
> words=string.split(filedata)
> histogram = {}
> for word in words:
>         histogram[word] = histogram.get(word, 0) + 1
> #for word in histogram.keys():
> #        print 'Word: %s - count %s' % (word, str(histogram[word]))
> flist = []
> for word, count in histogram.items():
>     flist.append([count, word])
> flist.sort()
> flist.reverse()
> for pair in flist:
>     print "%30s: %4d" % (pair[1], pair[0])
> --
> Helping people meet their information needs with training and technology.
> 703 967 0887      sholden at bellatlantic.net      http://www.holdenweb.com/
>
>

Thank you for your clear and clean code.
The problem, however, is that I may have to run through several files of a
few MB each, adding up to tens of megabytes in total.
Sorting everything at once therefore sounds somewhat infeasible or
inefficient. Am I trying to make Python a panacea here? (I know it has
no snake oil, though.)

Best Regards,
June
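
For the multi-file case raised above, here is a minimal sketch of one possible approach. It is written for a current Python 3 rather than the 1.5.2/2.0 versions discussed in the thread. It reads each file a line at a time instead of in one gulp, so memory use follows the number of distinct words rather than the total file size, and only the histogram entries are ever sorted. The names word_histogram and demo, and the sample text, are hypothetical and used only for illustration.

```python
import collections
import os
import tempfile


def word_histogram(paths):
    """Count whitespace-separated words across several files."""
    histogram = collections.Counter()
    for path in paths:
        with open(path, "r") as handle:
            # Iterating over the file yields one line at a time,
            # so the whole file is never held in memory at once.
            for line in handle:
                histogram.update(line.split())
    return histogram


def demo():
    """Build a histogram from two small temporary files (made-up data)."""
    samples = ["spam eggs spam\n", "eggs spam ham\n"]
    paths = []
    for text in samples:
        fd, path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        paths.append(path)
    try:
        return word_histogram(paths)
    finally:
        # Clean up the temporary files whether or not counting succeeded.
        for path in paths:
            os.remove(path)


if __name__ == "__main__":
    # most_common() sorts the distinct words by count, most frequent first.
    for word, count in demo().most_common():
        print("%30s: %4d" % (word, count))
```

The sort at the end works on one entry per distinct word, which is normally far smaller than the tens of megabytes of input. If only the top few words are needed, most_common(n) returns just those.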