Most Effective Way to Build Up a Histogram of Words?

Alex Martelli aleaxit at yahoo.com
Thu Oct 12 16:31:43 EDT 2000


"Stephen Kloder" <stephenk at cc.gatech.edu> wrote in message
news:39E605AE.495A8406 at cc.gatech.edu...
> June Kim wrote:
>
> >
> > and then how could I sort the dictionary according to the
> > frequency order?
> >
>
> Rather than creating a list-of-tuples data structure, you might want to use
> Python's built-in sorting features:
> words=histogram.keys()
> words.sort(lambda x,y:cmp(histogram[y],histogram[x]))
> for w in words:
>     print "%s : %d"%(w,histogram[w])

If you have a *small* number of words, this is OK.  But if the
words are many, its speed becomes a real problem: the lambda is
called O(N log N) times, a large extra cost compared to letting
sort use Python's built-in comparison directly.
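
To see concretely what "called O(N log N) times" means, here is a
rough sketch -- hypothetical names, Python 2 idioms like the rest of
this post -- that counts how many times an explicit comparison
function actually gets invoked during a sort:

import random

calls = [0]
def counting_cmp(x, y):
    # a comparison function that counts its own invocations
    calls[0] = calls[0] + 1
    return cmp(x, y)

data = range(10000)        # range returns a plain list here
random.shuffle(data)
data.sort(counting_cmp)    # sort with an explicit comparison function
# calls[0] ends up on the order of N*log2(N) -- well over 100000
# calls just to sort 10000 items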

Consider a simple case, far smaller than what the original
poster mentioned as the typical size for the files to be
histogrammed.  Preparing the histogram and checking
the sizes of things...:

>>> filedata=open('c:/fp.htm').read()
>>> len(filedata)
502953
>>> words=filedata.split()
>>> len(words)
17809
>>> hist={}
>>> for word in words:
        hist[word] = 1 + hist.get(word, 0)

>>> len(hist)
6358


Now, define the sorting functions based on the two
different approaches:

>>> def sort1(hist):
    # decorate with (count, word) tuples so the bare sort orders by count
    sorted = [ (count, word) for word, count in hist.items() ]
    sorted.sort()
    return sorted

>>> def sort2(hist):
    # sort the words with an explicit comparison function on the counts
    words = hist.keys()
    words.sort(lambda x, y, h=hist: cmp(h[x], h[y]))
    return words


Finally, measure performance.  The profile module's per-call
overhead is not calibrated away here, so it penalizes the version
that makes more function calls and exaggerates the difference,
but...:

>>> import profile
>>> profile.run("hh=sort1(hist)")
         3 function calls in 0.664 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.360    0.360    0.360    0.360 <pyshell#18>:1(sort1)
        1    0.000    0.000    0.360    0.360 <string>:1(?)
        1    0.304    0.304    0.664    0.664 profile:0(hh=sort1(hist))
        0    0.000             0.000          profile:0(profiler)

>>> profile.run("hh=sort2(hist)")
         60538 function calls in 7.448 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.666    3.666    7.414    7.414 <pyshell#23>:1(sort2)
    60535    3.748    0.000    3.748    0.000 <pyshell#23>:3(<lambda>)
        1    0.029    0.029    7.443    7.443 <string>:1(?)
        1    0.005    0.005    7.448    7.448 profile:0(hh=sort2(hist))
        0    0.000             0.000          profile:0(profiler)

>>>
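
To cross-check without the profiler's per-call overhead, one could
also time the two functions directly; a rough sketch (the helper name
is just for illustration; time.clock was the usual timer at the time,
time.perf_counter being the modern spelling):

import time

def time_once(func, hist):
    # run func(hist) once and return the elapsed time in seconds
    start = time.clock()
    func(hist)
    return time.clock() - start

# e.g. compare time_once(sort1, hist) against time_once(sort2, hist)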


Although in practice the gap is not quite the full order of magnitude
the profiler suggests when sorting a few thousand entries, the
difference is already quite perceptible -- and as the entries grow
in number, the ratio favours the sort *without* an explicit
comparison function more and more.

I think this is called the "Schwartzian Transform" in the Perl
world, but the approach is just as valid in Python: for
performance, when sorting a largish number of entries,
do not use the feature of sort that lets you pass an explicit
comparison function (which will get called O(N log N)
times); rather, prepare a list of tuples (or another linear
structure) suitable for natural (lexicographical) sorting,
sort it "bare", and unravel it again if need be.  Preparation
and unraveling are O(N) tasks, and the part that remains
O(N log N), the actual sort, runs as fast as possible with
this approach.
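
Put concretely, a minimal sketch of that decorate/sort/undecorate
sequence for the histogram built above (descending by count; the
variable names are just for illustration):

decorated = [ (count, word) for word, count in hist.items() ]  # O(N) decorate
decorated.sort()       # bare O(N log N) sort, no comparison function passed
decorated.reverse()    # highest counts first
by_frequency = [ word for count, word in decorated ]           # O(N) undecorate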


Alex






