Most Effective Way to Build Up a Histogram of Words?

Fri Oct 13 08:48:49 EDT 2000

"Quinn Dunkan" <quinn at mark.ugcs.caltech.edu> wrote in message
news:slrn8uc5v3.k2j.quinn at mark.ugcs.caltech.edu...
    [snip]
> >> for word in words:
> >>     histogram[word] = histogram.get(word, 0) + 1
    [snip]
> Also, if you're using python 2.0, you could use a slightly more efficient
>
> historgram.setdefault(word, 1)
>
> instead of:
>
> histogram[word] = histogram.get(word, 0) + 1

Not really: if you used setdefault instead of the above-quoted
line in the for loop's body, you would not be counting words,
just marking with a '1' those which are present.  You do need
the '+ 1', in some form, to do the actual *counting*.

The new .setdefault is useful mostly when what you want to put
in correspondence with an entry is some *mutable* object.  When
the object is an integer (or otherwise immutable), you need a
    dictionary[key] = newvalue
anyway, so the .setdefault has no real advantage over a good
old .get -- in fact, it can be very-fractionally-slower.  When
the object is mutable, and your fondest desire is to call:
    dictionary[key].somemutation()
but you need to ensure that, if key was not already present,
its entry IS initialized to a fresh new object before you
start mutating it, _then_ the new method is useful...:
    dictionary.setdefault(key,NewObject()).somemutation()

Actually, even in this case, you have to take a little care
if you do care about performance... note that a NewObject()
is called *anyway* in this idiom -- even when .setdefault
is going to ignore the result -- since Python is 'strict',
aka 'eager', in its evaluations (first the arguments are
fully evaluated, then the function is called; almost every
other language behaves this way, too, except for a few [not
very widely used] languages such as Haskell, which have
by default 'lazy', aka 'normal order', evaluation).  So,
you have to look carefully at the cost of evaluation of
that new-object constructors... if it's high enough, then
the new idiom may not be affordable, particularly if the
case in which the key is already present is very frequent
(i.e., 'new' keys, not-already-present, are rare).  This
suggests the exception-idiom over the checking-idiom:
    try:
        dictionary[key].somemutation()
    except KeyError:
        dictionary[key] = NewObject()
        dictionary[key].somemutation()
rather than the more compact:
    if not dictionary.has_key(key):
        dictionary[key] = NewObject()
    dictionary[key].somemutation()
but either might be better, performance-wise, than the
use of .setdefault in this case.

The advantage of .setdefault is big when the possibly
unused new-constructor is very cheap...:
    dic.setdefault(key,[]).append(something)
and this specific usage is probably going to be the
dominant case of setdefault, since dictionaries of
lists are such a natural and widespread datastructure,
and the 'very cheap constructor' applies forcefully!-)

Alex