Suggest more finesse, please. I/O and sequences.

Fri Mar 25 18:30:23 EST 2005

Qertoip wrote:
> On Fri, 25 Mar 2005 12:51:59 -0800, Scott David Daniels wrote:
>> ...
>>         for word in line.split():
>>             try:
>>                 corpus[word] += 1
>>             except KeyError:
>>                 corpus[word] = 1
> 
> The above is (probably) not efficient when an exception is thrown, which is
> most of the time (for any new word). However, I've just read about the following:
> corpus[word] = corpus.setdefault( word, 0 ) + 1

The .setdefault form is better for things like:
     corpus.setdefault(word, []).append(...)

You might prefer:

     corpus[word] = corpus.get(word, 0) + 1
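
As a rough sketch (not part of the original exchange), the three counting
forms discussed above can be put side by side, along with the kind of job
where .setdefault fits well.  The function names and the sample sentence
are made up for illustration.

```python
def count_try(words):
    # try/except form: the exception fires only the first time a word is seen.
    corpus = {}
    for word in words:
        try:
            corpus[word] += 1
        except KeyError:
            corpus[word] = 1
    return corpus


def count_get(words):
    # .get form: one lookup with a default, then one store.
    corpus = {}
    for word in words:
        corpus[word] = corpus.get(word, 0) + 1
    return corpus


def count_setdefault(words):
    # .setdefault form: stores 0 for a new word, then stores the sum again.
    corpus = {}
    for word in words:
        corpus[word] = corpus.setdefault(word, 0) + 1
    return corpus


def positions(words):
    # Where .setdefault fits well: the default is a mutable container
    # that is modified in place, so no second store is needed.
    index = {}
    for i, word in enumerate(words):
        index.setdefault(word, []).append(i)
    return index


words = "the cat saw the dog and the bird".split()
assert count_try(words) == count_get(words) == count_setdefault(words)
```

All three counters build the same dictionary; they differ only in how a
missing key is handled.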

The trade-off depends on the size of your test material, so you need
to time it with your own mix of words.  I was thinking of cranking
through a huge body of text (so new words are by far the minority
case).  If you run through Shakespeare's first folio and just do the
counting part, the try-except and .get cases are indistinguishable
(2.0 sec for each), and the .setdefault version drags in at a slower
2.2 sec.  Going through just Anna Karenina, the times are again .83,
.83 and .91 sec.  So the .setdefault form is about 10% slower.
For great test cases (and for your own personal edification),
visit Project Gutenberg.

Beware when you do timing: whether the file is "warm" (already in the
operating system's cache from an earlier read) or not can make a huge
difference.  Read through it once before timing any of the versions.
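
One possible way to set up such a timing run is sketched below.  It is
only an illustration, not the script that produced the numbers above:
the helper time_counters, its repeat parameter, and the choice of
timeit are all assumptions.  It reads the file once as a warm-up and
then times each counting form over the same file.

```python
import timeit


def count_try(lines):
    corpus = {}
    for line in lines:
        for word in line.split():
            try:
                corpus[word] += 1
            except KeyError:
                corpus[word] = 1
    return corpus


def count_get(lines):
    corpus = {}
    for line in lines:
        for word in line.split():
            corpus[word] = corpus.get(word, 0) + 1
    return corpus


def count_setdefault(lines):
    corpus = {}
    for line in lines:
        for word in line.split():
            corpus[word] = corpus.setdefault(word, 0) + 1
    return corpus


def time_counters(path, repeat=3):
    """Return the best wall-clock time in seconds for each counting form."""
    # Warm-up pass: read the file once so every timed run sees a cached file.
    with open(path, encoding="utf-8") as f:
        for _ in f:
            pass

    def run(counter):
        with open(path, encoding="utf-8") as f:
            return counter(f)

    timings = {}
    for counter in (count_try, count_get, count_setdefault):
        # Best of `repeat` single runs; the default argument pins `counter`.
        runs = timeit.repeat(lambda c=counter: run(c), number=1, repeat=repeat)
        timings[counter.__name__] = min(runs)
    return timings
```

Calling time_counters on a locally saved Project Gutenberg text (the
file name is up to you) returns a dictionary of seconds keyed by
function name.  Taking the minimum of several runs is one common way
to reduce noise from other activity on the machine.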

--Scott David Daniels
Scott.Daniels at Acm.Org