[Spambayes] proposed changes to hammie & co.

Fri Nov 22 18:43:40 2002

So then, Rob Hooft <rob@hooft.net> is all like:

> Is this calculation for the few words in one message really
> time-determining? There is another way of caching: Make a dictionary
> that maps count-tuples to spam probabilities.
> 
>   (1,0) -> 0.155
>   (0,1) -> 0.844
> etc.

Hmm!  I did a small test against 200 spam, 200 ham, to see what tuple
frequency is like.  I got 21833 unique words, but only 869 unique values
for (spamcount, hamcount).  I also got gnuplot to animate out a cool
spinning 3D graph of it just as my boss walked by :)

The 20 most frequently-occurring (spamcount, hamcount) tuples were:

  (15, 0)    57
  (18, 0)    57
  (19, 0)    62
  (10, 5)    65
  (0, 20)    79
  (4, 10)    98
  (5, 10)    99
  (9, 5)    113
  (14, 0)   137
  (0, 15)   153
  (13, 0)   162
  (8, 0)    288
  (4, 5)    303
  (10, 0)   317
  (5, 5)    334
  (9, 0)    611
  (0, 10)   659
  (4, 0)   4814
  (5, 0)   4979
  (0, 5)   6045

The 20 most infrequently-occurring were:

  (0, 130)   1
  (0, 135)   1
  (0, 140)   1
  (0, 155)   1
  (0, 165)   1
  (0, 175)   1
  (0, 250)   1
  (0, 285)   1
  (0, 310)   1
  (0, 725)   1
  (0, 75)    1
  (10, 30)   1
  (10, 40)   1
  (10, 85)   1
  (100, 40)  1
  (101, 115) 1
  (101, 20)  1
  (101, 25)  1
  (102, 115) 1
  (102, 20)  1
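
For anyone who wants to run the same kind of tally on their own data,
here is a rough sketch in present-day Python.  It is not the script I
actually used; `wordinfo` and `tuple_frequencies` are made-up names, and
the word list is toy data rather than the 200/200 corpus above.

```python
from collections import Counter


def tuple_frequencies(wordinfo):
    """Count how many words share each (spamcount, hamcount) pair.

    `wordinfo` is a hypothetical dict mapping word -> (spamcount, hamcount).
    """
    return Counter(wordinfo.values())


# Toy data for illustration only.
wordinfo = {
    "viagra": (5, 0),
    "free": (5, 0),
    "python": (0, 5),
    "meeting": (0, 5),
    "list": (0, 5),
    "click": (4, 5),
}

freqs = tuple_frequencies(wordinfo)
unique_words = len(wordinfo)   # number of distinct words
unique_tuples = len(freqs)     # number of distinct count pairs

# Most common pairs first; reverse the list for the rarest ones.
for pair, nwords in freqs.most_common():
    print(pair, nwords)
```

The gap between `unique_words` and `unique_tuples` is what makes a
tuple-keyed cache attractive.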

A graph of the frequencies looks a lot like a hyperbola:
<http://woozle.org/~neale/tmp/b.png>

The more I think about this caching scheme, the more I like it.  It
deals well with the fact that most of the words only occur a few times,
saves memory, and it will speed up pickles *and* databases.  It's going
into the playground branch.
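
Roughly, the cache could look like the sketch below.  This is not the
code going into the branch: `ProbCache` is a made-up name, and I'm
assuming the usual Robinson-style adjustment with an unknown-word
strength of 0.45 and an unknown-word probability of 0.5.  Note that the
result also depends on the total message counts, so the cache would
have to be thrown away whenever those change.

```python
class ProbCache:
    """Hypothetical cache from (spamcount, hamcount) to spam probability."""

    def __init__(self, nspam, nham, s=0.45, x=0.5):
        # Totals of trained messages; fall back to 1 to avoid dividing by 0.
        self.nspam = nspam or 1
        self.nham = nham or 1
        self.s = s    # assumed strength given to the prior
        self.x = x    # assumed probability for a never-seen word
        self._cache = {}

    def _compute(self, spamcount, hamcount):
        n = spamcount + hamcount
        if n == 0:
            return self.x
        spamratio = spamcount / self.nspam
        hamratio = hamcount / self.nham
        p = spamratio / (spamratio + hamratio)
        # Pull p toward x; the pull weakens as n grows.
        return (self.s * self.x + n * p) / (self.s + n)

    def prob(self, spamcount, hamcount):
        key = (spamcount, hamcount)
        try:
            return self._cache[key]
        except KeyError:
            # First time this pair is seen: compute once and remember it.
            p = self._cache[key] = self._compute(spamcount, hamcount)
            return p


cache = ProbCache(nspam=200, nham=200)
once_in_spam = cache.prob(1, 0)
once_in_ham = cache.prob(0, 1)
```

With equal totals this gives about 0.845 and 0.155 for a word seen once
in only one class, close to the pair quoted above.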

> I definitely wouldn't move the calculation into the wordinfo class. It
> is a different task, so it "should" (design) be a separate class....

Using this scheme, the calculation has to go back into the Bayes (or
Classifier) class.  WordInfo only stores counters now.
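
To make the split concrete, here is a hedged sketch of the shape I have
in mind, not the actual classes.  The method names and the
`prob_from_counts` hook are invented for illustration; the hook stands
in for whatever scoring formula the classifier really uses.

```python
class WordInfo:
    """Per-word record that holds only the two counters."""

    __slots__ = ("spamcount", "hamcount")

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount


class Classifier:
    """Owns the word records and the probability calculation."""

    def __init__(self, prob_from_counts):
        # Hypothetical hook with the signature
        # prob_from_counts(spamcount, hamcount, nspam, nham) -> float.
        self.prob_from_counts = prob_from_counts
        self.wordinfo = {}
        self.nspam = 0
        self.nham = 0

    def learn(self, words, is_spam):
        """Bump message totals and per-word counters for one message."""
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):   # count each word once per message
            info = self.wordinfo.setdefault(word, WordInfo())
            if is_spam:
                info.spamcount += 1
            else:
                info.hamcount += 1

    def spamprob(self, word):
        """Compute the probability here, from counters only."""
        info = self.wordinfo.get(word, WordInfo())
        return self.prob_from_counts(
            info.spamcount, info.hamcount, self.nspam, self.nham)


def naive_ratio(spamcount, hamcount, nspam, nham):
    """Placeholder scoring function for the demo, not a real formula."""
    total = spamcount + hamcount
    return spamcount / total if total else 0.5


c = Classifier(naive_ratio)
c.learn(["buy", "now"], is_spam=True)
c.learn(["hello", "now"], is_spam=False)
```

Since a probability is then a pure function of the counters and the
totals, it can be cached per count pair instead of per word.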

Neale