[Spambayes] proposed changes to hammie & co.
Neale Pickett
neale@woozle.org
Fri Nov 22 18:43:40 2002
So then, Rob Hooft <rob@hooft.net> is all like:
> Is this calculation for the few words in one message really
> time-determining? There is another way of caching: Make a dictionary
> that maps count-tuples to spam probabilities.
>
> (1,0) -> 0.155
> (0,1) -> 0.844
> etc.
Hmm! I did a small test against 200 spam, 200 ham, to see what tuple
frequency is like. I got 21833 unique words, but only 869 unique values
for (spamcount, hamcount). I also got gnuplot to animate out a cool
spinning 3D graph of it just as my boss walked by :)
The 20 most frequently occurring (spamcount, hamcount) tuples were:
(15, 0) 57
(18, 0) 57
(19, 0) 62
(10, 5) 65
(0, 20) 79
(4, 10) 98
(5, 10) 99
(9, 5) 113
(14, 0) 137
(0, 15) 153
(13, 0) 162
(8, 0) 288
(4, 5) 303
(10, 0) 317
(5, 5) 334
(9, 0) 611
(0, 10) 659
(4, 0) 4814
(5, 0) 4979
(0, 5) 6045
The 20 least frequently occurring were:
(0, 130) 1
(0, 135) 1
(0, 140) 1
(0, 155) 1
(0, 165) 1
(0, 175) 1
(0, 250) 1
(0, 285) 1
(0, 310) 1
(0, 725) 1
(0, 75) 1
(10, 30) 1
(10, 40) 1
(10, 85) 1
(100, 40) 1
(101, 115) 1
(101, 20) 1
(101, 25) 1
(102, 115) 1
(102, 20) 1
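For anyone who wants to repeat the tally, here is a rough sketch of how such a count could be produced. It assumes the word data is available as a plain dict mapping each word to its (spamcount, hamcount) pair; the function name and the sample data are made up for illustration and are not the actual test script.

```python
from collections import Counter


def tuple_frequencies(wordcounts):
    """Count how many words share each (spamcount, hamcount) tuple.

    `wordcounts` is assumed to map word -> (spamcount, hamcount).
    """
    return Counter(wordcounts.values())


# Made-up sample data, only to show the shape of the input.
wordcounts = {
    "viagra": (5, 0),
    "cheap": (5, 0),
    "python": (0, 5),
    "free": (5, 5),
}

freq = tuple_frequencies(wordcounts)

# Print in ascending order of frequency, like the lists above.
for counts, nwords in sorted(freq.items(), key=lambda item: item[1]):
    print(counts, nwords)
```

The number of distinct keys in `freq` is the "unique values" figure, and `len(wordcounts)` is the unique word count.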
A graph of frequencies looks a lot like a hyperbola:
<http://woozle.org/~neale/tmp/b.png>
The more I think about this caching scheme, the more I like it. It
deals well with the fact that most of the words only occur a few times,
saves memory, and it will speed up pickles *and* databases. It's going
into the playground branch.
> I definitely wouldn't move the calculation into the wordinfo class. It
> is a different task, so it "should" (design) be a separate class....
Using this scheme, the calculation has to go back into the Bayes (or
Classifier) class. WordInfo only stores counters now.
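Roughly, the split could look like the sketch below. This is not the playground code: the class layout, the method names, and the smoothing constants (strength 0.45, prior 0.5) are assumptions made for illustration. WordInfo holds nothing but counters, and the classifier keeps one probability per distinct (spamcount, hamcount) tuple.

```python
# Illustrative sketch only; names and constants are assumptions.
UNKNOWN_WORD_STRENGTH = 0.45  # assumed smoothing strength
UNKNOWN_WORD_PROB = 0.5       # assumed probability for unseen words


class WordInfo:
    """Per-word record: counters only, no stored probability."""
    __slots__ = ("spamcount", "hamcount")

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount


class Classifier:
    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.wordinfo = {}    # word -> WordInfo
        self._probcache = {}  # (spamcount, hamcount) -> probability

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            info = self.wordinfo.setdefault(word, WordInfo())
            if is_spam:
                info.spamcount += 1
            else:
                info.hamcount += 1
        # Probabilities depend on nspam/nham, so cached values are stale.
        self._probcache.clear()

    def probability(self, word):
        info = self.wordinfo.get(word)
        if info is None:
            return UNKNOWN_WORD_PROB
        key = (info.spamcount, info.hamcount)
        prob = self._probcache.get(key)
        if prob is None:
            prob = self._compute(*key)
            self._probcache[key] = prob
        return prob

    def _compute(self, spamcount, hamcount):
        spamratio = spamcount / self.nspam if self.nspam else 0.0
        hamratio = hamcount / self.nham if self.nham else 0.0
        total = spamratio + hamratio
        raw = spamratio / total if total else UNKNOWN_WORD_PROB
        n = spamcount + hamcount
        s, x = UNKNOWN_WORD_STRENGTH, UNKNOWN_WORD_PROB
        # Pull low-count words toward the prior.
        return (s * x + n * raw) / (s + n)


c = Classifier()
c.learn(["viagra", "cheap", "free"], is_spam=True)
c.learn(["meeting", "free"], is_spam=False)
for w in ("viagra", "cheap", "meeting", "free"):
    print(w, round(c.probability(w), 3))
```

In this toy run "viagra" and "cheap" share the tuple (1, 0), so four words need only three cached probabilities. The cache is thrown away on every training call because the probabilities depend on the message totals as well as the per-word counts.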
Neale