[spambayes-dev] Mozilla SpamBayes "porting"

Sun Feb 22 23:13:21 EST 2004

[Miguel Vargas]
> Great.  I just confirmed that when I fixed my off-by-one error I got
> the correct value (0.822...).

Cool!

> This points to a problem in the section where I calculate the
> probability per token.  So then I noticed the 2 assertions from the
> probability function that I left out from my code
>
>          assert hamcount <= nham
>          assert spamcount <= nspam
>
> That is when I realized that we are counting the tokens differently.
> It looks like SpamBayes only counts a token once per message no
> matter how many times it appears.

There's a comment block about this in classifier.py, before the _add_msg()
method.  Graham's scheme was schizophrenic, counting duplicates more than
once during training, but only once during scoring.  See the comment for
more on that.
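
To make the difference concrete, here is a minimal sketch of the two ways of
counting tokens during training.  This is not the code in classifier.py; the
function names and the toy messages are made up for illustration.

```python
from collections import Counter

def train_once_per_message(messages):
    """Count each token at most once per message (duplicates weeded out)."""
    counts = Counter()
    for tokens in messages:
        counts.update(set(tokens))   # set() drops repeats within a message
    return counts

def train_every_instance(messages):
    """Count every occurrence of a token, repeats included."""
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts

# Two hypothetical ham messages, already tokenized.
ham = [["free", "free", "free", "lunch"], ["free", "beer"]]
nham = len(ham)

once = train_once_per_message(ham)
every = train_every_instance(ham)

# Once-per-message counts can never exceed the number of messages.
assert all(count <= nham for count in once.values())
# Every-instance counts can: "free" is seen 4 times in 2 messages.
assert every["free"] > nham
```

With once-per-message counting, no token count can exceed the number of
messages trained on, which is why the two assertions are safe in that scheme.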

> Mozilla counts every instance of a token,

If it's still following Graham's scheme in this respect, I expect Mozilla's
scheme also differs between training and scoring.

> so hamcount can easily be greater than nham; that is evident in the
> email I sent before

>>> ngood = 861, nbad = 759
>>> ...
>>> token 5: hamcount = 5802 spamcount = 4680

Then it's clear that a token can be counted more than once during training
(as in Graham's scheme), but that's not enough to say whether scoring does or
doesn't weed out duplicates.  The current spambayes algorithms weed out
duplicates during both training and scoring.
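
As a sanity check on the figures quoted above, here's a small sketch that
applies the two assertions as a test instead of a hard failure.  The helper
name is hypothetical, and it assumes ngood and nbad play the roles of nham
and nspam.

```python
def counts_consistent(hamcount, nham, spamcount, nspam):
    """True if the counts could come from once-per-message training."""
    return hamcount <= nham and spamcount <= nspam

# Figures quoted above; assumes ngood == nham and nbad == nspam.
ngood, nbad = 861, 759
token5_ok = counts_consistent(5802, ngood, 4680, nbad)
print(token5_ok)   # prints False: 5802 > 861, so duplicates were counted
```

A False here says only that training counted duplicates.  It says nothing
about what scoring does with them.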