[Spambayes] SpamBayes now filters less than 50% of my spam.

Kenny Pitt kennypitt at hotmail.com
Fri Nov 14 15:19:35 EST 2003


Skip Montanaro wrote:
> I've developed a few seat-of-the-pants training maxims, both from
> personal experience and from reading what others have done:
> 
>     * Bigger is not always better, no matter what all those
>       enlargement messages would have you believe.

That's a very good point, and there's a lot to be said for training on
the minimum number of messages that you need to produce reasonable
results.

Because all scores are based on ratios, every additional message that
you train on dilutes the effect of the prior tokens in that corpus that
don't appear in the new message.  For example, if I start with 50
trained hams and have a token that has been seen 10 times, it
contributes a ham probability of 0.2 (10/50) to the scoring.  If I later
train on 50 more hams that don't contain that token, its ham
probability drops to 0.1 (10/100).
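
The arithmetic above can be reproduced with a minimal sketch. It models
only the simplified per-token ratio described in this message, not
SpamBayes's actual scoring code; the function name ham_ratio and its
parameters are hypothetical and chosen just for illustration.

```python
def ham_ratio(token_ham_count, total_hams):
    """Fraction of trained ham messages that contained the token.

    Hypothetical helper: it mirrors the simplified ratio in the text,
    not the real SpamBayes classifier.
    """
    if total_hams <= 0:
        raise ValueError("total_hams must be positive")
    return token_ham_count / total_hams


# 50 trained hams, token seen in 10 of them.
before = ham_ratio(10, 50)

# Train on 50 more hams, none of which contain the token:
# the count stays at 10 while the denominator doubles.
after = ham_ratio(10, 50 + 50)

print(f"before: {before:.2f}, after: {after:.2f}")
```

The token's count never changes in this sketch; only the denominator
grows, which is the dilution effect described above.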

So training on more ham can actually cause you to miss good messages
that you were previously classifying correctly.

-- 
Kenny Pitt



