[Spambayes] SpamBayes now filters less than 50% of my spam.

Fri Nov 14 18:23:54 EST 2003

> From: Kenny Pitt

> Because all scores are based on ratios, every additional message that
> you train on dilutes the effect of the prior tokens in that corpus that
> don't appear in the new message.  For example, if I start with 50
> trained hams and have a token that has been seen 10 times, it
> contributes a ham probability of 0.2 (10/50) to the scoring.  If I later
> train on 50 more hams that don't contain that token, its ham
> probability drops to 0.1 (10/100).

> So training on more ham can actually cause you to miss good messages
> that you were previously classifying correctly.

I don't think your examples are correct, mathematically. Your 0.2 and
0.1 don't take into account how often the token is seen in the other
corpus. The actual formula used by spambayes for the probability of a
given token is more complex, and requires looking at the size of both
corpuses and the number of occurrences in each. See
http://www.paulgraham.com/spam.html for a good explanation of the
general method used.

So, in your example, if the token never occurred in a spam, your
single-token ham probabilities would actually be something more like 0.99
and 0.99 instead of 0.2 and 0.1.
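
To make the arithmetic concrete, here is a rough sketch in Python of a
ratio-based token score of the kind described above. It is not the
spambayes source. The function name, the spam corpus size of 50, and the
prior settings (0.5 and 0.45) are assumptions picked for illustration.

```python
def spam_probability(ham_count, spam_count, nham, nspam,
                     unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Estimate P(spam | token) from counts in BOTH corpora.

    Hypothetical helper for illustration only; the two prior
    parameters are assumed values, not taken from the message above.
    """
    n = ham_count + spam_count
    if n == 0:
        # Token never seen in training: fall back to the neutral prior.
        return unknown_word_prob
    # Fraction of each corpus that contains the token.
    ham_ratio = ham_count / nham if nham else 0.0
    spam_ratio = spam_count / nspam if nspam else 0.0
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw ratio toward the prior; tokens seen more often
    # (larger n) are pulled less.
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * raw) / (s + n)


# Kenny's example: token seen in 10 hams, never in spam.
# nspam=50 is an assumed spam corpus size.
for nham in (50, 100):
    p_spam = spam_probability(ham_count=10, spam_count=0, nham=nham, nspam=50)
    print(nham, "hams -> ham probability", round(1 - p_spam, 3))
```

Under these assumptions the ham probability comes out at about 0.98 with
either 50 or 100 trained hams. Since the token never appears in spam, the
extra 50 hams change the ham ratio but not the final score.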

The way the probabilities are actually computed, the more data you have,
the more accurate your probabilities get, and the better the filter will
perform. Up to a point, of course... there will always be diminishing
returns. There's not much difference in practical terms between 99.7%
accuracy and 99.8% accuracy.

Regards,
	Ryan