[Spambayes] SpamBayes now filters less than 50% of my spam.

Kenny Pitt kennypitt at hotmail.com
Mon Nov 17 09:58:42 EST 2003


Ryan Malayter wrote:
>> From: Kenny Pitt
> 
>> Because all scores are based on ratios, every additional message that
>> you train on dilutes the effect of the prior tokens in that
>> corpus that
>> don't appear in the new message.  For example, if I start with 50
>> trained hams and have a token that has been seen 10 times, it
>> contributes a ham probability of 0.2 (10/50) to the scoring.
>> If I later
>> train on 50 more hams that don't contain that token, its ham
>> probability drops to 0.1 (10/100).
> 
>> So training on more ham can actually cause you to miss good messages
>> that you were previously classifying correctly.
> 
> I don't think your examples are correct, mathematically. Your 0.2 and
> 0.1 don't take into account how often the token is seen in the other
> corpus. The actual formula used by SpamBayes for the probability of a
> given token is more complex, and requires looking at the size of both
> corpora and the number of occurrences in each. See
> http://www.paulgraham.com/spam.html for a good explanation of the
> general method used.
> 
> So, in your example, if the token never occurred in a spam, your
> single-token ham probabilities would actually be something more like
> 0.99 and 0.99 instead of 0.2 and 0.1.
> 
> The way the probabilities are actually computed, the more data you
> have, the more accurate your probabilities get, and the better the
> filter will perform. Up to a point, of course... there will always be
> diminishing returns. There's not much difference in practical terms
> between 99.7% accuracy and 99.8% accuracy.

My mistake; I should have used the word *ratio* instead of *probability*
in this case.  There are a number of factors that contribute to the
final probability, including the unknown word strength, the Robinson
"rare word" adjustment, the optional ham/spam imbalance adjustment, etc.
I was trying to simplify somewhat, and hopefully did not over-simplify
to the point of incorrectness.

The intended point is this.  Once you have trained on a sufficient
number of messages, there are only two components that contribute
significantly to the spam probability of a single word: the ratio of ham
token count to total ham message count and the ratio of spam token count
to total spam message count.  If the spam ratio and the ham token count
do not change, the computed probability for that word will get
*spammier* if I increase the total *ham* message count.
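
To put numbers to that using the formula given below: suppose a token
has been seen in 10 of 50 trained hams (hr = 0.2) and in 2 of 50
trained spams (sr = 0.04).  Its base probability is p = 0.04 / (0.04 +
0.2) = 0.17.  Train on 50 more hams that don't contain the token and hr
drops to 0.1, so p = 0.04 / (0.04 + 0.1) = 0.29.  The token now scores
noticeably spammier even though nothing about its appearances in spam
changed.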

Note that none of this has any effect if the token has only ever been
seen in one corpus or the other.  Before the Robinson adjustment, the
spam probability of a token that has only appeared in spam will be 1.0
regardless of whether it appeared once in two messages or once in
200,000 messages.  Similarly, the spam probability of a token that has
only appeared in ham will be 0.0.  The total number of messages becomes
a factor at the point when a word has been seen in both ham and spam and
we need to decide which of the two is more likely.
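
You can see this directly from the formula below: plugging in hc = 0
gives hr = 0, so the base probability is p = sr / (sr + 0) = 1.0 no
matter what sr is.  Only the final adjustment step then pulls the
result back toward 0.5, and it does so based on how often the token has
been seen rather than on the sizes of the corpora.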

For those with a statistical mindset who want to follow this computation
all the way through, here is the complete formula (as I understand it)
that is used by SpamBayes to compute the spam probability of a single
word when the ham/spam imbalance adjustment is *not* enabled:

hc = ham token count
nh = total number of ham messages
sc = spam token count
ns = total number of spam messages
hr = ham ratio = hc / nh
sr = spam ratio = sc / ns
p = base spam probability = sr / (sr + hr)

S = unknown word strength (static factor = 0.45 by default)
x = unknown word probability (static factor = 0.5 by default)

n = number of messages in which the token was seen = hc + sc
sp = final spam probability = ((S * x) + (n * p)) / (S + n)
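
As a quick illustration (a standalone sketch of the formula above, not
the actual SpamBayes source), here is the whole computation in Python,
with the dilution example from earlier plugged in:

S = 0.45  # unknown word strength (default)
X = 0.5   # unknown word probability (default)

def spamprob(hc, nh, sc, ns):
    # hc/sc = ham/spam token counts; nh/ns = total ham/spam messages.
    # Assumes the token has been seen at least once (hr + sr > 0).
    hr = hc / nh              # ham ratio
    sr = sc / ns              # spam ratio
    p = sr / (sr + hr)        # base spam probability
    n = hc + sc               # messages in which the token was seen
    return (S * X + n * p) / (S + n)

print(spamprob(10, 50, 2, 50))   # ~0.18: 10 of 50 hams, 2 of 50 spams
print(spamprob(10, 100, 2, 50))  # ~0.29: same token after 50 more hams

Training on 50 more hams that never mention the token moves its score
from about 0.18 to about 0.29, which is exactly the ratio-dilution
effect described above.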

-- 
Kenny Pitt



