[spambayes-dev] imbalance within ham or spam training sets?

Tim Peters tim.one at comcast.net
Mon Nov 3 16:49:46 EST 2003


[T. Alexander Popiel]
> No.  Training on other mail which does not contain the word does not
> affect the score for a word at all ...

It's a bit curious that this is true only so long as the word has appeared
in only one kind of training data (only in spam, or only in ham).  As soon
as a word appears in at least one of each, training on msgs that don't
contain the word can change the word's score.

Example:  suppose we've trained on 100 ham and 100 spam, and "lathe"
appeared in exactly one ham.  Its by-counting spamprob is then

>>> h = 1./100
>>> s = 0./100
>>> s/(h+s)
0.0
>>>

So long as we never see "lathe" in spam, s's numerator is 0 no matter how
many additional ham and spam we train on, so s is 0, so the by-counting
spamprob remains 0/(h+0) = 0.
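
In general, the by-counting spamprob is just the spam ratio over the sum of
the two ratios.  Here's the same arithmetic as a tiny helper -- a sketch only,
with a name and signature of my own choosing; the real classifier code does
more than this (clamping, and the Bayesian adjustment shown further down):

    def by_counting_spamprob(hamcount, spamcount, nham, nspam):
        # Fraction of trained ham (spam) messages containing the word.
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        # The "by counting" guess: how spammy the word looks from raw ratios.
        return spamratio / (hamratio + spamratio)

The first example above is then just by_counting_spamprob(1, 0, 100, 100),
which is 0.0.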

Change the example so we've seen "lathe" in one ham and one spam:

>>> h = 1./100
>>> s = 1./100
>>> s/(h+s)
0.5
>>>

The by-counting spamprob is then 0.5, which makes fine intuitive sense.  Now
suppose we train on 100 more ham, and don't see "lathe" again:

>>> h = 1./200
>>> s = 1./100
>>> s/(h+s)
0.66666666666666674
>>>

Now "lathe" seems spammy!  It should, since we've seen it in a greater
percentage of spam than ham.  I'm not sure we've got the best guess to 17
significant digits, though <wink>.  Make the imbalance wilder and the
by-counting spamprob gets wilder too:

>>> h = 1./20000
>>> s = 1./100
>>> s/(h+s)
0.99502487562189057
>>>

That offends my intuition -- the word is so rare (2 of 20100 msgs) that it's
hard to believe that 99.5% is a sane guess.  The Bayesian adjustment knocks
it down a lot based on how few times it's been seen in total:

>>> (.45*.5 + 2.0*_)/(.45 + 2.0)
0.90410193928317584
>>>
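
Spelled out, that one-liner is the (s*x + n*p)/(s + n) blend, where x = 0.5 is
the prior for a word we know nothing about, s = 0.45 is the strength given to
that prior, n = 2 is the total number of times the word has been seen, and p
is the by-counting spamprob (the `_` left over from the previous result).  I
believe those first two correspond to the unknown_word_prob and
unknown_word_strength options; treat the helper below as a sketch with my own
naming:

    def adjusted_spamprob(by_counting_prob, n, s=0.45, x=0.5):
        # Blend the raw ratio with the unknown-word prior; the fewer times
        # the word has been seen (small n), the closer the result stays to x.
        return (s * x + n * by_counting_prob) / (s + n)

Calling adjusted_spamprob(0.99502487562189057, 2) reproduces the
0.90410193928317584 above.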

But that still seems like a high guess to me.  The experimental ham/spam
imbalance option knocked it down a lot more.  Unfortunately, that also moved
spamprobs a lot closer to 0.5 for words that appeared lots of times in the
over-represented category, and that made it a Bad Idea overall.

It's tempting to ignore words that haven't appeared in at least N messages
total (for some N).  Alas, Graham's original algorithm had a gimmick like
that, and testing said it worked better not to have such a cutoff.  And for
the mistake-based training many of us have fallen into, scoring hapaxes is
very important.
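
For reference, the kind of cutoff meant here is trivial to sketch -- something
like the following, which is hypothetical (N and the helper name are mine) and
is *not* what the classifier does, since testing showed it hurt:

    MIN_TOTAL_COUNT = 3   # hypothetical N; not an actual option

    def want_clue(hamcount, spamcount):
        # Skip words seen fewer than N times in total.  In particular this
        # throws away all hapaxes, which is exactly why it hurts under
        # mistake-based training.
        return hamcount + spamcount >= MIN_TOTAL_COUNT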

So we can't ignore rare words -- but in the presence of strong imbalance, I
think we're still missing a trick.



