[spambayes-dev] imbalance within ham or spam training sets?

Mon Nov 3 18:43:32 EST 2003

>> [T. Alexander Popiel]
>>> No.  Training on other mail which does not contain the word does not
>>> affect the score for a word at all ...

[Tim]
>> It's a bit curious that this is true only so long as the word has
>> appeared in only one kind of training data (only in spam, or only in
>> ham).  As soon as a word appears in at least one of each, training
>> on msgs that don't contain the word can change the word's score.
>> ...

[Alex]
> Yarg.  I stand corrected.
>
> Perhaps it's time to test a variation where the prob is based on
> hamcount and spamcount instead of hamratio and spamratio.  Hrm.
> *tap, tap, tap*  I'll be back in a few hours...

Well, they're all the same if the # of training ham == the # of training
spam.  Computing spamprobs based on ratios is a first attempt at surviving
in the face of unbalanced training data.  For example, if a token appeared
in 99 of 100 spam, and 100 of 10,000 ham, a spamprob of about 0.5 (99/(99+100))
doesn't make intuitive sense.  In effect, computing based on ratios (s/(s+h)
where s = 99/100 and h=100/10000) answers what would happen *if* we had
trained on equal numbers of each, while keeping the percentages of ham and
spam containing the token fixed.  In the example, if 99 of 100 spam
contained a given token, then our best guess is that, if we had seen 10,000
spam instead, we would have seen the token in 9,900 of those.  Then
9900/(9900+100) gives the same result as the current s/(s+h).
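
The arithmetic in that example can be checked with a short Python sketch.
The function names below are made up for illustration and are not taken
from the spambayes code, and the sketch leaves out any further adjustment
a real classifier might apply to the raw estimate.

```python
def spamprob_by_count(spamcount, hamcount):
    """Raw-count estimate: fraction of token sightings that were in spam."""
    return spamcount / (spamcount + hamcount)


def spamprob_by_ratio(spamcount, nspam, hamcount, nham):
    """Ratio estimate: s/(s+h), with s and h the fractions of spam and ham
    messages that contain the token."""
    s = spamcount / nspam
    h = hamcount / nham
    return s / (s + h)


# Token seen in 99 of 100 spam and 100 of 10,000 ham.
by_count = spamprob_by_count(99, 100)              # 99/199, roughly 0.5
by_ratio = spamprob_by_ratio(99, 100, 100, 10000)  # 0.99/(0.99+0.01)

# Extrapolate the spam side to 10,000 messages at the same 99% rate,
# then use raw counts on the (now balanced) data.
extrapolated = spamprob_by_count(9900, 100)        # 9900/10000

print(by_count, by_ratio, extrapolated)
```

The ratio estimate and the extrapolated raw-count estimate agree (both
about 0.99), while the raw count on the unbalanced data stays near 0.5.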

IOW, s/(s+h) gives the result that "prob is based on hamcount and spamcount"
gives if we extrapolate our actual training data to what it would be if it
were balanced.  If it's already balanced, the computed spamprob is the same
whether computed by raw count or by ratio.  So if you try raw count, the
only interesting tests would be on unbalanced training data.
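
To make the balanced vs. unbalanced distinction concrete, here is a rough
Python sketch.  The helper names and the token counts are invented for
illustration (they are not from spambayes or from any real corpus); it
only compares the two formulas discussed above.

```python
def by_count(spamcount, hamcount):
    # Hypothetical helper: spamprob from raw token counts only.
    return spamcount / (spamcount + hamcount)


def by_ratio(spamcount, nspam, hamcount, nham):
    # Hypothetical helper: spamprob from per-class fractions, s/(s+h).
    s = spamcount / nspam
    h = hamcount / nham
    return s / (s + h)


# Balanced training: 1,000 spam and 1,000 ham; token in 30 spam, 10 ham.
balanced_count = by_count(30, 10)
balanced_ratio = by_ratio(30, 1000, 10, 1000)

# Now train on 9,000 more ham, none of which contain the token.
# The raw counts are untouched, but the ham fraction h shrinks.
unbalanced_count = by_count(30, 10)
unbalanced_ratio = by_ratio(30, 1000, 10, 10000)

# A token seen only in spam is unaffected by extra ham either way,
# since h stays 0 no matter how many ham were trained.
spam_only_before = by_ratio(30, 1000, 0, 1000)
spam_only_after = by_ratio(30, 1000, 0, 10000)

print(balanced_count, balanced_ratio)
print(unbalanced_count, unbalanced_ratio)
print(spam_only_before, spam_only_after)
```

With balanced data the two methods agree.  After the extra ham, only the
ratio-based value moves, and only for the token that has been seen in both
ham and spam.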