[Spambayes] training problem?

Tim Peters tim.one at comcast.net
Tue Dec 2 15:53:16 EST 2003


[Kenny Pitt]
> SpamBayes will use at most 150 tokens to determine the spam
> probability, while the complete message has 684.  SpamBayes chooses
> the 150 strongest tokens (i.e. those with probabilities farthest from
> a neutral 0.5), and the rest are not used so are only shown in the
> Message Tokens section.

That's right.  Note that this 150 is the default value of the Classifier's
max_discriminators option.  Setting it much higher than that can cause
numerical problems in the inverse chi-squared probability computation,
specifically at the

    # XXX If x2 is very large, exp(-m) will underflow to 0.

comment in chi2Q().  Testing showed that the exact value of
max_discriminators didn't matter much, provided it was at least 30 (or so).
Then again, most emails don't have 150 tokens, let alone 150 strong ones.
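
A minimal sketch of that underflow, for illustration only.  It assumes
chi2Q(x2, v) sums the usual series for an even number of degrees of freedom v,
starting from exp(-x2/2); the body below is a reconstruction, not a copy of the
SpamBayes source.  It also assumes v grows with the number of tokens used, so a
very large max_discriminators makes both x2 and v large.  chi2Q_logspace is a
hypothetical comparison helper that is not part of SpamBayes.

```python
import math

def chi2Q(x2, v):
    """Return prob(chisq >= x2) with v degrees of freedom (v must be even).

    Illustrative reconstruction: the series starts from exp(-x2/2).
    """
    assert v & 1 == 0
    m = x2 / 2.0
    # If x2 is very large, exp(-m) underflows to 0.0 and every later
    # term, being a multiple of it, stays 0.0 as well.
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Roundoff can push the sum a hair above 1.0, so clamp it.
    return min(total, 1.0)

def chi2Q_logspace(x2, v):
    """Hypothetical variant: same series, summed from log-space terms."""
    assert v & 1 == 0
    m = x2 / 2.0
    if m == 0.0:
        return 1.0
    # log of term i is -m + i*log(m) - log(i!)
    log_terms = [-m + i * math.log(m) - math.lgamma(i + 1)
                 for i in range(v // 2)]
    peak = max(log_terms)
    # Factor out the largest term so the exponentials stay representable.
    scaled_sum = sum(math.exp(t - peak) for t in log_terms)
    return min(math.exp(peak) * scaled_sum, 1.0)

# x2 = 1500 lies well below the mean (v = 2000) of the distribution, so
# the true tail probability is close to 1.
print(chi2Q(1500.0, 2000))           # 0.0, because exp(-750) underflows
print(chi2Q_logspace(1500.0, 2000))  # close to 1
```

With v = 300 the two functions agree closely for moderate x2, e.g.
chi2Q(100.0, 300), since exp(-50) is comfortably representable.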



