[Spambayes] Certain messages always flagged as Spam

Tim Peters tim.one at comcast.net
Sun Nov 23 14:55:45 EST 2003


[Zaidman, Jakov]
> The status message says that the database has over 9000 good messages
> and 64 bad messages.

Ouch.

> The huge number of good messages is based on the training with all
> of the old mail stored in my mailbox. I was under the impression that
> it would be a good thing to give it as much good mail as possible
> to train on, in order to reduce the possibility of false positives.

That's a decent strategy provided that you also train on an approximately
equal number of spam.

> What adjustments would you suggest I should make?

Tony gave you the best advice the first time:  balance your training data,
even if that means you train on only 64 ham too.  Better would be to train
on more spam.  You should find that once you've trained on (just) a couple
hundred of each, the classification performance will be very good.  In
early tests, training on many thousands of each made a difference only if
you were worried about the fourth decimal digit.  An advantage to training on
relatively few messages of each kind is that your classifier will respond
much more quickly to training on mistakes.  You don't have to train on
everything to get excellent results.
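
A minimal sketch of the balancing advice above, in Python.  The function
name balanced_sample, its per_class and seed parameters, and the integer
stand-ins for messages are all illustrative assumptions, not part of
SpamBayes.  It only chooses which messages to train on; the training itself
still goes through whatever SpamBayes front end you use.

```python
import random


def balanced_sample(ham, spam, per_class=200, seed=0):
    """Pick equal-sized random subsets of ham and spam for training.

    per_class caps each side at "a couple hundred"; the smaller
    collection limits both sides so the result stays balanced.
    """
    n = min(len(ham), len(spam), per_class)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(ham, n), rng.sample(spam, n)


# Hypothetical stand-ins for the mailbox in question:
# roughly 9000 good messages and 64 spam.
ham_ids = list(range(9000))
spam_ids = list(range(64))

ham_train, spam_train = balanced_sample(ham_ids, spam_ids)
print(len(ham_train), len(spam_train))  # 64 64
```

With only 64 spam on hand, both sides are cut to 64.  As more spam
arrives, rerunning the selection grows both sides together until each
reaches the per_class cap.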



