[Spambayes] lots of unsures, heavily biased towards spam

Seth Goodman sethg at goodmanassociates.com
Sat Feb 3 22:41:24 CET 2007


spambayes-bounces at python.org wrote on Saturday, February 03, 2007
3:15 PM -0600:

> I'm getting what the title says.  I very rarely see ham classified as
> unsure, and I get a few hundred unsures per day.  I keep training on
> the unsures, which means my database accumulates lots more spam than
> ham over time.  Is there anything I can do to help reduce the number
> of messages classified as unsure without hurting Spambayes' ability to
> correctly recognize ham?

If your training set has much more spam than ham, you can train on ham
that already scores properly.  Whether you choose ham that scores very
low already (typical ham) or the highest scoring ham (unusual ham) is
your preference.  If you use the Outlook plugin, just move the ham you
want to train on to the unsure folder and tell Spambayes it's not spam.
How much imbalance between trained ham and spam is too much is also up
for debate.  Some people have reported good results with 5:1 and even
10:1 imbalance, while others do poorly under those conditions.  I try to
keep mine from going beyond 2:1, and I train on my highest-scoring ham
to correct it.
This seems to work better for me than training only on unsures.
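
As a rough illustration of the rebalancing arithmetic, here is a minimal
Python sketch.  It is not part of Spambayes and uses no Spambayes API:
the function names, the (message_id, score) tuples and the 2:1 default
are all assumptions made for the example.  It works out how many more
ham you would need to train to get back under a chosen spam:ham ratio,
then picks the highest-scoring ham first.

```python
import math


def ham_needed(nspam, nham, max_ratio=2.0):
    """Extra ham to train so that nspam / nham <= max_ratio.

    max_ratio=2.0 mirrors the 2:1 limit mentioned above; it is a
    personal preference, not a Spambayes setting.
    """
    target_ham = math.ceil(nspam / max_ratio)
    return max(0, target_ham - nham)


def pick_ham_to_train(untrained_ham, nspam, nham, max_ratio=2.0):
    """Choose which correctly classified ham to train on.

    untrained_ham is a hypothetical list of (message_id, score) pairs
    for ham that has not been trained yet.  The highest-scoring
    (most unusual) ham is chosen first.
    """
    need = ham_needed(nspam, nham, max_ratio)
    ranked = sorted(untrained_ham, key=lambda item: item[1], reverse=True)
    return [msg_id for msg_id, _score in ranked[:need]]


if __name__ == "__main__":
    # 1000 trained spam and 200 trained ham is a 5:1 imbalance.
    print(ham_needed(1000, 200))
    candidates = [("a", 0.01), ("b", 0.15), ("c", 0.08)]
    print(pick_ham_to_train(candidates, nspam=10, nham=3))
```

If you would rather train on typical ham, sort in ascending order
instead (drop reverse=True).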

Another underappreciated issue with all self-learning classifiers is
that they are very sensitive to training mistakes.  Training a couple of
messages in the wrong category can really change the outcome, and the
Outlook plugin doesn't tell you which messages are trained and whether
you trained them as ham or spam.  You have to figure this out
indirectly, usually by rescoring all your messages and looking for
obvious errors.  With a large set of messages, the likelihood of
spotting a training mistake goes down.  Fortunately, it's not hard to
start from scratch, so this is a reasonable thing to try if things are
not working as well as they should.
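
If you keep your own record of what you trained, the "rescore and look
for obvious errors" step can be sketched roughly as below.  This is
only an illustration under several assumptions: the (message_id, label,
score) tuples are hypothetical, scores are taken to be on a 0 to 1
scale, and the 0.20 and 0.90 cutoffs are assumed values that you should
replace with whatever thresholds your own setup uses.  A mistrained
message may also pull its own score toward the wrong label, so a check
like this will not necessarily catch every mistake.

```python
def find_suspects(messages, ham_cutoff=0.20, spam_cutoff=0.90):
    """Flag messages whose rescored value contradicts their label.

    messages: iterable of (message_id, label, score), where label is
    "ham" or "spam" and score is between 0.0 and 1.0.  Both cutoffs
    are assumptions for this sketch.
    """
    suspects = []
    for msg_id, label, score in messages:
        if label == "ham" and score >= spam_cutoff:
            suspects.append((msg_id, label, score))
        elif label == "spam" and score <= ham_cutoff:
            suspects.append((msg_id, label, score))
    # List the strongest disagreements first for manual review.
    suspects.sort(key=lambda item: abs(item[2] - 0.5), reverse=True)
    return suspects


if __name__ == "__main__":
    rescored = [
        ("m1", "ham", 0.02),
        ("m2", "ham", 0.97),   # labelled ham but scores as spam
        ("m3", "spam", 0.99),
        ("m4", "spam", 0.05),  # labelled spam but scores as ham
        ("m5", "ham", 0.50),   # unsure, not an obvious error
    ]
    for msg_id, label, score in find_suspects(rescored):
        print(msg_id, label, score)
```

Anything it flags still needs a human look before you retrain or start
over.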

Please let us know what you try, what helps and what doesn't.

--
Seth Goodman


