[Spambayes] Trained two times as much spam as ham

Kenny Pitt kenny.pitt at gmail.com
Tue Jan 18 22:19:42 CET 2005


Rick Friedman wrote:
> I was just wondering about the ratio of spam to ham trained.
> 
> I've been training on errors & unsures. So far, I've trained 126 spams
> and 51 hams. I keep hearing that we should strive to keep the training
> ratio at about 1:1.

2.5:1 is not badly unbalanced.  Most times when people are having a problem
it is because their imbalance is 20:1, 50:1, or even worse.
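
For reference, the arithmetic behind that figure can be sketched as below. This is only an illustration using the counts quoted above; the function name is made up for this example and is not part of SpamBayes.

```python
def spam_to_ham_ratio(nspam, nham):
    """Return the spam:ham training ratio as a single number (spam per ham)."""
    if nham == 0:
        raise ValueError("no ham trained yet, ratio is undefined")
    return nspam / nham

# Counts from the original question: 126 spams and 51 hams trained.
ratio = spam_to_ham_ratio(126, 51)
print(f"{ratio:.1f}:1")  # prints "2.5:1"
```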

> Spambayes is working very well with the current training. I can't
> remember the last time an email was misclassified.

This is really the key criterion.  Imbalance can cause some issues in the
mathematical formulas, but whether or not that has any real effect on your
accuracy depends on your particular e-mail mix.  If your accuracy is fine
then you haven't got a problem even if your imbalance is 100:1.
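
To give a feel for why imbalance can matter mathematically, here is a minimal sketch of a per-word spam probability in the style of Gary Robinson's approach, where word counts are normalized by the number of messages trained in each category.  The function name and the constants s and x are assumptions chosen for illustration; this is not presented as the exact SpamBayes code.

```python
def word_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate how spammy a word is (0.0 = hammy, 1.0 = spammy).

    spam_count / ham_count: trained messages of each kind containing the word.
    nspam / nham: total trained messages of each kind (both assumed > 0).
    s, x: assumed prior strength and prior probability for rarely seen words.
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Pull the raw estimate toward the prior x when the word is rare.
    return (s * x + n * raw) / (s + n)

# Same word counts (seen in 5 spams and 1 ham), different training balance.
balanced = word_spam_prob(5, 1, nspam=100, nham=100)
lopsided = word_spam_prob(5, 1, nspam=1000, nham=10)
print(balanced > 0.5, lopsided < 0.5)  # prints "True True"
```

With very few hams trained, a single ham containing the word outweighs several spams containing it.  Whether that hurts in practice depends on the mail mix, as described above.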

> However, I do still
> get many unsures which, inevitably, turn out to be spam. I then train
> Spambayes on those unsures.

This is pretty typical, and I see the same effect in my own mail.  My good
messages have very similar characteristics that make them easy to identify
correctly.  On the other hand, there is a huge, constantly changing variety
of spam.

You may be able to help this some by adjusting your spam threshold.  On the
Filtering tab in SpamBayes Manager, you'll see a cutoff score in the Certain
Spam section that defaults to 90.0.  We set the default relatively high
because we want to minimize false positives.  However, most people can
reduce the threshold to get more unsures to classify as spam without causing
other problems.  75 is probably a reasonable value.  I personally run mine
at 60 and haven't had a single false positive since I last retrained, but I
wouldn't necessarily consider that typical.
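
As a rough sketch of what lowering the cutoff does, the snippet below maps a score on a 0-100 scale to one of three buckets.  The function name, the unsure cutoff of 15.0, and the use of ">=" at the boundaries are assumptions made for this illustration; only the 90.0 default and the 75 and 60 values come from the text above.

```python
def classify(score, spam_cutoff=90.0, unsure_cutoff=15.0):
    """Bucket a 0-100 score as 'spam', 'unsure', or 'ham'.

    unsure_cutoff is an assumed value for illustration only.
    """
    if score >= spam_cutoff:
        return "spam"
    if score >= unsure_cutoff:
        return "unsure"
    return "ham"

score = 80.0  # a spam that did not quite reach the default cutoff
print(classify(score))                    # prints "unsure"
print(classify(score, spam_cutoff=75.0))  # prints "spam"
print(classify(score, spam_cutoff=60.0))  # prints "spam"
```

The trade-off is that any good message scoring between the new and old cutoffs would now be filed as spam, which is why the default is conservative.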

> Obviously, my concern is that Spambayes' effectiveness will diminish
> as I continue to train more on more spam. The only time I seem to
> train as ham is when a ham email shows as unsure (which is few & far
> between). 
> 
> Am I right to be concerned about this, apparent, continually growing,
> imbalance in the training ratio? If so, what should I do about it?

Yes and no.  The nature of spam makes it highly likely that the imbalance
will continue to grow.  As developers, we are very concerned about this and
are trying to come up with some ideas to improve the situation.
Unfortunately, it's a difficult problem to solve in a general way.

In a practical sense, though, it probably won't become a huge problem for
you.  The imbalance will grow, but it probably won't grow fast enough to
reach dangerous levels.  If you do reach a point where accuracy is reduced,
you'll probably see more unsure hams for a while, and training on those will
pull the ratio back in line.

If it really becomes a problem, you can just reset your training and spend a
couple of days retraining SpamBayes from scratch.  We've found that it
doesn't take very long at all for SpamBayes to get back to very high
accuracy rates even when starting from nothing.

-- 
Kenny Pitt


