[Spambayes] Training on unusual ham - revisited

Seth Goodman sethg at GoodmanAssociates.com
Sun Feb 12 22:49:43 CET 2006


On Saturday, February 11, 2006 10:11 PM -0600, Tony Meyer wrote:

> [Seth Goodman]
> > I think the problem is more that Spambayes doesn't do anything to
> > encourage sensible training schemes.
>
> I don't agree here.  The Outlook plug-in encourages train-on-error,
> because the simplest training is clicking the 'Spam' or 'Not Spam'
> buttons for mistakes (or dragging the messages to their proper
> place).  Train-on-error (fpfnunsure) seems to be one of the best
> regimes based on the testing done so far.

The difficulty with train-on-error has more to do with the threshold
setup than with the scheme itself.  The spam threshold is set high
enough to minimize false positives (ham classified as spam), but not so
high as to overwhelm the user with unsures.  To make Spambayes a good
deal more friendly to the user, you can increase the ham threshold
slightly to greatly reduce the amount of ham winding up in the unsure
folder.  I do this myself and find it very effective.  It has a
negligible effect on false negatives, no effect on spam classified as
unsure, and mainly reduces ham classified as unsure.  The result is
that once properly trained, virtually everything in the unsure folder
is spam.  Training on all of it, all the time, is guaranteed to give
you a badly skewed, as well as overly large, database.
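
To make the threshold discussion concrete, here is a toy sketch of how
the two cutoffs partition the score range.  The 0.20/0.90 values are,
as far as I know, the stock defaults, and the exact boundary handling
inside Spambayes may differ; this is just an illustration, not the
actual classification code:

    # Toy illustration only; not the actual Spambayes classification code.
    HAM_CUTOFF = 0.20    # raising this slightly sends less ham to unsure
    SPAM_CUTOFF = 0.90   # kept high to minimize false positives

    def classify(score):
        """Map a message score in [0, 1] to ham/unsure/spam."""
        if score < HAM_CUTOFF:
            return "ham"
        elif score >= SPAM_CUTOFF:
            return "spam"
        else:
            return "unsure"

Everything scoring between the two cutoffs lands in unsure, so raising
HAM_CUTOFF narrows that band from below, which is why borderline ham
stops landing in the unsure folder.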

From what I can tell, depending on thresholds, most people see on the
order of a few percent of their spam going to the unsure folder.  If
you have a large spam load, even a few percent of it is too many
messages to train on daily.  The scenario for which train-on-errors
works well is when the unsure folder contains roughly equal amounts of
ham and spam, assuming the number of unsures is much larger than false
positives plus false negatives.  Setting your ham threshold low enough
to make that true will solve the ham/spam imbalance problem, but it is
still undesirable from two standpoints:

1) If your spam load is higher than your ham load, you will have to
divert a significant fraction of your ham to unsure for the ham/spam
ratio of that folder to be near unity.  For example, let's say 5% of
your spam classifies as unsure, and your incoming ham:spam ratio is
1:3.  You would need 15% of your ham to classify as unsure to maintain
a balanced training set using train-on-errors.  This is not a
convenient way to operate the system.

2) If your spam load is significant, you will be training on a lot of
messages every day.  For example, if your spam load is 200/day, and 5%
of those classify as unsure, you will train on 10 unsure spam every day.
To keep things balanced, you should also train on 10 ham every day.
That is 20 messages/day, or over 7,000 per year, which is not
desirable.  (Both of these calculations are spelled out in the sketch
below.)
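
To make the arithmetic in both points explicit, here is a small sketch
using the illustrative numbers from the text (they are examples, not
measurements):

    # Point 1: ham fraction needed in unsure to balance train-on-errors.
    ham_per_day, spam_per_day = 67, 200        # roughly a 1:3 ham:spam ratio
    unsure_spam_fraction = 0.05                # 5% of spam lands in unsure
    unsure_spam = spam_per_day * unsure_spam_fraction    # 10 spam/day
    needed_ham_fraction = unsure_spam / ham_per_day      # ~0.15, i.e. 15%

    # Point 2: training volume per day and per year under train-on-errors.
    trained_per_day = 2 * unsure_spam          # 10 unsure spam + 10 ham
    trained_per_year = trained_per_day * 365   # ~7,300 messages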


> [Seth Goodman]
> > It wouldn't be responsible for the
> > developers to force one scheme or another on the users, since there
> > is no proof that any one particular scheme would work for the
> > majority of users.
>
> I think that the testing that has been done certainly indicates that
> fpfnunsure, nonedge, and tte are all superior to train-on-everything
> in almost any situation.  (My TREC tests are the main contra-example
> I can think of, but they are clouded by the lack of the unsure range).

Yes, people should be aware that train-on-everything not only works
less well than other methods, but also quickly leads to very large
databases.


> I think that the developers should set things up so that the simplest
> regime for users is one that is most likely to give results, while
> allowing users to use something else if they like.  I think sb_server
> does this fairly well, since it's easy to change the default actions
> so that you get train-on-everything with the least amount of work, or
> nonedge with the least amount of work.

I use the Outlook plug-in rather than sb_server, so I can't comment.  Of
the training methods that you mentioned above that work well, the only
one that is easy in the Outlook plug-in is train-on-errors
(perpetually).


> [Seth Goodman]
> > For example, a lot of spam has "word salad" added as hidden text to
> > confuse Bayesian filters like Spambayes.
> [...]
>
> Random 'word salad' has most often been shown to help statistical
> filters like SpamBayes, not harm it.

Agreed.  Those are not the problem.


<...>

> More clever spam, that include less random noise (e.g. newspaper
> clippings) are more of an issue.

These are the problem and you are better off not training on them.
Unfortunately, you have to look at them one way or another to determine
that.


> It is hard to try and explain this art to the average Outlook user,
> however.  (Suggestions are welcome ;)

I completely agree.  Because it is hard to explain, hard for some
people to understand, and more trouble than many people (most?) will
want to go to, I think a little more automation in the Outlook plug-in
to make the better training schemes more practical might help.  For
example, one idea previously mentioned was an option to force initial
training to use an equal number of ham and spam by training on the
lesser of the two counts presented for training.  After initial
training, another possibility is a checkbox to force a filtering pass
each time you train on another message.  When you have half a dozen
similar spam messages in your unsure folder, training on just one of
them will often take care of the others, but unless you go through the
manual effort of filtering after each training event, you won't know
that.  This second idea alone would probably be the best way to
discourage overtraining.  To keep the database from becoming
imbalanced, an informational dialog should probably pop up when a
training event causes the imbalance to cross a given threshold.
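
For what it's worth, the "train on the lesser of the two counts" idea
could be as simple as the sketch below.  The message lists and the
train() callable are hypothetical stand-ins, not the plug-in's actual
API:

    # Sketch of balanced initial training; train() is a placeholder for
    # whatever the plug-in would call to train on a single message.
    import random

    def balanced_initial_train(ham_msgs, spam_msgs, train):
        """Train on an equal number of ham and spam, using the lesser count."""
        n = min(len(ham_msgs), len(spam_msgs))
        for msg in random.sample(ham_msgs, n):
            train(msg, is_spam=False)
        for msg in random.sample(spam_msgs, n):
            train(msg, is_spam=True)
        return n    # how many of each class were actually trained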


> [Seth Goodman]
> > Finally, unless Spambayes implements some form of pruning old
> > messages from the database, [...]
>
> Note that if pruning is done, it's not clear that age should be the
> deciding factor.  Then what happens to that once-a-year-ham?

I completely agree; age is just one possibility and not necessarily the
best.  If you did use age as the pruning criterion, and there are
once-a-year ham or spam you want to retain, you would set the pruning
age to significantly more than a year, say double that.  This does mean
that your training scheme should preferably not create a huge database
over that period of time.

The one advantage of age-based pruning is the notion that the message
stream changes over time.  From what I have (anecdotally) observed, it
changes a lot slower than I would have guessed.  This is good, as it
allows keeping a relatively long history in the database.  You don't
have to keep all trained messages around, only the set of tokens you
trained on for each one.  That is still a large database, but it is only
used occasionally.  For example, let's say you set a pruning age of two
years and set the pruning interval to 1% of that.  Pruning would only
occur around once a week.
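
As a rough sketch of what that could look like, keeping only a
timestamped token set per trained message and untraining anything past
the cutoff (the record layout and the untrain() callable are
hypothetical stand-ins, not the actual Spambayes store):

    import time

    PRUNE_AGE = 2 * 365 * 86400         # two years, in seconds
    PRUNE_INTERVAL = PRUNE_AGE // 100   # 1% of that: roughly once a week

    def prune(history, untrain, now=None):
        """history is a list of (tokens, is_spam, trained_at) records.
        Untrain and drop anything older than PRUNE_AGE; return the rest."""
        if now is None:
            now = time.time()
        keep = []
        for tokens, is_spam, trained_at in history:
            if now - trained_at > PRUNE_AGE:
                untrain(tokens, is_spam)
            else:
                keep.append((tokens, is_spam, trained_at))
        return keep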

--
Seth Goodman


