[spambayes-dev] RE: [Spambayes] question regarding training

Thu Aug 12 09:47:12 CEST 2004

> My sense is that when users have an imbalance problem,
> overwhelmingly the situation is that of this user, i.e. more 
> spam than ham. I'm about to say a couple of things that 
> depend on that assumption, so I just want to state it.

I would agree with that assumption, in general.

>> Firstly, if you are not already, then doing "train on mistakes"
>> is a good idea. This should reduce the imbalance, and make it
>> grow less quickly.
> 
> I don't see why.

True, train-on-mistakes might not reduce the imbalance compared to
train-on-everything.  It reduces the imbalance only if the percentage of
mistakes that are spam is lower than the percentage of incoming mail that is
spam.  I should really have used "might" instead of "should" there.  In some
cases, it will, however.
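
To make that condition concrete, here is a small worked example.  The
message counts are invented purely for illustration, and spam_share is a
hypothetical helper, not anything in SpamBayes.

```python
def spam_share(n_spam, n_ham):
    """Fraction of a set of messages that is spam."""
    return n_spam / (n_spam + n_ham)

# Invented numbers: 1000 incoming messages, 900 of them spam.
# Train-on-everything trains on all of them.
incoming = spam_share(900, 100)

# Case A: 60 mistakes/unsures, 45 of them spam.  The trained set is less
# spam-heavy than the incoming mail, so train-on-mistakes helps the balance.
case_a = spam_share(45, 15)

# Case B: 60 mistakes/unsures, 57 of them spam.  The trained set is more
# spam-heavy than the incoming mail, so the imbalance gets worse.
case_b = spam_share(57, 3)

print(incoming, case_a, case_b)
```

Either way only 60 messages are trained instead of 1000, which is the
slower-growth point.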

The imbalance almost certainly will grow less quickly, though, because the
database size will grow much, much more slowly.

> The expectation should be that users will
> tune their cutoff values so that most of what goes into the 
> unsure folder is spam. If a user then processes every unsure 
> message into the database, this will increase, not decrease, 
> the imbalance.

I'm not sure that there is an expectation that users will so tune the
cutoffs, but I could be wrong.  I like my cutoffs so that most of what goes
in the unsure folder is unsure (I don't mean that facetiously - mail that
*I* am also unsure about).  I believe it would be fair to say that unsure
messages tend toward spam, and I think I've seen work that shows that ham
tends to be more homogeneous than spam (which makes much logical sense,
although logical sense has little to do with any of this <wink>).  I
wouldn't expect a user to raise the ham cutoff to reduce the
amount of ham in the unsure folder, though (to me an unsure ham is much
better than a false negative).

I think one of the main ways that people can help the imbalance, if they are
already doing train-on-mistakes, is reducing the spam threshold a bit.  Both
Outlook and non-Outlook ship with a cutoff of 0.9, and I think quite a few people
can get much less spam in their unsure box with a cutoff more like 0.8.
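
As a rough illustration of what lowering the spam cutoff does, here is a
minimal sketch.  The classify function is hypothetical (it is not the
SpamBayes API), the ham cutoff of 0.2 and the handling of scores exactly on
a boundary are assumptions, and the scores are made up.

```python
def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Bucket a score in [0, 1] into ham / unsure / spam."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

# Made-up scores for four messages.
scores = [0.05, 0.50, 0.85, 0.95]

# With the shipped spam cutoff of 0.9, the 0.85 message lands in unsure.
at_090 = [classify(s, spam_cutoff=0.9) for s in scores]

# With a spam cutoff of 0.8, the same message goes straight to spam and
# never needs to be trained on as an unsure.
at_080 = [classify(s, spam_cutoff=0.8) for s in scores]

print(at_090)
print(at_080)
```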

> Depending (possibly) on your settings, moving messages to the
> spam folder, even manually, will process them into the 
> database. Right?

Yes, my mistake.  I long ago turned off both the incremental training
options, and so ofttimes forget about them.  For me, manually moving a
message to the spam folder does no training, but by default it will.

> To me, the solution to the problem seems obvious and almost
> absurdly easy to implement: When the imbalance reaches a 
> certain level (determined by the Spambayes gurus), have the 
> program start training on every nth message it classifies as 
> ham. Do this until the desired balance is restored.

A while back now, I tried doing testing with various forms of auto-balancing
training.  The results were terrible.  I never managed to find time to
figure out why and how to resolve that, although I'd still like to.  In
fact, there is a feature request tracker still open (even assigned to me, I
think) that requests some sort of auto-balancing.

More recently, the reported success (by Skip with SpamBayes, and by others
with other things) of training-to-exhaustion, which implicitly keeps the
database balanced, makes me want to try that out, both with more testing and
in some sort of integrated fashion with sb_server/Outlook.
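
For anyone unfamiliar with the idea, training-to-exhaustion can be sketched
roughly as below: keep passing over the corpus, training only on messages
that are not yet scored correctly, until a full pass needs no training.
This is only a sketch under assumptions.  ToyClassifier is a deliberately
crude, hypothetical stand-in (not the SpamBayes classifier), and the cutoffs
and max_passes are illustrative values.

```python
from collections import Counter

class ToyClassifier:
    """Hypothetical stand-in for a real classifier; not the SpamBayes API."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Average per-token spam fraction; unseen tokens count as 0.5.
        probs = []
        for t in set(tokens):
            s, h = self.spam[t], self.ham[t]
            probs.append(s / (s + h) if s + h else 0.5)
        return sum(probs) / len(probs) if probs else 0.5

def train_to_exhaustion(clf, corpus, ham_cutoff=0.2, spam_cutoff=0.9,
                        max_passes=20):
    """Return the number of passes made over the corpus."""
    for passes in range(1, max_passes + 1):
        trained = 0
        for tokens, is_spam in corpus:
            score = clf.score(tokens)
            correct = score >= spam_cutoff if is_spam else score <= ham_cutoff
            if not correct:
                # Train only on mistakes and unsures.
                clf.learn(tokens, is_spam)
                trained += 1
        if trained == 0:
            return passes  # a clean pass: nothing left to learn
    return max_passes

corpus = [(["cheap", "pills"], True), (["lunch", "meeting"], False)]
clf = ToyClassifier()
passes = train_to_exhaustion(clf, corpus)
print(passes)
```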

I did not try the exact scheme outlined above when I was doing my testing.
It would be easy enough to do so, if only the time was available.  If anyone
would like to run the incremental testing setup, I'm happy to write the
above into an appropriate regime.
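
A minimal sketch of the quoted scheme (train on every nth message
classified as ham while the imbalance exceeds some level) might look like
this.  It is not the incremental testing regime itself; the function name,
the max_ratio of 4.0 and every_nth of 10 are all hypothetical, and picking
the real values is exactly the part that would need testing.

```python
def hams_to_auto_train(nspam, nham, classified_ham_count,
                       max_ratio=4.0, every_nth=10):
    """Return the 1-based positions of classified-as-ham messages that the
    scheme would train on, given the current trained spam/ham counts."""
    chosen = []
    for i in range(1, classified_ham_count + 1):
        # Imbalanced while trained spam exceeds max_ratio times trained ham.
        imbalanced = nspam > max_ratio * max(nham, 1)
        if imbalanced and i % every_nth == 0:
            chosen.append(i)
            nham += 1  # the database moves back toward balance
    return chosen

# 100 spam vs 24 ham trained: one extra ham restores the 4:1 ratio,
# so only the 10th classified ham is trained on.
barely_over = hams_to_auto_train(100, 24, 50)

# 1000 spam vs 100 ham trained: still imbalanced throughout.
far_over = hams_to_auto_train(1000, 100, 30)

print(barely_over, far_over)
```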

=Tony Meyer