[Spambayes] incremental training strategies

Jim Bublitz jbublitz@nwinternet.com
Mon Oct 28 17:34:58 2002


On 28-Oct-02 T. Alexander Popiel wrote:
> In message:  <15805.26237.16266.425547@montanaro.dyndns.org>
>              Skip Montanaro <skip@pobox.com> writes:

>> I am now running hammie.py from my procmailrc file, but not yet
>> doing any filtering based on the results.  I trained it on my
>> current setup (7000 hams, 5000 spams).  Should I:

>>    * train it on every message which passes through my inbox

>>    * only train it on messages which it incorrectly classifies

>>    * some other scheme

>>?  Or is that not yet known?

> Speaking from a theoretical purity standpoint, I suspect that
> training it on everything that came through would be
> 'cleaner'... but I have no idea if in practise it would work any
> better than just training on the mistakes and unsure.
 
> Try out variations, and post results?

I ran tests in chronological order where I trained on 4000 of each
type of msg and then:

a. Tested 8000 msgs of each type without retraining

b. Tested 8000 msgs of each type, retraining on all new msgs after
each batch of 100 spam/100 ham

(b) gave clearly better results, by nearly an order of magnitude, but
that's only an error rate of 1% or 2% without retraining vs. 0.1% or
0.2% at most with it, so in absolute terms the effect might not be
huge, depending on mail volume.
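
For anyone who wants to try the same comparison, here is a rough
sketch of the two test runs in Python. It is not the code I ran, and
it does not use the Spambayes classifier: ToyClassifier, run_test and
the sample messages are all made up for illustration. The only part
taken from the description above is the protocol: score a batch in
chronological order, then optionally retrain on every message in it.

```python
from collections import Counter
import math


class ToyClassifier:
    """Hypothetical stand-in token-count classifier (NOT Spambayes)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    def train(self, tokens, is_spam):
        self.counts[is_spam].update(tokens)
        self.totals[is_spam] += len(tokens)

    def is_spam(self, tokens):
        # Sum of smoothed log-likelihood ratios; positive means spam.
        score = 0.0
        for t in tokens:
            p_spam = (self.counts[True][t] + 1) / (self.totals[True] + 2)
            p_ham = (self.counts[False][t] + 1) / (self.totals[False] + 2)
            score += math.log(p_spam / p_ham)
        return score > 0


def run_test(classifier, stream, batch_size=200, retrain=False):
    """Score `stream`, a chronological list of (tokens, is_spam) pairs,
    and return the number of misclassified messages. With retrain=True,
    each batch is trained on (using its true labels) after it has been
    scored, so a message is never trained on before it is tested."""
    errors = 0
    for start in range(0, len(stream), batch_size):
        batch = stream[start:start + batch_size]
        errors += sum(classifier.is_spam(toks) != label
                      for toks, label in batch)
        if retrain:
            for toks, label in batch:
                classifier.train(toks, label)
    return errors


def trained_classifier(initial):
    c = ToyClassifier()
    for toks, label in initial:
        c.train(toks, label)
    return c


# Contrived data: the vocabulary shifts between training and testing.
initial = [(["viagra"], True), (["meeting"], False)] * 10
stream = [(["mortgage"], True), (["lunch"], False)] * 20

# (a) no retraining; (b) retrain after each batch.
errors_a = run_test(trained_classifier(initial), stream, batch_size=10)
errors_b = run_test(trained_classifier(initial), stream, batch_size=10,
                    retrain=True)
print("a:", errors_a, "errors   b:", errors_b, "errors")
```

The default batch_size of 200 corresponds to 100 spam plus 100 ham
per batch. The toy data is rigged so that retraining helps; it only
shows the mechanics and says nothing about real error rates.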

In theory a closed-loop system should give more accurate results,
but it also requires some measures to make sure the retraining data
is clean or performance will probably degrade more quickly than if
you never retrain at all.


Jim