[Spambayes] A Couple of Training Questions

David Abrahams dave at boost-consulting.com
Tue May 8 16:40:04 CEST 2007


I have a cron job that runs sb_imapfilter.py to train periodically
from my ham/spam corpus folders.

AFAICT, that will train only as-yet-untrained messages.  I know
there's supposed to be something about keeping ham and spam
balanced. If I start out with 1000 messages in each folder, then dump
10 into just the ham folder, the next training run will train 10 hams
and no spams.  Is that very bad for future performance, or is that
temporary imbalance strongly mitigated by the overall size of the two
corpora?

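To put numbers on that scenario, here is a minimal sketch of the
arithmetic only. The function name ham_spam_ratio is made up for
illustration; it is not part of SpamBayes and says nothing about how
the classifier actually weighs the counts.

```python
def ham_spam_ratio(n_ham, n_spam):
    """Return trained hams per trained spam (hypothetical helper)."""
    if n_spam == 0:
        raise ValueError("no spam trained yet, ratio is undefined")
    return n_ham / n_spam

# Start balanced, then drop 10 new messages into the ham folder only.
before = ham_spam_ratio(1000, 1000)
after = ham_spam_ratio(1000 + 10, 1000)

# The run that trains only those 10 hams is entirely one-sided, but the
# cumulative counts move by about one percent.
print(before, after)
```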

I notice that the incremental training of sb_imapfilter trains all
(as-yet-untrained) hams, then all (as-yet-untrained) spams.  However,
Skip's train-to-exhaustion script tries to interleave training of Hams
and Spams.  Is that interleaving only important for
train-to-exhaustion, or should all methods use it?
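
For concreteness, the two orderings can be sketched as below. This is
only an illustration of the idea, assuming the untrained messages are
already available as two lists; the names sequential and interleave
are made up and are not taken from sb_imapfilter or from Skip's
script.

```python
from itertools import zip_longest

def sequential(hams, spams):
    """Yield (message, is_spam) pairs: every ham first, then every spam."""
    for ham in hams:
        yield ham, False
    for spam in spams:
        yield spam, True

def interleave(hams, spams):
    """Yield (message, is_spam) pairs, alternating ham and spam.

    When one list runs out, the rest of the other list follows.
    """
    missing = object()  # sentinel marking the exhausted side
    for ham, spam in zip_longest(hams, spams, fillvalue=missing):
        if ham is not missing:
            yield ham, False
        if spam is not missing:
            yield spam, True

hams = ["h1", "h2", "h3"]
spams = ["s1"]
seq_order = list(sequential(hams, spams))
mix_order = list(interleave(hams, spams))
```

Either generator could feed the same per-message training call; only
the order in which hams and spams reach it differs.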

Dave Abrahams
Boost Consulting

Don't Miss BoostCon 2007! ==> http://www.boostcon.com
