[Spambayes] A Couple of Training Questions

David Abrahams dave at boost-consulting.com
Tue May 8 16:40:04 CEST 2007


Q1:

I have a cron job that runs sb_imapfilter.py to train periodically
from my ham/spam corpus folders.

AFAICT, that will train only as-yet-untrained messages.  My
understanding is that the numbers of trained hams and spams are
supposed to stay roughly balanced.  If I start out with 1000 messages in each folder, then dump
10 into just the ham folder, the next training run will train 10 hams
and no spams.  Is that very bad for future performance, or is that
temporary imbalance strongly mitigated by the overall size of the two
folders?
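
To put a number on the scenario above, here is a quick sketch in Python.
The helper name ham_spam_ratio is made up for illustration and is not part
of SpamBayes. It only quantifies the imbalance; it says nothing about
whether an imbalance of that size actually hurts classification.

```python
def ham_spam_ratio(nham, nspam):
    """Return trained-ham count divided by trained-spam count.

    Hypothetical helper for illustration only; not a SpamBayes function.
    """
    return nham / nspam


# Counts from the scenario in the question: 1000 of each to start,
# then 10 extra hams trained with no new spams.
before = ham_spam_ratio(1000, 1000)
after = ham_spam_ratio(1000 + 10, 1000)

print("ratio before: %.3f" % before)
print("ratio after:  %.3f" % after)
```

So the extra 10 hams move the ratio by about one percent, which is the
"temporary imbalance" I'm asking about.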

Q2:

I notice that the incremental training of sb_imapfilter trains all
(as-yet-untrained) hams, then all (as-yet-untrained) spams.  However,
Skip's train-to-exhaustion script tries to interleave training of hams
and spams.  Is that interleaving only important for
train-to-exhaustion, or should all methods use it?
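
By "interleave" I mean something like the sketch below. This is my own
illustration in Python, not the code from Skip's script or from
sb_imapfilter; the function name interleave and the sample message labels
are made up. It alternates one ham, one spam, and once the shorter list
runs out it just keeps going with what is left of the longer one.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking that the shorter list has run out


def interleave(hams, spams):
    """Yield (message, is_spam) pairs, alternating ham and spam.

    Illustrative only: a real trainer would pass each pair to whatever
    training call it uses instead of collecting them in a list.
    """
    for ham, spam in zip_longest(hams, spams, fillvalue=_MISSING):
        if ham is not _MISSING:
            yield ham, False
        if spam is not _MISSING:
            yield spam, True


# Three untrained hams and one untrained spam (placeholder labels).
order = list(interleave(["h1", "h2", "h3"], ["s1"]))
for msg, is_spam in order:
    print(msg, "spam" if is_spam else "ham")
```

The batch behaviour I described for sb_imapfilter would instead amount to
all the hams followed by all the spams.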

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Don't Miss BoostCon 2007! ==> http://www.boostcon.com


