[Spambayes] Re: suggestions for training and filtering?

Jacob Farmer jacob-spambayes-list at statisticalanomaly.com
Wed Dec 3 13:54:47 EST 2003


Seth,

I started out with about 300 of each.  I would always train on ham and 
unsures, and I would delete the spam.  However, as the ham count in my
database grew, I would classify some additional spam messages to keep
the ratio even.  When I did that, I tried to train on a block of about
100 messages (~3 days' worth for me) at a time, so that I had a diverse
enough sample to avoid skewing my results.
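
A rough sketch of that balancing step in Python may make it concrete. It
is only an illustration under stated assumptions: the function name, its
arguments, and the rule of waiting for a full block are hypothetical, and
nothing here calls the Spambayes code itself.

```python
def next_spam_block(ham_trained, spam_trained, saved_spam, block_size=100):
    """Pick the next block of saved spam to train on, or [] if none is due.

    ham_trained / spam_trained: message counts already in the database.
    saved_spam: spam messages set aside but not yet trained (hypothetical).
    block_size: about 100 messages, per the description above.
    """
    # Extra spam is only needed when ham has pulled ahead of spam.
    if ham_trained <= spam_trained:
        return []
    # Assumption: wait for a full block so the sample stays diverse.
    if len(saved_spam) < block_size:
        return []
    return saved_spam[:block_size]
```

With 400 ham and 300 spam trained and 150 spam saved, this returns the
first 100 saved messages; with even counts, or with fewer than 100 saved,
it returns an empty list.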

Once I got to the point where most of my messages were being properly 
sorted, I just started deleting the spam.  To be honest, I still train
on my unsures, but I get very, very few of them.

In addition, if I notice the number of unsures rising (or even messages
that should be spam being marked as ham), I'll start saving new spam, and
when I have enough to be at about a 1:1 ratio with my saved ham, I'll nuke
the database and retrain it using the mail I've collected recently.
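
The retrain decision can be sketched the same way. This is a minimal
illustration, not a Spambayes feature: the function name, the 5% unsure
threshold, and the 10% tolerance on "about 1:1" are all assumptions made
up for the example.

```python
def retrain_action(recent_unsure, recent_total, saved_ham, saved_spam,
                   unsure_threshold=0.05, tolerance=0.1):
    """Return "ok", "collect", or "retrain" for the current mail state.

    recent_unsure / recent_total: unsures among recently filtered mail.
    saved_ham / saved_spam: messages kept aside for a possible retrain.
    unsure_threshold and tolerance are hypothetical tuning values.
    """
    # Nothing filtered recently, or unsures still rare: leave things alone.
    if recent_total == 0 or recent_unsure / recent_total < unsure_threshold:
        return "ok"
    # Unsures are up: retrain once saved spam is roughly level with saved ham.
    if saved_ham > 0 and saved_spam >= (1 - tolerance) * saved_ham:
        return "retrain"
    # Otherwise keep saving new spam until the ratio is about 1:1.
    return "collect"
```

For instance, 10 unsures in 100 recent messages with 300 ham and 100 spam
saved gives "collect"; the same unsure rate with 290 spam saved gives
"retrain".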

This system has worked out really well for me so far.

Jacob



Seth Goodman wrote:
> He says he isn't training at all anymore.  My question for Jacob is what was
> the initial size of his training set and what were his criteria for training
> before he reached his present state?



