[Spambayes] more information please

Sat Jun 3 11:31:46 CEST 2006

> I have been leaving the category as SpamBayes set it for messages  
> it had correctly identified, so presumably have been "re-training"  
> it on ones it already got right.  I thought this was the correct  
> way, confirming that SB was right in those instances, or does it  
> mean that a bias of any sort could develop?

It's not 100% clear what the best training regime is.  Simulations so
far, as well as anecdotal evidence, suggest that a 'mistake-based'
training regime is probably best: for example, training only on
false positives, false negatives and unsures.  An alternative is to
train only on 'nonedge' messages (e.g. those scoring between 10% and 90%).
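
As a rough illustration of the two regimes described above, here is a
minimal sketch.  It is not SpamBayes code: the function name, the regime
names and the cutoff values are all made up for the example, and scores
are assumed to be spam probabilities between 0 and 1.

```python
def should_train(score, is_spam, regime="mistake",
                 ham_cutoff=0.20, spam_cutoff=0.90,
                 edge_low=0.10, edge_high=0.90):
    """Decide whether a message should be trained on.

    score   -- the classifier's spam probability for the message (0.0 to 1.0)
    is_spam -- the correct classification, as decided by the user
    All cutoff values are illustrative assumptions, not recommended settings.
    """
    if regime == "mistake":
        # Work out what the classifier said about the message.
        if score >= spam_cutoff:
            predicted = "spam"
        elif score <= ham_cutoff:
            predicted = "ham"
        else:
            predicted = "unsure"
        actual = "spam" if is_spam else "ham"
        # Train on false positives, false negatives and unsures;
        # skip messages the classifier already got right.
        return predicted != actual
    if regime == "nonedge":
        # Train only on messages that did not score near either extreme.
        return edge_low < score < edge_high
    raise ValueError("unknown regime: %r" % (regime,))
```

With these assumed cutoffs, a spam that scored 0.95 would be skipped
under either regime, while a ham that scored 0.95 (a false positive)
would be trained under 'mistake' but not under 'nonedge'.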

One reason these probably work better is that the database ends up
smaller.  If 'random' real words are added to a spam, they are then
less likely to be in your database, and words that are not in the
database are ignored.
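
To make the database-size point concrete, here is a toy sketch.  The
function and the word lists are invented for the example, and it only
models the "words not in the database are ignored" behaviour, not how
SpamBayes actually combines word probabilities.

```python
def tokens_that_count(message_tokens, database):
    """Return the tokens that can influence the score.

    Only tokens already present in the training database are kept;
    anything the database has never seen is ignored.
    """
    return [token for token in message_tokens if token in database]


# Hypothetical databases: the larger one has also learned some
# ordinary words from extra training on correctly classified mail.
small_db = {"viagra", "mortgage", "meeting"}
large_db = small_db | {"garden", "river", "holiday"}

# A spam padded with 'random' real words.
padded_spam = ["viagra", "garden", "river", "holiday"]

counted_small = tokens_that_count(padded_spam, small_db)
counted_large = tokens_that_count(padded_spam, large_db)
```

With the smaller database only "viagra" counts, so the padding words
have no effect; with the larger one all four words take part in the
score.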

> It would be good if clear instructions similar to the above were  
> included in the interface page below the list of mails processed so  
> it's there for easy reference.

If you click on the "Help" icon at the bottom of the page, it says
pretty much what I said in this email, and has a link to the wiki where
training options are discussed in more detail.  (Since there isn't a
definitive answer about what is best, it's hard to have a concise
summary distributed with the software.)  If you can think of ways
that the help text could be improved, please let us know.  IIRC I
simply wrote what I thought of at the time, and it hasn't been
reviewed since.

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.