[spambayes-dev] A spectacular false positive

Rob Hooft rob at hooft.net
Sat Nov 15 04:27:57 EST 2003


Tim Peters wrote:

> I view that mostly as a danger of mistake-based training:  as I've mentioned
> before, mistake-based training tends toward being hapax-driven, and hapaxes
> are brittle.  There's nothing *inherently* spammy about, say, 16384, and
> because that's a power of 2 and I'm a computer geek, that *would* have
> appeared in several training ham if I hadn't fallen into mistake-based
> training (yes, 16384 had indeed appeared in one training spam).

I am now training on all mistakes and unsures, plus all ham scoring more 
than 0.02 and all spam scoring less than 0.99. The total number of trained 
messages is ~250 both ways, and 97+% of spam scores 0.99+, leaving only 
1-2 new spams per day, less than 1 unsure per day, and ~1 new ham per day 
to train on.
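
The selection rule described above can be sketched in a few lines of 
Python. This is only an illustration, not SpamBayes code: the function 
name and constants are hypothetical, and only the two thresholds (0.02 
and 0.99) come from the post. It assumes the ham cutoff lies above 0.02 
and the spam cutoff below 0.99, so that every mistake and every unsure 
already falls inside one of the two ranges.

```python
# Hypothetical helper, not part of SpamBayes: decide whether a message
# whose true class is known should be added to the training set.
HAM_TRAIN_ABOVE = 0.02   # train on ham scoring more than this
SPAM_TRAIN_BELOW = 0.99  # train on spam scoring less than this


def should_train(score, is_spam):
    """Return True if the message should be trained on.

    score   -- spam probability assigned by the classifier, 0.0 to 1.0
    is_spam -- the true class of the message, as judged by the user
    """
    if is_spam:
        # Covers missed spam, unsure spam, and spam that was caught
        # but with a score short of 0.99.
        return score < SPAM_TRAIN_BELOW
    # Covers false positives, unsure ham, and ham that was accepted
    # but with a score above 0.02.
    return score > HAM_TRAIN_ABOVE


if __name__ == "__main__":
    examples = [(0.995, True), (0.95, True), (0.01, False), (0.05, False)]
    for score, is_spam in examples:
        kind = "spam" if is_spam else "ham"
        print(kind, score, should_train(score, is_spam))
```

Messages that the classifier already scores at the extremes, such as 
routine ham near 0.00, are skipped, which is what keeps the database 
from growing with every repeated message.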

I am really pleased with the performance of this training schedule. It is 
not as brittle as mistake-based training, but it still ignores the 
obvious repeating things like CVS log messages, of which I receive a few 
dozen per day. It keeps the database reasonably small, but not really 
hapax-driven.

Rob

-- 
Rob W.W. Hooft  ||  rob at hooft.net  ||  http://www.hooft.net/people/rob/



