[spambayes-dev] A spectacular false positive

Tim Peters tim.one at comcast.net
Sat Nov 15 16:42:47 EST 2003


[Rob Hooft]
> I am now training on all mistakes and unsures, plus all ham scoring
> more than 0.02 and all spam scoring less than 0.99.

Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to match?
Then you can describe the same thing as just "mistakes and unsures" (which
is what I mean by "mistake-based training").
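
To make that equivalence concrete, here is a minimal Python sketch of
mistake-based training with the cutoffs set to 0.02 and 0.99.  The function
names, the label strings, and the treatment of scores that land exactly on a
cutoff are assumptions made for illustration; this is not the spambayes API.

```python
# Hypothetical cutoffs, taken from the thresholds quoted above.
HAM_CUTOFF = 0.02
SPAM_CUTOFF = 0.99


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability in [0.0, 1.0] to 'ham', 'spam' or 'unsure'.

    Assumption: a score exactly at a cutoff counts as a confident verdict,
    so that only ham scoring *more than* 0.02 and spam scoring *less than*
    0.99 are treated as worth training on.
    """
    if score <= ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def should_train(true_label, score):
    """Mistake-based training: train only on mistakes and unsures.

    true_label is the human verdict, 'ham' or 'spam'.  A message is trained
    on whenever the classifier's verdict differs from it, which covers both
    outright mistakes and anything that landed in the unsure range.
    """
    return classify(score) != true_label
```

With these cutoffs, a ham scoring 0.01 and a spam scoring 0.995 are both
skipped, while a ham scoring 0.05 (now an unsure) and a spam scoring 0.01 (a
mistake) are both trained on.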

> Total trained messages is ~250 both ways, and 97+% of spam scores 0.99+
> leaving only 1-2 new spams per day, less than 1 unsure per day, and
> ~1 new ham per day to train on.
>
> I am really pleased by the performance of this training schedule. It
> is not as brittle as mistake-based training, but it still ignores the
> obvious repeating things like CVS log messages of which I receive a
> few dozen per day. It keeps the database reasonably small, but not
> really hapax driven.

Sigh -- we need solid research on training disciplines that work great in
real-life use, respecting that anything requiring human input will barely
get used except by geeks who never tire of watching the training process.
We're getting a lot of anecdotal evidence (which ain't the same thing) about
different schemes, and I'm afraid no two of the developers train in the same
way anymore.  It's a good thing the algorithm appears to have turned out to
be robust against almost any training insanity short of what Outlook users
can stumble into <0.9 wink>.

Oh well.  In the meantime, I think your msg would be a great addition to
Richie's spambayes wiki.  I know *you* know where that is, because a
coworker found your

    http://www.entrian.com/sbwiki/RobsSetup

there yesterday, and it was exactly what he needed to set up our code with
his maildir-based system.



