[Spambayes] Moving closer to Gary's ideal

Tim Peters tim.one@comcast.net
Sat, 21 Sep 2002 16:19:16 -0400


[Neil Schemenauer, on fiddling the Robinson scheme to ignore words w/
 spamprob less than 0.1 away from 0.5]

> For me, that's enough to match the performance of the default setup:

Cool!

>     [Classifier]
>     use_robinson_probability: True
>     use_robinson_combining: True
>     max_discriminators: 1500
>
>     [TestDriver]
>     spam_cutoff: 0.6

Damn, I wish we had a better handle on guessing a good value for this a
priori.  It seems to get less sensitive the more the spam and ham means get
separated.

When a current batch of mini-tests finishes running, I'm going to check in a
new option:

[Classifier]
# When scoring a message, ignore all words with
# abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
# By default (0.0), nothing is ignored.
# Tim got a pretty clear improvement in f-n rate on his hasn't-improved-in-
# a-long-time large c.l.py test by using 0.1.  No other values have been
# tried yet.
# Neil Schemenauer also reported good results from 0.1, making the all-
# Robinson scheme match the all-default Graham-like scheme on a smaller
# and different corpus.
# NOTE:  Changing this may change the best spam_cutoff value for your
# corpus.  Since one effect is to separate the means more, you'll probably
# want a higher spam_cutoff.
robinson_minimum_prob_strength: 0.0
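
To make the rule in that comment concrete, here is a minimal sketch of the filtering step only. It is not the code being checked in: the function name `strong_words`, the list-of-pairs layout, and the spamprobs are all made up for illustration. The only thing taken from the option text is the test itself, that a word is ignored when abs(word.spamprob - 0.5) is less than the configured strength.

```python
def strong_words(word_probs, minimum_prob_strength=0.0):
    """Keep only (word, spamprob) pairs at least minimum_prob_strength from 0.5.

    Hypothetical helper: a word is ignored when
    abs(spamprob - 0.5) < minimum_prob_strength, as the option comment says.
    """
    return [(word, prob) for word, prob in word_probs
            if abs(prob - 0.5) >= minimum_prob_strength]


# Made-up spamprobs, purely for illustration.
clues = [("viagra", 0.99), ("python", 0.02), ("the", 0.52),
         ("meeting", 0.41), ("free", 0.65)]

# Default strength 0.0: nothing is ignored.
print(strong_words(clues))

# Strength 0.1: "the" and "meeting" sit too close to 0.5 and are dropped.
print(strong_words(clues, 0.1))
```

Whatever combining scheme is in use would then see only the surviving words, which is presumably why the best spam_cutoff can shift when this option changes.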