[Spambayes] Moving closer to Gary's ideal
Tim Peters
tim.one@comcast.net
Sat, 21 Sep 2002 16:19:16 -0400
[Neil Schemenauer, on fiddling the Robinson scheme to ignore words w/
spamprob less than 0.1 away from 0.5]
> For me, that's enough to match the performance of the default setup:
Cool!
> [Classifier]
> use_robinson_probability: True
> use_robinson_combining: True
> max_discriminators: 1500
>
> [TestDriver]
> spam_cutoff: 0.6
Damn, I wish we had a better handle on guessing a good value for this a
priori. It seems to get less sensitive the more the spam and ham means get
separated.
When a current batch of mini-tests finishes running, I'm going to check in a
new option:
[Classifier]
# When scoring a message, ignore all words with
# abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
# By default (0.0), nothing is ignored.
# Tim got a pretty clear improvement in f-n rate on his hasn't-improved-in-
# a-long-time large c.l.py test by using 0.1. No other values have been
# tried yet.
# Neil Schemenauer also reported good results from 0.1, making the all-
# Robinson scheme match the all-default Graham-like scheme on a smaller
# and different corpus.
# NOTE: Changing this may change the best spam_cutoff value for your
# corpus. Since one effect is to separate the means more, you'll probably
# want a higher spam_cutoff.
robinson_minimum_prob_strength: 0.0
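
The filtering rule in the comment above can be sketched in a few lines of
Python. This is just an illustration of the abs(spamprob - 0.5) test, not
the actual spambayes classifier code; the names (strong_words,
min_prob_strength, the spamprobs dict) are made up for the example.

```python
# Sketch of the word-filtering step described above: when scoring a
# message, drop every word whose spam probability sits within
# min_prob_strength of the neutral 0.5 point.  With the default 0.0,
# nothing is filtered out.  Names here are illustrative only.

def strong_words(spamprobs, min_prob_strength=0.1):
    """Keep only words whose spamprob is at least min_prob_strength
    away from 0.5 (the "I know nothing" probability)."""
    return {word: prob for word, prob in spamprobs.items()
            if abs(prob - 0.5) >= min_prob_strength}

# Words near 0.5 carry little evidence either way and get ignored:
probs = {"viagra": 0.99, "meeting": 0.48, "click": 0.55, "python": 0.02}
print(strong_words(probs))        # drops "meeting" and "click"
print(strong_words(probs, 0.0))   # 0.0 keeps everything
```

Dropping the near-0.5 words leaves only strong discriminators in the
score, which is why the spam and ham means separate more, and why a
higher spam_cutoff may then work better.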