[Spambayes] Use for gray area in scoring range

Tim Peters tim.one@comcast.net
Sat, 21 Sep 2002 23:28:13 -0400


[Guido]
> Yesterday I said I didn't think there was much use for a gray area in
> the scoring range, where messages scoring somewhere in the middle
> would have to be manually inspected.  I think I'm going to revert that
> opinion, at least with the Gary Robinson formula.

That's mostly because it *has* a "middle ground" == a relatively small range
where a significant number of false negatives and false positives actually
live.  This wasn't true of the numbers produced by the Graham combination
scheme -- numbers coming out of that were almost always very near 0 or very
near 1.  In contrast, Gary's combination scheme almost never produces
numbers near the endpoints.
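
A minimal sketch of the contrast, assuming the commonly published forms of
the two combining rules (Graham's product formula, and Robinson's
geometric-mean P/Q formula rescaled to 0..1).  The function names and the
word probabilities are made up for illustration; this is not the project's
actual code.

```python
import math

def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def robinson_combine(probs):
    """Robinson-style combining (assumed form): geometric means, then
    S = (P - Q) / (P + Q), rescaled from -1..1 to 0..1."""
    n = len(probs)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0

# Hypothetical message: ten fairly spammy words and five fairly hammy ones.
word_probs = [0.9] * 10 + [0.2] * 5

graham_score = graham_combine(word_probs)
robinson_score = robinson_combine(word_probs)
print(graham_score, robinson_score)
```

On this made-up input the Graham-style score is pinned against 1, while the
Robinson-style score lands in the middle of the range, which is why a gray
area only becomes meaningful with the latter.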

> It would be handy to have this if you're trying to tweak the cutoff.
> You don't want the cutoff too low, or it'll give you too many false
> negatives.  But even stronger you don't want it too high, or it'll
> give too many false positives.

You've got those backwards (too low a cutoff gives too many false
positives, too high gives too many false negatives), but I know what you
mean <wnk>.
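
To make the direction concrete, here is a small sketch that counts both
kinds of error at a given cutoff.  It assumes a message is called spam when
its score is at or above the cutoff; `count_errors` and the scores are
hypothetical, not taken from any test run.

```python
def count_errors(ham_scores, spam_scores, cutoff):
    """Return (false_positives, false_negatives) for one cutoff.

    Assumes score >= cutoff means "call it spam".
    """
    false_positives = sum(1 for s in ham_scores if s >= cutoff)
    false_negatives = sum(1 for s in spam_scores if s < cutoff)
    return false_positives, false_negatives

# Made-up scores, for illustration only.
ham = [0.20, 0.35, 0.45, 0.52, 0.58]
spam = [0.48, 0.55, 0.62, 0.70, 0.85]

for cutoff in (0.40, 0.50, 0.60):
    print(cutoff, count_errors(ham, spam, cutoff))
```

Raising the cutoff trades false positives for false negatives, so the two
error counts move in opposite directions as the knob turns.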

> Specifying a gray area gives you a useful tool to see if you have
> set your cutoff right.

The practical difficulty here is that the gray area we're observing is very
narrow:  the difference between setting the cutoff at 0.5 or 0.575 in my
large test was the difference between seeming disaster and "does just as
well as our Graham-like scheme".  Move it to 0.60, and it heads back to
disasterland again.  It's great to have the knob, but it's sensitive, and so
far we've no idea how to choose it short of trial and error (it's easy to
choose if you've got the score histograms to stare at, but end users won't).
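
One way a gray area could be expressed is with two cutoffs instead of one,
so that scores between them are set aside for manual inspection.  This is
only a sketch: the names `ham_cutoff` and `spam_cutoff`, the 0.50/0.60
values, and the scores are all assumptions chosen for illustration.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way decision: 'ham', 'unsure' (the gray area), or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

def gray_area_report(ham_scores, spam_scores, ham_cutoff, spam_cutoff):
    """Count how the known ham and known spam fall into each bucket."""
    report = {"ham": {"ham": 0, "unsure": 0, "spam": 0},
              "spam": {"ham": 0, "unsure": 0, "spam": 0}}
    for truth, scores in (("ham", ham_scores), ("spam", spam_scores)):
        for score in scores:
            report[truth][classify(score, ham_cutoff, spam_cutoff)] += 1
    return report

# Made-up scores, for illustration only.
ham = [0.20, 0.35, 0.45, 0.52, 0.58]
spam = [0.48, 0.55, 0.62, 0.70, 0.85]

report = gray_area_report(ham, spam, ham_cutoff=0.50, spam_cutoff=0.60)
print(report)
```

If most of the mistakes end up in the "unsure" bucket and few remain outside
it, the cutoffs are probably bracketing the sensitive region; if not, they
need to move.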