[Spambayes] Sequential Test Results
Tim Peters
tim.one@comcast.net
Sun, 06 Oct 2002 01:00:56 -0400
[Jim Bublitz]
> ...
> Playing around with Spambayes, I get slightly better results if I
> ...
> c) drop robinson_probability_s to .05,
That's a very low value. I find this way of rewriting Gary's adjustment
easier to reason about:
    s*x + n*p         x - p
    --------- = p + -------
      s + n         1 + n/s
This makes it clear that it moves p in the direction of x, but less so the
larger n is, or the smaller s is. For you, s=.05, and then that's
          x-p
    p + ------
        1+20*n
At n=1, that's p + (x-p)/21. The *interesting* <wink> thing there is that,
since you said you effectively removed Graham's mincount gimmick, under pure
Graham you *were* getting extreme spamprobs of 0.01 and 0.99 for words that
had been seen only once in the training data. Setting s to 0.05 gives a
very similar effect under Gary's adjustment. If x is 0.5, a word seen once,
only in ham (p=0) or only in spam (p=1), gets
    0 + .5/21 ~= 0.024
and
    1 + -.5/21 ~= 0.976
Those are really extreme probability estimates based on 1 measly occurrence
in training data, but perhaps this ties in to the unusual nature of your
data. For example, I've seen that low s helps ham message threads when a
typo or unusual word gets repeated in replies.
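
For anyone who wants to poke at the numbers, here is a minimal sketch of the
adjustment in both forms. The function names are made up for illustration and
are not taken from the Spambayes code. It assumes s plays the role of
robinson_probability_s, x is the prior assumed for a word with no training
evidence, p is the raw spamprob, and n is the number of times the word was
seen in training.

```python
def adjusted(p, n, s=0.05, x=0.5):
    """Gary's adjustment in its original form: (s*x + n*p) / (s + n)."""
    return (s * x + n * p) / (s + n)


def adjusted_rewritten(p, n, s=0.05, x=0.5):
    """The same quantity in the rewritten form: p + (x - p) / (1 + n/s)."""
    return p + (x - p) / (1.0 + n / s)


if __name__ == "__main__":
    # p=0.0 is a word seen only in ham, p=1.0 a word seen only in spam.
    for n in (1, 2, 5, 20):
        for p in (0.0, 1.0):
            a = adjusted(p, n)
            b = adjusted_rewritten(p, n)
            # The two forms should agree up to floating-point rounding.
            assert abs(a - b) < 1e-12
            print(f"n={n:2d}  p={p:.0f}  adjusted={a:.4f}")
```

Rerunning it with a larger s (say s=1.0) should show how much more slowly the
estimate leaves x when evidence is thin, which is the tradeoff at issue here.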