[Spambayes] chi-squared versus "prob strength"
Rob Hooft
rob@hooft.net
Sun, 13 Oct 2002 09:27:11 +0200
Tim Peters wrote:
> Note the default robinson_minimum_prob_strength is still 0.1, meaning that
> we ignore words with spamprobs in 0.4 to 0.6.
>
> Since the chi-squared test is testing the hypothesis that the probs are
> uniformly distributed, systematically leaving a chunk of probs "out of the
> middle" may bias it.
>
> Rerunning my fat test with this option set to 0.0 (don't ignore any words)
> gave nearly identical final results, but I didn't like the fine-grained
> differences.
Here is my cmp run for this. The first column is with
robinson_minimum_prob_strength at 0.1, the second with 0.0.
The score distributions are tighter. Is that because we now have more
clues, so the chi-squared combining is more decisive?
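
For anyone following along, here is a minimal sketch of the two pieces being discussed: dropping clues whose spamprob lies within a minimum strength of 0.5, and combining the remaining probabilities with a chi-squared (Fisher-style) test. The function names (select_clues, chi2Q, chi_score) and the toy probabilities are illustrative assumptions, not the project's actual classifier code.

```python
import math

def select_clues(probs, min_strength=0.1):
    """Keep only spamprobs at least min_strength away from 0.5.

    With min_strength=0.1 this ignores probs strictly inside (0.4, 0.6);
    with 0.0 nothing is ignored.
    """
    return [p for p in probs if abs(p - 0.5) >= min_strength]

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)

def chi_score(probs):
    """Combine spamprobs (each strictly between 0 and 1) into one score.

    Under the hypothesis that the probs are uniformly distributed,
    -2 * sum(ln p) follows a chi-squared distribution with 2n degrees
    of freedom. Returns a value in [0, 1]; near 1 looks like spam.
    """
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(spam_stat, 2 * n)  # evidence for spam
    H = 1.0 - chi2Q(ham_stat, 2 * n)   # evidence for ham
    return (S - H + 1.0) / 2.0

probs = [0.95, 0.9, 0.55, 0.45, 0.52, 0.1]
with_cutoff = chi_score(select_clues(probs, 0.1))  # uses 3 clues
no_cutoff = chi_score(select_clues(probs, 0.0))    # uses all 6 clues
```

With min_strength at 0.0 every word contributes, so n (and with it the 2n degrees of freedom) grows. Whether that makes the scores more decisive in practice is an empirical question; the sketch only shows where the extra clues enter the statistic.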
cv2s -> cv3s
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
[...]
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
false positive percentages
    0.062  0.188  lost  +203.23%
    0.312  0.438  lost   +40.38%
    0.062  0.125  lost  +101.61%
    0.062  0.125  lost  +101.61%
    0.062  0.125  lost  +101.61%
    0.062  0.062  tied
    0.250  0.250  tied
    0.125  0.188  lost   +50.40%
    0.250  0.312  lost   +24.80%
    0.000  0.000  tied

won   0 times
tied  3 times
lost  7 times

total unique fp went from 20 to 29 lost   +45.00%
mean fp % went from 0.125 to 0.18125 lost   +45.00%

false negative percentages
    1.034  1.034  tied
    0.345  0.345  tied
    0.517  0.345  won    -33.27%
    0.517  0.517  tied
    1.207  1.207  tied
    0.862  0.690  won    -19.95%
    0.862  0.690  won    -19.95%
    0.345  0.345  tied
    0.517  0.517  tied
    1.034  0.862  won    -16.63%

won   4 times
tied  6 times
lost  0 times

total unique fn went from 42 to 38 won     -9.52%
mean fn % went from 0.724137931034 to 0.655172413793 won     -9.52%

ham mean                     ham sdev
   0.52    0.39  -25.00%        4.49    4.46   -0.67%
   0.72    0.60  -16.67%        6.62    6.59   -0.45%
   0.63    0.45  -28.57%        4.83    4.42   -8.49%
   0.60    0.41  -31.67%        4.83    4.51   -6.63%
   0.52    0.36  -30.77%        4.26    4.06   -4.69%
   0.43    0.31  -27.91%        4.21    3.82   -9.26%
   0.64    0.52  -18.75%        5.75    5.72   -0.52%
   0.68    0.51  -25.00%        5.63    5.39   -4.26%
   0.70    0.62  -11.43%        5.71    6.13   +7.36%
   0.41    0.31  -24.39%        3.65    3.24  -11.23%

ham mean and sdev for all runs
   0.59    0.45  -23.73%        5.07    4.94   -2.56%

spam mean                    spam sdev
  99.20   99.32   +0.12%        6.10    5.77   -5.41%
  99.70   99.71   +0.01%        3.45    3.80  +10.14%
  99.55   99.68   +0.13%        3.63    3.23  -11.02%
  99.38   99.44   +0.06%        6.34    6.27   -1.10%
  99.14   99.19   +0.05%        7.05    7.05   +0.00%
  99.40   99.47   +0.07%        4.72    5.24  +11.02%
  99.42   99.50   +0.08%        5.09    5.10   +0.20%
  99.41   99.51   +0.10%        4.55    4.99   +9.67%
  99.48   99.62   +0.14%        3.81    3.20  -16.01%
  99.31   99.39   +0.08%        6.09    5.97   -1.97%

spam mean and sdev for all runs
  99.40   99.48   +0.08%        5.22    5.21   -0.19%

ham/spam mean difference: 98.81 99.03 +0.22
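
To make the won/lost figures above easy to check by hand, here is a small sketch of the arithmetic as I read it from the numbers: each percentage is the change from the first run to the second, relative to the first, and a lower error rate counts as a win. The helper names (relative_change, verdict) are hypothetical and this is not cmp's actual source.

```python
def relative_change(before, after):
    """Percent change from before to after, relative to before."""
    if before == 0:
        raise ValueError("relative change is undefined when before is 0")
    return (after - before) / before * 100.0

def verdict(before, after):
    """Label an error-rate change the way the listing above reads."""
    if after == before:
        return "tied"
    tag = "won" if after < before else "lost"
    return "%s %+.2f%%" % (tag, relative_change(before, after))

print(verdict(20, 29))        # total unique fp, 0.1 run vs 0.0 run
print(verdict(42, 38))        # total unique fn, 0.1 run vs 0.0 run
print(verdict(0.250, 0.250))  # an unchanged fold
```

The same subtraction explains the last line of the output: the ham/spam mean difference is the overall spam mean minus the overall ham mean, 99.40 - 0.59 = 98.81 for the first run and 99.48 - 0.45 = 99.03 for the second.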
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/