[Spambayes] chi-squared versus "prob strength"

Rob Hooft rob@hooft.net
Sun, 13 Oct 2002 09:27:11 +0200


Tim Peters wrote:
> Note the default robinson_minimum_prob_strength is still 0.1, meaning that
> we ignore words with spamprobs in 0.4 to 0.6.
> 
> Since the chi-squared test is testing the hypothesis that the probs are
> uniformly distributed, systematically leaving a chunk of probs "out of the
> middle" may bias it.
> 
> Rerunning my fat test with this option set to 0.0 (don't ignore any words)
> gave nearly identical final results, but I didn't like the fine-grained
> differences.  

Here is my cmp run for this. The first column is with
robinson_minimum_prob_strength at 0.1, the second with 0.0.
The score distributions are tighter. Is this because we now have more
clues, so the chi-squared distribution is more decisive?
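
For readers following along, here is a minimal Python sketch of how the
cutoff interacts with chi-squared combining. It is an illustration under
stated assumptions, not the project's actual code: the names chi2q and
chi_combine and the sample probabilities are made up for this example.
The only thing taken from the discussion above is that words whose
spamprob lies within the minimum strength of 0.5 are dropped before the
remaining probabilities are combined.

```python
import math

def chi2q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    if v <= 0 or v % 2:
        raise ValueError("v must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)

def chi_combine(probs, min_strength=0.1):
    """Combine per-word spam probabilities into one score in [0, 1].

    Hypothetical helper: words with |p - 0.5| < min_strength are ignored.
    """
    clues = [p for p in probs if abs(p - 0.5) >= min_strength]
    n = len(clues)
    if n == 0:
        return 0.5  # no evidence either way
    # Clamp so log() never sees 0.
    clues = [min(max(p, 1e-6), 1.0 - 1e-6) for p in clues]
    ln_spam = sum(math.log(1.0 - p) for p in clues)
    ln_ham = sum(math.log(p) for p in clues)
    # Each sum is tested against a chi-squared distribution with 2n
    # degrees of freedom, so n depends on how many words survive the cutoff.
    S = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    H = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    return (S - H + 1.0) / 2.0

# Made-up probabilities: three strong clues and three bland ones.
example = [0.99, 0.95, 0.55, 0.45, 0.52, 0.02]
print(chi_combine(example, min_strength=0.1))  # bland words ignored
print(chi_combine(example, min_strength=0.0))  # every word counted
```

With min_strength at 0.0 every word is kept, so n grows and both tests
run with more degrees of freedom.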

cv2s -> cv3s
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
[...]
-> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams

false positive percentages
     0.062  0.188  lost  +203.23%
     0.312  0.438  lost   +40.38%
     0.062  0.125  lost  +101.61%
     0.062  0.125  lost  +101.61%
     0.062  0.125  lost  +101.61%
     0.062  0.062  tied
     0.250  0.250  tied
     0.125  0.188  lost   +50.40%
     0.250  0.312  lost   +24.80%
     0.000  0.000  tied

won   0 times
tied  3 times
lost  7 times

total unique fp went from 20 to 29 lost   +45.00%
mean fp % went from 0.125 to 0.18125 lost   +45.00%

false negative percentages
     1.034  1.034  tied
     0.345  0.345  tied
     0.517  0.345  won    -33.27%
     0.517  0.517  tied
     1.207  1.207  tied
     0.862  0.690  won    -19.95%
     0.862  0.690  won    -19.95%
     0.345  0.345  tied
     0.517  0.517  tied
     1.034  0.862  won    -16.63%

won   4 times
tied  6 times
lost  0 times

total unique fn went from 42 to 38 won     -9.52%
mean fn % went from 0.724137931034 to 0.655172413793 won     -9.52%

ham mean                     ham sdev
    0.52    0.39  -25.00%        4.49    4.46   -0.67%
    0.72    0.60  -16.67%        6.62    6.59   -0.45%
    0.63    0.45  -28.57%        4.83    4.42   -8.49%
    0.60    0.41  -31.67%        4.83    4.51   -6.63%
    0.52    0.36  -30.77%        4.26    4.06   -4.69%
    0.43    0.31  -27.91%        4.21    3.82   -9.26%
    0.64    0.52  -18.75%        5.75    5.72   -0.52%
    0.68    0.51  -25.00%        5.63    5.39   -4.26%
    0.70    0.62  -11.43%        5.71    6.13   +7.36%
    0.41    0.31  -24.39%        3.65    3.24  -11.23%

ham mean and sdev for all runs
    0.59    0.45  -23.73%        5.07    4.94   -2.56%

spam mean                    spam sdev
   99.20   99.32   +0.12%        6.10    5.77   -5.41%
   99.70   99.71   +0.01%        3.45    3.80  +10.14%
   99.55   99.68   +0.13%        3.63    3.23  -11.02%
   99.38   99.44   +0.06%        6.34    6.27   -1.10%
   99.14   99.19   +0.05%        7.05    7.05   +0.00%
   99.40   99.47   +0.07%        4.72    5.24  +11.02%
   99.42   99.50   +0.08%        5.09    5.10   +0.20%
   99.41   99.51   +0.10%        4.55    4.99   +9.67%
   99.48   99.62   +0.14%        3.81    3.20  -16.01%
   99.31   99.39   +0.08%        6.09    5.97   -1.97%

spam mean and sdev for all runs
   99.40   99.48   +0.08%        5.22    5.21   -0.19%

ham/spam mean difference: 98.81 99.03 +0.22
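
The percentage columns in the output above are consistent with a plain
relative change computed from the rounded values as printed. A small
sketch, with pct_change as a hypothetical helper name rather than
anything taken from cmp.py:

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    if before == after:
        return 0.0  # shown as "tied" in the listing
    if before == 0:
        raise ValueError("relative change is undefined when before is 0")
    return (after - before) / before * 100.0

# First false positive row and third false negative row from the listing.
print("%+.2f%%" % pct_change(0.062, 0.188))  # +203.23%
print("%+.2f%%" % pct_change(0.517, 0.345))  # -33.27%
```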


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/