[Spambayes] chi-squared versus "prob strength"

Tim Peters tim.one@comcast.net
Sun, 13 Oct 2002 04:13:47 -0400


[Tim]
> Note the default robinson_minimum_prob_strength is still 0.1 ...
> ...
> Rerunning my fat test with this option set to 0.0 (don't ignore
> any words) gave nearly identical final results, but I didn't like
> the fine-grained differences.

[Rob Hooft]
> Here is my cmp run for this. First is with 0.1, second with 0.0.
> Distributions are tighter. Is this due to the fact that we have more
> clues now, so the Chi2 distribution is more decisive?

It's been my belief that bland words are at best worthless as clues, and at
worst actively hurt (experiment:  fiddle your favorite scheme to look *only*
at the bland words; do they have predictive power?).  I think this is one of
the schemes where they hurt, for the reason illustrated by the tiny example at
the end of my original post:

"""
>>> from chi2 import showscore as s

>>> s([.2, .8, .9])
P(chisq >=    8.27033 | v=  6) =   0.218959
P(chisq >=    3.87588 | v=  6) =   0.693468
spam prob 0.781040515476
 ham prob 0.306531778646
  S/(S+H) 0.71815043441

>>> s([.2, .8, .9] + [0.5] * 10)
P(chisq >=    22.1333 | v= 26) =   0.681383
P(chisq >=    17.7388 | v= 26) =   0.885068
spam prob 0.318617174026
 ham prob 0.114932197304
  S/(S+H) 0.734904015772

>>>

I can't love that adding a pile of 100% neutral probs intensifies the spam
judgment, and under the covers the effects on S and H are seen to be
dramatic.  Yes, "it's even more not uniformly distributed" after adding in
10 0.5s, but that's really got nothing to do with whether the msg is ham or
spam!
"""

The hypothesis that the spamprobs are uniformly distributed seems irrelevant
to whether a msg is ham or spam, and dumping bland words in acts to reject
the hypothesis for a reason that also has nothing to do with the distinction
we're *trying* to make.  The bland words seem most of all to intensify the
decision the scheme would have made anyway if they weren't included.  That
makes things more extreme, but (IMO) not for a *reasonable* reason.  I think
it's akin to taking scores below 0.1 and dividing them by 2, and taking
scores above 0.9 and adding half their distance to 1:  it makes things more
extreme, but not usefully.  Extremity for extremity's sake is no virtue
<wink>.
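
For concreteness, here's a self-contained sketch of the combining
arithmetic behind showscore -- not the chi2.py code itself, although
chi2Q below is the usual series for even degrees of freedom -- and it
reproduces the numbers in the example above:

from math import exp, log

def chi2Q(x2, v):
    # P(chisq >= x2) with v degrees of freedom, v even.
    assert v & 1 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def score(probs):
    # If the probs really were uniform random, -2*sum(ln(p)) and
    # -2*sum(ln(1-p)) would each be chi-squared with 2n degrees of
    # freedom.  The first blows up when the probs crowd 0 (hammish),
    # the second when they crowd 1 (spammish).
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
    return S, H, S / (S + H)

print(score([.2, .8, .9]))              # ~ (0.781, 0.307, 0.718)
print(score([.2, .8, .9] + [.5] * 10))  # ~ (0.319, 0.115, 0.735)

Each 0.5 adds the same -2*ln(.5) ~= 1.39 to both chi-squared statistics,
plus 2 degrees of freedom to each, dragging both S and H down toward
"looks uniform" -- but H drops proportionally faster here, so S/(S+H)
drifts *further* from 0.5.  That's the intensification complained about
above.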

> -> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
> [...]
> -> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
>
> false positive percentages
>      0.062  0.188  lost  +203.23%
>      0.312  0.438  lost   +40.38%
>      0.062  0.125  lost  +101.61%
>      0.062  0.125  lost  +101.61%
>      0.062  0.125  lost  +101.61%
>      0.062  0.062  tied
>      0.250  0.250  tied
>      0.125  0.188  lost   +50.40%
>      0.250  0.312  lost   +24.80%
>      0.000  0.000  tied
>
> won   0 times
> tied  3 times
> lost  7 times
>
> total unique fp went from 20 to 29 lost   +45.00%
> mean fp % went from 0.125 to 0.18125 lost   +45.00%
>
> false negative percentages
>      1.034  1.034  tied
>      0.345  0.345  tied
>      0.517  0.345  won    -33.27%
>      0.517  0.517  tied
>      1.207  1.207  tied
>      0.862  0.690  won    -19.95%
>      0.862  0.690  won    -19.95%
>      0.345  0.345  tied
>      0.517  0.517  tied
>      1.034  0.862  won    -16.63%
>
> won   4 times
> tied  6 times
> lost  0 times
>
> total unique fn went from 42 to 38 won     -9.52%
> mean fn % went from 0.724137931034 to 0.655172413793 won     -9.52%
>
> ham mean                     ham sdev
>     0.52    0.39  -25.00%        4.49    4.46   -0.67%
>     0.72    0.60  -16.67%        6.62    6.59   -0.45%
>     0.63    0.45  -28.57%        4.83    4.42   -8.49%
>     0.60    0.41  -31.67%        4.83    4.51   -6.63%
>     0.52    0.36  -30.77%        4.26    4.06   -4.69%
>     0.43    0.31  -27.91%        4.21    3.82   -9.26%
>     0.64    0.52  -18.75%        5.75    5.72   -0.52%
>     0.68    0.51  -25.00%        5.63    5.39   -4.26%
>     0.70    0.62  -11.43%        5.71    6.13   +7.36%
>     0.41    0.31  -24.39%        3.65    3.24  -11.23%
>
> ham mean and sdev for all runs
>     0.59    0.45  -23.73%        5.07    4.94   -2.56%

Because the ham distribution got tighter and closer to 0, you need a larger
spam_cutoff now.  A too-low spam_cutoff probably explains both the increase
in FP rate and the decrease in FN rate.
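
A quick way to pick the new cutoff is to sweep candidates over the
per-message S/(S+H) scores from a cv run.  sweep_cutoffs below is just an
illustrative helper, not anything in the test driver:

def sweep_cutoffs(ham_scores, spam_scores, cutoffs):
    # ham_scores and spam_scores are S/(S+H) values in [0.0, 1.0]; a ham
    # at or above the cutoff is a false positive, a spam below it is a
    # false negative.  Illustrative only -- not part of the test driver.
    for cutoff in cutoffs:
        fp = sum(1 for s in ham_scores if s >= cutoff)
        fn = sum(1 for s in spam_scores if s < cutoff)
        print("cutoff %.2f  fp%% %.3f  fn%% %.3f" % (
            cutoff,
            100.0 * fp / len(ham_scores),
            100.0 * fn / len(spam_scores)))

Feeding it the scores behind the tables above at, say,
[0.50, 0.60, 0.70, 0.80, 0.90, 0.95] would show the FP/FN tradeoff at each
candidate directly.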

> spam mean                    spam sdev
>    99.20   99.32   +0.12%        6.10    5.77   -5.41%
>    99.70   99.71   +0.01%        3.45    3.80  +10.14%
>    99.55   99.68   +0.13%        3.63    3.23  -11.02%
>    99.38   99.44   +0.06%        6.34    6.27   -1.10%
>    99.14   99.19   +0.05%        7.05    7.05   +0.00%
>    99.40   99.47   +0.07%        4.72    5.24  +11.02%
>    99.42   99.50   +0.08%        5.09    5.10   +0.20%
>    99.41   99.51   +0.10%        4.55    4.99   +9.67%
>    99.48   99.62   +0.14%        3.81    3.20  -16.01%
>    99.31   99.39   +0.08%        6.09    5.97   -1.97%
>
> spam mean and sdev for all runs
>    99.40   99.48   +0.08%        5.22    5.21   -0.19%
>
> ham/spam mean difference: 98.81 99.03 +0.22

I saw the same thing (qualitatively), and it's at least curious:  ham mean
and sdev consistently decrease; spam mean consistently increases, but less
so; and the effects on spam sdev are a mixed bag, with almost no net effect
when averaged out.  BTW, with max_discriminators=150, you *may* have many
hams that didn't have 150 unique extreme words, and in that case no longer
ignoring the bland words may have a large effect similar to the one in the
example above.
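
To make that concrete, the clue selection amounts to something like the
sketch below -- a paraphrase of the idea, not the classifier's actual
code -- where robinson_minimum_prob_strength is the floor on how bland a
word can be and still count, and max_discriminators caps how many clues
get combined:

def select_clues(word_probs, max_discriminators=150, min_prob_strength=0.1):
    # word_probs: spamprobs of the distinct words found in one message.
    # Keep the probs farthest from 0.5 (the extreme ones), at most
    # max_discriminators of them, skipping anything blander than
    # min_prob_strength.  A paraphrase, not the classifier's code.
    strong = [(abs(p - 0.5), p) for p in word_probs
              if abs(p - 0.5) >= min_prob_strength]
    strong.sort(reverse=True)
    return [p for dist, p in strong[:max_discriminators]]

With min_prob_strength=0.0, a ham that has only a few dozen words stronger
than 0.1 gets the rest of its 150 slots padded out with near-0.5 words --
which is exactly the [.2, .8, .9] + [0.5]*10 situation in the example at
the top.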