[Spambayes] spamprob combining

Wed, 09 Oct 2002 20:34:15 -0400

[Tim]
> ...
> Intuitively, it *seems* like it would be good to get something not so
> insanely sensitive to random input as Paul-combining, but more
> sensitive to overwhelming amounts of evidence than Gary-combining.

So there's a new option,

[Classifier]
use_tim_combining: True

The comments (from Options.py) explain it:

# For the default scheme, use "tim-combining" of probabilities.  This
# has no effect under the central-limit schemes.  Tim-combining is a
# kind of cross between Paul Graham's and Gary Robinson's combining
# schemes.  Unlike Paul's, it's never crazy-certain, and compared to
# Gary's, in Tim's tests it greatly increased the spread between mean
# ham-scores and spam-scores, while simultaneously decreasing the
# variance of both.  Tim needed a higher spam_cutoff value for best
# results, but spam_cutoff is less touchy than under Gary-combining.
use_tim_combining: False

"Tim combining" simply takes the geometric mean of the spamprobs as a
measure of spamminess S, and the geometric mean of 1-spamprob as a measure
of hamminess H, then returns S/(S+H) as "the score".  This is well-behaved
when fed random, uniformly distributed probabilities, but isn't reluctant to
let an overwhelming number of extreme clues lead it to an extreme conclusion
(although you're not going to see it give Graham-like 1e-30 or
1.0000000000000 scores).
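
For anyone who'd rather read code than prose, here is a minimal sketch of the combining rule just described.  It is not the code in classifier.py; the function names are made up for illustration, it works in log space to avoid underflow, and it assumes every spamprob is strictly between 0 and 1.  The Graham-style product rule is included only for contrast.

```python
import math

def tim_combining(probs):
    """Return S/(S+H), where S and H are geometric means (sketch only)."""
    n = len(probs)
    # Averaging logs gives the log of the geometric mean and avoids
    # underflow on long clue lists.  Assumes 0 < p < 1 for every p.
    log_s = sum(math.log(p) for p in probs) / n
    log_h = sum(math.log(1.0 - p) for p in probs) / n
    s = math.exp(log_s)  # geometric mean of the spamprobs
    h = math.exp(log_h)  # geometric mean of 1 - spamprob
    return s / (s + h)

def graham_combining(probs):
    """Graham-style product combining, shown for contrast only."""
    p = math.prod(probs)
    q = math.prod(1.0 - x for x in probs)
    return p / (p + q)

# Fifty strong spam clues: the product rule saturates, the
# geometric-mean rule stays near the individual clue strength.
clues = [0.99] * 50
print(graham_combining(clues))  # 1.0 in double precision
print(tim_combining(clues))     # about 0.99
```

Note that if every clue has the same spamprob p, this rule just returns p, so it can never be more extreme than the most extreme clue it is fed.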

Don't use a central-limit scheme with this (it has no effect on those).  If
you test it, use whatever variations on the "all default" scheme you usually
use, but it will probably help to boost spam_cutoff.  Note that the default
max_discriminators is still 150, and that's what I used below.

Here's a 10-set cross-validation run on my data, restricted to 100 ham and
100 spam per set, with all defaults, except

                    before   after
                    ------   -----
use_tim_combining   False    True
spam_cutoff         0.55     0.615

-> <stat> tested 100 hams & 100 spams against 900 hams & 900 spams
   [ditto 19 times]

false positive percentages
    0.000  0.000  tied
    1.000  0.000  won   -100.00%
    1.000  1.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 2 to 1 won    -50.00%
mean fp % went from 0.2 to 0.1 won    -50.00%

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 1 to 1 tied
mean fn % went from 0.1 to 0.1 tied

The real story here is in the score distributions; contrary to what the
comment said above, the ham-score variance increased with this little data:

ham mean                     ham sdev
  30.63   18.80  -38.62%        6.03    6.83  +13.27%
  29.31   17.35  -40.81%        5.48    6.84  +24.82%
  29.96   18.50  -38.25%        6.95    9.02  +29.78%
  29.66   18.12  -38.91%        5.89    6.81  +15.62%
  29.51   17.34  -41.24%        5.73    6.71  +17.10%
  29.40   17.43  -40.71%        5.73    6.61  +15.36%
  29.75   17.74  -40.37%        5.76    6.96  +20.83%
  29.71   18.17  -38.84%        5.97    6.48   +8.54%
  31.98   20.41  -36.18%        5.96    8.02  +34.56%
  29.83   18.11  -39.29%        4.75    5.41  +13.89%

ham mean and sdev for all runs
  29.97   18.20  -39.27%        5.90    7.08  +20.00%

spam mean                    spam sdev
  79.23   88.38  +11.55%        6.96    5.52  -20.69%
  79.40   88.70  +11.71%        7.00    5.64  -19.43%
  78.68   88.06  +11.92%        6.69    5.13  -23.32%
  79.65   89.01  +11.75%        7.20    5.22  -27.50%
  79.91   88.87  +11.21%        6.35    4.67  -26.46%
  80.47   89.16  +10.80%        7.22    6.06  -16.07%
  80.94   89.78  +10.92%        6.60    4.45  -32.58%
  80.30   89.41  +11.34%        6.95    5.49  -21.01%
  78.54   87.70  +11.66%        7.30    6.45  -11.64%
  80.06   89.06  +11.24%        6.98    5.43  -22.21%

spam mean and sdev for all runs
  79.72   88.81  +11.40%        6.97    5.47  -21.52%

ham/spam mean difference: 49.75 70.61 +20.86

So before, the score equidistant from both means (counting distance in each
population's own sdevs) was 52.78, at 3.87 sdevs from each; after, it was
58.03, at 5.63 sdevs from each.  The populations are much better separated
by this measure.
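
The equidistant score is simple arithmetic: find the k where ham_mean + k*ham_sdev meets spam_mean - k*spam_sdev.  A small sketch (the helper name is hypothetical, and it is not part of the test driver):

```python
def equidistant_score(ham_mean, ham_sdev, spam_mean, spam_sdev):
    """Return (score, k): the score lying k sdevs from each mean."""
    # Solve ham_mean + k*ham_sdev == spam_mean - k*spam_sdev for k.
    k = (spam_mean - ham_mean) / (ham_sdev + spam_sdev)
    return ham_mean + k * ham_sdev, k

# "Before" numbers from the all-runs summary lines above.
score, k = equidistant_score(29.97, 5.90, 79.72, 6.97)
print(f"{score:.2f} at {k:.2f} sdevs")  # 52.78 at 3.87 sdevs
```

Feeding it the "after" summary (18.20, 7.08, 88.81, 5.47) gives 58.03 at 5.63 sdevs.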

Histograms before:

-> <stat> Ham scores for all runs: 1000 items; mean 29.97; sdev 5.90
-> <stat> min 13.521; median 29.6919; max 60.8937
* = 2 items
...
 13  2 *
 14  0
 15  2 *
 16  8 ****
 17  4 **
 18  9 *****
 19 17 *********
 20 14 *******
 21 16 ********
 22 24 ************
 23 38 *******************
 24 47 ************************
 25 62 *******************************
 26 65 *********************************
 27 69 ***********************************
 28 73 *************************************
 29 70 ***********************************
 30 76 **************************************
 31 70 ***********************************
 32 61 *******************************
 33 51 **************************
 34 50 *************************
 35 34 *****************
 36 30 ***************
 37 27 **************
 38 18 *********
 39 12 ******
 40 11 ******
 41 13 *******
 42  2 *
 43  5 ***
 44  8 ****
 45  2 *
 46  1 *
 47  3 **
 48  1 *
 49  0
 50  3 **
 51  0
 52  0
 53  0
 54  0
 55  1 *
 56  0
 57  0
 58  0
 59  0
 60  1 *
...

-> <stat> Spam scores for all runs: 1000 items; mean 79.72; sdev 6.97
-> <stat> min 52.3428; median 79.9799; max 98.1879
* = 2 items
...
 52  1 *
 53  0
 54  0
 55  0
 56  3 **
 57  1 *
 58  0
 59  1 *
 60  4 **
 61  4 **
 62  4 **
 63  3 **
 64  4 **
 65  7 ****
 66  9 *****
 67 10 *****
 68 13 *******
 69 16 ********
 70 26 *************
 71 18 *********
 72 29 ***************
 73 35 ******************
 74 40 ********************
 75 39 ********************
 76 56 ****************************
 77 52 **************************
 78 50 *************************
 79 76 **************************************
 80 60 ******************************
 81 77 ***************************************
 82 45 ***********************
 83 61 *******************************
 84 50 *************************
 85 43 **********************
 86 41 *********************
 87 33 *****************
 88 19 **********
 89 11 ******
 90 11 ******
 91  8 ****
 92  2 *
 93  9 *****
 94  4 **
 95  9 *****
 96  2 *
 97 11 ******
 98  3 **
 99  0

Histograms after:

-> <stat> Ham scores for all runs: 1000 items; mean 18.20; sdev 7.08
-> <stat> min 5.6946; median 17.1757; max 73.1302
* = 2 items
...
  5  1 *
  6 13 *******
  7 16 ********
  8 25 *************
  9 22 ***********
 10 37 *******************
 11 45 ***********************
 12 56 ****************************
 13 70 ***********************************
 14 61 *******************************
 15 66 *********************************
 16 79 ****************************************
 17 63 ********************************
 18 59 ******************************
 19 59 ******************************
 20 56 ****************************
 21 47 ************************
 22 36 ******************
 23 37 *******************
 24 32 ****************
 25  9 *****
 26 20 **********
 27 17 *********
 28  8 ****
 29  7 ****
 30 11 ******
 31  6 ***
 32  7 ****
 33  5 ***
 34  4 **
 35  2 *
 36  2 *
 37  6 ***
 38  1 *
 39  0
 40  3 **
 41  3 **
 42  0
 43  1 *
 44  1 *
 45  1 *
 46  0
 47  1 *
 48  0
 49  0
 50  2 *
 51  1 *
 52  0
 53  0
 54  0
 55  0
 56  0
 57  0
 58  0
 59  0
 60  0
 61  1 *
 62  0
 63  0
 64  0
 65  0
 66  0
 67  0
 68  0
 69  0
 70  0
 71  0
 72  0
 73  1 *

-> <stat> Spam scores for all runs: 1000 items; mean 88.81; sdev 5.47
-> <stat> min 54.9382; median 89.5188; max 98.3805
* = 2 items
...
 54   1 *
 55   0
 56   0
 57   0
 58   0
 59   0
 60   0
 61   0
 62   0
 63   1 *
 64   3 **
 65   0
 66   1 *
 67   0
 68   2 *
 69   2 *
 70   3 **
 71   3 **
 72   2 *
 73   2 *
 74   4 **
 75   4 **
 76   6 ***
 77   8 ****
 78   8 ****
 79   6 ***
 80  12 ******
 81  25 *************
 82  26 *************
 83  25 *************
 84  39 ********************
 85  58 *****************************
 86  70 ***********************************
 87  64 ********************************
 88  74 *************************************
 89 106 *****************************************************
 90  85 *******************************************
 91  62 *******************************
 92  86 *******************************************
 93  79 ****************************************
 94  37 *******************
 95  23 ************
 96  42 *********************
 97  25 *************
 98   6 ***
 99   0

There are snaky tails in either case, but "the middle ground" here is
larger, sparser, and still contains the errors.

On my full test data, which I actually ran first, you can ignore the
"won/lost" business:  I had spam_cutoff at 0.55 for both runs, and the
overall results would have been virtually identical had I boosted
spam_cutoff in the second run.  (Recall that I can't demonstrate an
improvement on this data anymore!  I can only determine whether something is
a disaster, and this ain't.)

-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
   [ditto 19 times]
...
false positive percentages
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.000  tied
    0.050  0.100  lost  +100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied

won   0 times
tied  6 times
lost  4 times

total unique fp went from 2 to 6 lost  +200.00%
mean fp % went from 0.01 to 0.03 lost  +200.00%

false negative percentages
    0.000  0.000  tied
    0.071  0.071  tied
    0.000  0.000  tied
    0.071  0.071  tied
    0.143  0.071  won    -50.35%
    0.143  0.000  won   -100.00%
    0.143  0.143  tied
    0.143  0.000  won   -100.00%
    0.071  0.000  won   -100.00%
    0.000  0.000  tied

won   4 times
tied  6 times
lost  0 times

total unique fn went from 11 to 5 won    -54.55%
mean fn % went from 0.0785714285714 to 0.0357142857143 won    -54.55%

ham mean                     ham sdev
  25.65   10.68  -58.36%        5.67    5.44   -4.06%
  25.61   10.68  -58.30%        5.50    5.29   -3.82%
  25.57   10.68  -58.23%        5.67    5.49   -3.17%
  25.66   10.71  -58.26%        5.54    5.27   -4.87%
  25.42   10.55  -58.50%        5.72    5.71   -0.17%
  25.51   10.43  -59.11%        5.39    5.11   -5.19%
  25.65   10.40  -59.45%        5.59    5.29   -5.37%
  25.61   10.51  -58.96%        5.41    5.21   -3.70%
  25.84   10.80  -58.20%        5.48    5.30   -3.28%
  25.81   10.85  -57.96%        5.81    5.73   -1.38%

ham mean and sdev for all runs
  25.63   10.63  -58.53%        5.58    5.39   -3.41%

spam mean                    spam sdev
  83.86   93.17  +11.10%        7.09    4.55  -35.83%
  83.64   93.16  +11.38%        6.83    4.52  -33.82%
  83.27   92.91  +11.58%        6.81    4.52  -33.63%
  83.82   93.14  +11.12%        6.88    4.67  -32.12%
  83.89   93.29  +11.21%        6.65    4.56  -31.43%
  83.78   93.11  +11.14%        6.96    4.72  -32.18%
  83.42   93.00  +11.48%        6.82    4.74  -30.50%
  83.86   93.29  +11.24%        6.71    4.55  -32.19%
  83.88   93.22  +11.13%        6.98    4.71  -32.52%
  83.75   93.28  +11.38%        6.65    4.32  -35.04%

spam mean and sdev for all runs
  83.72   93.16  +11.28%        6.84    4.59  -32.89%

ham/spam mean difference: 58.09 82.53 +24.44

So the equidistant score changed from 51.73 at 4.68 sdevs from each mean, to
55.20 at 8.27 sdevs from each.  That's big.

The "after" histograms had 200 buckets in this run:

-> <stat> Ham scores for all runs: 20000 items; mean 10.63; sdev 5.39
-> <stat> min 0.281945; median 9.69929; max 81.9673
* = 17 items
 0.0   7 *
 0.5  13 *
 1.0  21 **
 1.5  41 ***
 2.0  86 ******
 2.5 166 **********
 3.0 239 ***************
 3.5 326 ********************
 4.0 466 ****************************
 4.5 554 *********************************
 5.0 642 **************************************
 5.5 701 ******************************************
 6.0 793 ***********************************************
 6.5 804 ************************************************
 7.0 933 *******************************************************
 7.5 972 **********************************************************
 8.0 997 ***********************************************************
 8.5 934 *******************************************************
 9.0 947 ********************************************************
 9.5 939 ********************************************************
10.0 839 **************************************************
10.5 786 ***********************************************
11.0 752 *********************************************
11.5 760 *********************************************
12.0 636 **************************************
12.5 606 ************************************
13.0 554 *********************************
13.5 483 *****************************
14.0 461 ****************************
14.5 399 ************************
15.0 360 **********************
15.5 317 *******************
16.0 275 *****************
16.5 224 **************
17.0 193 ************
17.5 169 **********
18.0 172 ***********
18.5 154 **********
19.0 153 *********
19.5  92 ******
20.0 104 *******
20.5  99 ******
21.0  74 *****
21.5  73 *****
22.0  73 *****
22.5  50 ***
23.0  38 ***
23.5  50 ***
24.0  38 ***
24.5  34 **
25.0  26 **
25.5  39 ***
26.0  24 **
26.5  34 **
27.0  18 **
27.5  15 *
28.0  20 **
28.5  15 *
29.0  14 *
29.5  15 *
30.0  12 *
30.5  15 *
31.0  14 *
31.5  10 *
32.0  12 *
32.5   6 *
33.0  10 *
33.5   4 *
34.0   8 *
34.5   5 *
35.0   5 *
35.5   6 *
36.0   7 *
36.5   4 *
37.0   2 *
37.5   3 *
38.0   1 *
38.5   4 *
39.0   6 *
39.5   2 *
40.0   2 *
40.5   5 *
41.0   0
41.5   2 *
42.0   3 *
42.5   3 *
43.0   1 *
43.5   2 *
44.0   1 *
44.5   2 *
45.0   1 *
45.5   1 *
46.0   2 *
46.5   0
47.0   3 *
47.5   0
48.0   1 *
48.5   1 *
49.0   1 *
49.5   0
50.0   1 *
50.5   0
51.0   2 *
51.5   0
52.0   1 *
52.5   0
53.0   0
53.5   1 *
54.0   1 *
54.5   2 *
55.0   0
55.5   0
56.0   1 *
56.5   1 *
57.0   0
57.5   0
58.0   0
58.5   1 *
59.0   0
59.5   0
60.0   0
60.5   0
61.0   1 *
61.5   0
62.0   0
62.5   0
63.0   0
63.5   0
64.0   0
64.5   0
65.0   0
65.5   0
66.0   0
66.5   0
67.0   0
67.5   0
68.0   0
68.5   0
69.0   0
69.5   0
70.0   1 *  the lady with the long & obnoxious employer-generated sig
70.5   0
71.0   0
71.5   0
72.0   0
72.5   0
73.0   0
73.5   0
74.0   0
74.5   0
75.0   0
75.5   0
76.0   0
76.5   0
77.0   0
77.5   0
78.0   0
78.5   0
79.0   0
79.5   0
80.0   0
80.5   0
81.0   0
81.5   1 *  the verbatim quote of a long Nigerian-scam spam
...

-> <stat> Spam scores for all runs: 14000 items; mean 93.16; sdev 4.59
-> <stat> min 24.3497; median 93.8141; max 99.6769
* = 15 items
...
24.0   1 *  not really sure -- it's a giant base64-encoded plain text file
24.5   0
25.0   0
25.5   0
26.0   0
26.5   0
27.0   0
27.5   0
28.0   0
28.5   0
29.0   1 *  the spam with the uuencoded body we throw away
29.5   0
30.0   0
30.5   0
31.0   0
31.5   0
32.0   0
32.5   0
33.0   0
33.5   0
34.0   0
34.5   0
35.0   0
35.5   0
36.0   0
36.5   0
37.0   0
37.5   0
38.0   0
38.5   0
39.0   0
39.5   0
40.0   0
40.5   0
41.0   0
41.5   0
42.0   0
42.5   0
43.0   0
43.5   0
44.0   0
44.5   0
45.0   0
45.5   0
46.0   1 *  Hello, my Name is BlackIntrepid
46.5   0
47.0   0
47.5   0
48.0   0
48.5   0
49.0   0
49.5   0
50.0   0
50.5   0
51.0   0
51.5   0
52.0   0
52.5   0
53.0   0
53.5   1 *  unclear; a collection of webmaster links
54.0   1 *  Susan makes a propsal (sic) to Tim
54.5   0
55.0   1 *
55.5   0
56.0   0
56.5   1 *
57.0   2 *
57.5   0
58.0   0
58.5   1 *
59.0   0
59.5   0
60.0   1 *
60.5   2 *
61.0   1 *
61.5   1 *
62.0   0
62.5   1 *
63.0   1 *
63.5   0
64.0   1 *
64.5   1 *
65.0   0
65.5   1 *
66.0   1 *
66.5   2 *
67.0   4 *
67.5   2 *
68.0   0
68.5   1 *
69.0   0
69.5   3 *
70.0   1 *
70.5   5 *
71.0   5 *
71.5   3 *
72.0   4 *
72.5   3 *
73.0   3 *
73.5   6 *
74.0   3 *
74.5   4 *
75.0   8 *
75.5   8 *
76.0  10 *
76.5  10 *
77.0  10 *
77.5  17 **
78.0  14 *
78.5  27 **
79.0  16 **
79.5  23 **
80.0  28 **
80.5  29 **
81.0  37 ***
81.5  37 ***
82.0  46 ****
82.5  55 ****
83.0  47 ****
83.5  53 ****
84.0  58 ****
84.5  68 *****
85.0  86 ******
85.5 118 ********
86.0 135 *********
86.5 159 ***********
87.0 165 ***********
87.5 178 ************
88.0 209 **************
88.5 231 ****************
89.0 299 ********************
89.5 391 ***************************
90.0 425 *****************************
90.5 402 ***************************
91.0 501 **********************************
91.5 582 ***************************************
92.0 636 *******************************************
92.5 667 *********************************************
93.0 713 ************************************************
93.5 685 **********************************************
94.0 610 *****************************************
94.5 621 ******************************************
95.0 721 *************************************************
95.5 735 *************************************************
96.0 870 **********************************************************
96.5 742 **************************************************
97.0 449 ******************************
97.5 447 ******************************
98.0 556 **************************************
98.5 561 **************************************
99.0 264 ******************
99.5 171 ************

The mistakes are all familiar; the good news is that "the normal cases" are
far removed from what might plausibly be called a middle ground.  For
example, if we called the region from 40 thru 70 here "the middle ground",
and kicked those out for manual review, there would be very few msgs to
review, but they would contain almost all the mistakes.
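
A sketch of what such a three-way split might look like, using the 40 and 70 boundaries from the example above on the 0-100 scale of the histograms.  Nothing like this exists in the code right now; the function name, the parameter names, and the sample scores are all made up for illustration.

```python
from collections import Counter

def triage(score, low=40.0, high=70.0):
    """Bucket a 0-100 score as 'ham', 'review', or 'spam' (sketch only)."""
    if score < low:
        return "ham"
    if score > high:
        return "spam"
    # Anything from low thru high goes to a human.
    return "review"

# Made-up scores, just to show the shape of the result.
sample_scores = [9.7, 10.6, 17.2, 46.0, 54.0, 89.5, 93.8, 96.0]
print(Counter(triage(s) for s in sample_scores))
```

The boundaries are inclusive on both ends here, so a score of exactly 40 or exactly 70 goes to review.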

How does this do on your data?  I'm in favor of what works <wink>.