[Spambayes] spamprob combining

Tim Peters tim@zope.com
Tue, 8 Oct 2002 13:02:03 -0400


The attached sets up an experiment:

    create a vector of 50 "probabilities" at random, uniformly
    distributed in (0.0, 1.0)

    combine them using Paul Graham's scheme, and using Gary
    Robinson's scheme

    record the results

    repeat 5000 times
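
Since the attachment was scrubbed from the archive, here is a minimal
sketch of the steps above.  It is not the original combine.py.  It assumes
Graham-combining is prod(p) / (prod(p) + prod(1-p)) and that
Robinson-combining is the geometric-mean form (P - Q) / (P + Q) rescaled
to 0..1; the function names, the n_forced parameter and the seed are made
up for illustration.

```python
import math
import random
import statistics


def graham_combine(probs):
    """Assumed Graham scheme: prod(p) / (prod(p) + prod(1 - p))."""
    prod_p = math.prod(probs)
    prod_q = math.prod(1.0 - p for p in probs)
    return prod_p / (prod_p + prod_q)


def robinson_combine(probs):
    """Assumed Robinson scheme, built on geometric means of p and 1 - p."""
    n = len(probs)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)      # in -1 .. 1
    return (1.0 + S) / 2.0     # rescaled to 0 .. 1


def run_experiment(n_random=50, n_forced=0, trials=5000, seed=None):
    """Combine `trials` vectors of uniform random probs, each extended
    with `n_forced` extra probs fixed at 0.99, under both schemes."""
    rng = random.Random(seed)
    graham, robinson = [], []
    for _ in range(trials):
        probs = [rng.random() for _ in range(n_random)] + [0.99] * n_forced
        graham.append(graham_combine(probs))
        robinson.append(robinson_combine(probs))
    return graham, robinson


if __name__ == "__main__":
    for forced in (0, 1, 10, 50):
        g, r = run_experiment(n_forced=forced, seed=12345)
        print("forced=%2d  Graham mean %.2f sdev %.2f   "
              "Robinson mean %.2f sdev %.2f"
              % (forced, statistics.mean(g), statistics.stdev(g),
                 statistics.mean(r), statistics.stdev(r)))
```

The output quoted in the rest of this message comes from the original
attachment, not from this sketch.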

The results should look familiar to those playing this game from the start:

Result for random vectors of 50 probs, + 0 forced to 0.99

Graham combining 5000 items; mean 0.50; sdev 0.47
-> <stat> min 9.54792e-022; median 0.506715; max 1
* = 35 items
0.00 2051 ***********************************************************
0.05  100 ***
0.10   75 ***
0.15   63 **
0.20   44 **
0.25   35 *
0.30   40 **
0.35   34 *
0.40   30 *
0.45   25 *
0.50   34 *
0.55   32 *
0.60   31 *
0.65   24 *
0.70   39 **
0.75   43 **
0.80   56 **
0.85   55 **
0.90  108 ****
0.95 2081 ************************************************************

Robinson combining 5000 items; mean 0.50; sdev 0.04
-> <stat> min 0.350831; median 0.500083; max 0.649056
* = 34 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35   20 *
0.40  450 **************
0.45 2027 ************************************************************
0.50 2019 ************************************************************
0.55  452 **************
0.60   32 *
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

IOW, Paul's scheme is almost always "certain" given 50 discriminators, even
in the face of random input (over 80% of the scores land in the bottom or
top bucket).  Gary's is never "certain" then.

OTOH, do the experiment all over again, but attach one prob of 0.99 to each
random vector of 50 probs.  The probs are now systematically biased:

Result for random vectors of 50 probs, + 1 forced to 0.99

Graham combining 5000 items; mean 0.65; sdev 0.45
-> <stat> min 8.36115e-021; median 0.992403; max 1
* = 47 items
0.00 1353 *****************************
0.05   92 **
0.10   50 **
0.15   42 *
0.20   40 *
0.25   35 *
0.30   26 *
0.35   31 *
0.40   32 *
0.45   31 *
0.50   23 *
0.55   29 *
0.60   30 *
0.65   31 *
0.70   45 *
0.75   33 *
0.80   58 **
0.85   84 **
0.90  113 ***
0.95 2822 *************************************************************

Robinson combining 5000 items; mean 0.51; sdev 0.04
-> <stat> min 0.377845; median 0.513446; max 0.637992
* = 42 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    2 *
0.40  181 *****
0.45 1549 *************************************
0.50 2527 *************************************************************
0.55  698 *****************
0.60   43 **
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

There's a dramatic difference in the Paul results, while the Gary results
move subtly (in comparison).
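
The size of the jump in the Paul results has a simple arithmetic reading,
assuming Graham-combining is prod(p) / (prod(p) + prod(1-p)): in odds
form the combined odds are just the product of the individual odds
p / (1-p), so one extra prob of 0.99 multiplies the odds by about 99.  A
vector that would have scored 0.5 scores 0.99 instead, which is in line
with the median moving from about 0.51 to about 0.99 above.  A small
sketch (helper names are invented for illustration):

```python
import math


def graham_combine(probs):
    """Assumed Graham scheme: prod(p) / (prod(p) + prod(1 - p))."""
    prod_p = math.prod(probs)
    prod_q = math.prod(1.0 - p for p in probs)
    return prod_p / (prod_p + prod_q)


def odds(p):
    """Convert a probability to odds in favor."""
    return p / (1.0 - p)


# A perfectly neutral vector scores 0.5; one forced 0.99 drags it to 0.99.
neutral = [0.5] * 50
print(graham_combine(neutral), graham_combine(neutral + [0.99]))

# For any vector, appending 0.99 multiplies the combined odds by odds(0.99).
base = [0.3, 0.6, 0.55, 0.4]
factor = odds(graham_combine(base + [0.99])) / odds(graham_combine(base))
print(factor, odds(0.99))
```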

If we instead attach 10 forced .99 spamprobs to each vector, the
differences are night and day:

Result for random vectors of 50 probs, + 10 forced to 0.99

Graham combining 5000 items; mean 1.00; sdev 0.01
-> <stat> min 0.213529; median 1; max 1
* = 82 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    1 *
0.25    0
0.30    1 *
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60    0
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95 4998 *************************************************************

Robinson combining 5000 items; mean 0.59; sdev 0.03
-> <stat> min 0.49794; median 0.58555; max 0.694905
* = 51 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    2 *
0.50  412 *********
0.55 3068 *************************************************************
0.60 1447 *****************************
0.65   71 **
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

It's hard to know what to make of this, especially in light of the claim
that Gary-combining has been proven to be the most sensitive possible test
for rejecting the hypothesis that a collection of probs is uniformly
distributed.  At least in this test, Paul-combining seemed far more
sensitive (even when the data is random <wink>).

Intuitively, it *seems* like it would be good to get something not so
insanely sensitive to random input as Paul-combining, but more sensitive to
overwhelming amounts of evidence than Gary-combining.  Even with 50
spamprobs forced to 0.99, the latter only moves up to an average of 0.7:

Result for random vectors of 50 probs, + 50 forced to 0.99

Graham combining 5000 items; mean 1.00; sdev 0.00
-> <stat> min 1; median 1; max 1
* = 82 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60    0
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95 5000 *************************************************************

Robinson combining 5000 items; mean 0.70; sdev 0.02
-> <stat> min 0.628976; median 0.704543; max 0.810235
* = 45 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60   40 *
0.65 2070 **********************************************
0.70 2743 *************************************************************
0.75  146 ****
0.80    1 *
0.85    0
0.90    0
0.95    0
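
A rough way to see why the Gary results level off near 0.7, assuming
Robinson-combining is the geometric-mean form (P - Q) / (P + Q) rescaled
to 0..1: for a uniform random prob the expected value of ln(p) is -1 (and
likewise for ln(1-p)), so each geometric mean can be approximated by
plugging in that average.  The sketch below does exactly that; it is an
approximation for a "typical" vector, not a rerun of the experiment, and
the function name is invented.

```python
import math


def typical_robinson(n_random=50, n_forced=0, forced_prob=0.99):
    """Approximate Robinson score for n_random uniform probs plus
    n_forced probs fixed at forced_prob, using E[ln U] = -1 for a
    uniform U on (0, 1) in place of the random log-sums."""
    n = n_random + n_forced
    mean_log_p = (-1.0 * n_random + n_forced * math.log(forced_prob)) / n
    mean_log_q = (-1.0 * n_random
                  + n_forced * math.log(1.0 - forced_prob)) / n
    P = 1.0 - math.exp(mean_log_q)   # 1 - geometric mean of (1 - p)
    Q = 1.0 - math.exp(mean_log_p)   # 1 - geometric mean of p
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0


for forced in (0, 1, 10, 50):
    print("forced=%2d  typical Robinson score %.3f"
          % (forced, typical_robinson(50, forced)))
```

Under these assumptions the approximation lands close to the medians
reported above for 0, 1, 10 and 50 forced probs.  The forced 0.99s are
averaged in with the 50 random probs rather than multiplied through, so
their pull is diluted.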

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: combine.py
Type: application/octet-stream
Size: 1294 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021008/6dd66b22/combine.exe

---------------------- multipart/mixed attachment--