[Spambayes] Central limit
Tim Peters
tim.one@comcast.net
Mon, 30 Sep 2002 23:47:38 -0400
[Tim]
> ...
> I made up a combination of "look at ratios" and "different cutoffs
> for different n" by iteratively staring at the errors and making
> stuff up.
It now appears that the "different cutoffs for different n" was just an
accident based on the specific errors I stared at.
Recall that the "certainty heuristic" was of the form:
ratio = max(abs(zham / zspam), abs(zspam / zham))
certain = ratio > cutoff
and then I went on to choose different cutoffs depending on n (n is the
number of "extreme words" found in the msg, with a maximum of 50).
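As a sketch of what that heuristic amounts to (the function name and
signature here are made up for illustration; the real code lives in the
classifier, and I'm ignoring the degenerate case of a zero z-score):

```python
def is_certain(zham, zspam, cutoff):
    """Certainty heuristic: how lopsided are the two z-scores?

    zham and zspam are the z-scores from the log-central-limit code;
    cutoff is whatever threshold we pick (possibly varying with n).
    Assumes neither z-score is exactly zero.
    """
    ratio = max(abs(zham / zspam), abs(zspam / zham))
    return ratio > cutoff
```

The max() makes the test symmetric:  it doesn't matter which of the two
z-scores is smaller, only how far apart they are.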
Here's an exhaustive account of all the times the log-central-limit code was
wrong (meaning that abs(zham) < abs(zspam) but the msg was really spam, or
that abs(zspam) < abs(zham) but the msg was really ham). This is segregated
by n (the number of extreme words). For each n, a list of all ratios in the
"but I was wrong" cases is given. The number in square brackets is the
number of predictions made with this specific value of n. The number in
curly braces is the percentage of incorrect predictions.  So, for example,
we made predictions on 35 msgs with 7 extreme words (those are very short
msgs!).  Two of those predictions were wrong (5.71% of 35); in one of those
cases ratio was 1.31, and in the other ratio was 1.72.
3: [36] {0.00%}
4: [21] {0.00%}
5: [14] {0.00%}
6: [22] {0.00%}
7: [35] {5.71%} 1.31 1.72
8: [42] {4.76%} 1.01 1.33
9: [72] {5.56%} 1.00 1.04 1.14 1.28
10: [123] {0.00%}
11: [129] {1.55%} 1.07 1.09
12: [123] {1.63%} 1.05 1.09
13: [131] {0.00%}
14: [169] {0.59%} 1.11
15: [180] {1.11%} 1.18 1.73
16: [232] {1.29%} 1.12 1.12 1.43
17: [315] {1.27%} 1.06 1.06 1.27 1.48
18: [344] {1.16%} 1.28 1.35 1.50 1.60
19: [333] {1.20%} 1.03 1.24 1.75 1.78
20: [375] {0.53%} 1.10 1.12
21: [448] {0.45%} 1.09 2.54
22: [492] {0.00%}
23: [535] {0.56%} 1.38 1.72 2.20
24: [604] {0.50%} 1.03 1.17 1.66
25: [638] {0.63%} 1.04 1.55 1.64 1.85
26: [594] {0.51%} 1.06 1.07 1.13
27: [676] {0.74%} 1.02 1.03 1.06 1.26 1.35
28: [789] {0.00%}
29: [811] {0.49%} 1.03 1.18 1.41 2.24
30: [763] {0.39%} 1.04 1.04 2.08
31: [805] {0.12%} 1.44
32: [787] {0.13%} 1.19
33: [763] {0.26%} 1.10 1.36
34: [764] {0.13%} 1.04
35: [822] {0.12%} 1.03
36: [796] {0.00%}
37: [819] {0.00%}
38: [947] {0.11%} 1.08
39: [907] {0.00%}
40: [873] {0.00%}
41: [877] {0.11%} 1.21
42: [1016] {0.00%}
43: [1005] {0.00%}
44: [1016] {0.00%}
45: [1003] {0.30%} 1.07 1.10 1.27
46: [1068] {0.09%} 1.24
47: [1019] {0.00%}
48: [1026] {0.10%} 1.15
49: [1056] {0.28%} 1.09 1.10 1.24
50: [63585] {0.07%} 1.02 1.02 1.02 1.03 1.03 1.04 1.04 1.04 1.05
1.05 1.05 1.06 1.06 1.08 1.09 1.09 1.09 1.10
1.10 1.11 1.11 1.12 1.13 1.14 1.17 1.17 1.18
1.18 1.18 1.19 1.19 1.19 1.20 1.21 1.25 1.27
1.27 1.29 1.30 1.30 1.40 1.44 1.48 1.52 1.56
1.63
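The breakdown above is just a tally loop.  A sketch of the bookkeeping,
where the (n, ratio, was_wrong) record format is made up for illustration
(the real driver doesn't store records this way):

```python
from collections import defaultdict

def tally_by_n(records):
    """Group prediction records by n (number of extreme words) and
    report, per n, the total count, the error percentage, and the
    sorted ratios of the wrong predictions.

    records: iterable of (n, ratio, was_wrong) tuples -- a made-up
    format for illustration.
    """
    total = defaultdict(int)
    wrong_ratios = defaultdict(list)
    for n, ratio, was_wrong in records:
        total[n] += 1
        if was_wrong:
            wrong_ratios[n].append(ratio)
    lines = []
    for n in sorted(total):
        pct = 100.0 * len(wrong_ratios[n]) / total[n]
        line = "%d: [%d] {%.2f%%}" % (n, total[n], pct)
        if wrong_ratios[n]:
            line += " " + " ".join("%.2f" % r for r in sorted(wrong_ratios[n]))
        lines.append(line)
    return lines
```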
Several things to note:
1. The error rate is generally lower the more words we've got to
work with.
2. There are notable exceptions to that, but error rates are so low
that a single message makes a large difference in error rate.
3. There doesn't appear to be any correlation between n and the
maximum ratio "that works" for that n.
4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8.
If we were willing to accept half of 1 percent of 1 percent as
an acceptable error rate for "certainty", a fixed cutoff of 1.8
would have caused 5 false negatives (sorry, you can't tell whether
they're f-p or f-n from the above) in the region of certainty,
and no false positives there:
[overall results with a fixed ratio cutoff of 1.8]
for all ham
45000 total
certain 44830 99.622% (|zham| smaller and ratio > 1.8)
wrong 0 0.000%
unsure 170 0.378% (|zham| smaller and ratio <= 1.8)
wrong 37 21.765%
for all spam
45000 total
certain 44563 99.029% (|zspam| smaller and ratio > 1.8)
wrong 5 0.011%
unsure 437 0.971% (|zspam| smaller and ratio <= 1.8)
wrong 79 18.078%
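The certain/unsure accounting for a fixed cutoff is equally simple.  A
sketch, again with a made-up (ratio, was_wrong) record format:

```python
def score_cutoff(records, cutoff):
    """For a fixed ratio cutoff, split predictions into "certain"
    (ratio > cutoff) vs "unsure" (ratio <= cutoff), and count how
    many predictions in each bucket were wrong.

    records: iterable of (ratio, was_wrong) pairs -- a made-up
    format for illustration.
    Returns (n_certain, wrong_certain, n_unsure, wrong_unsure).
    """
    n_certain = wrong_certain = n_unsure = wrong_unsure = 0
    for ratio, was_wrong in records:
        if ratio > cutoff:
            n_certain += 1
            wrong_certain += was_wrong  # bool counts as 0 or 1
        else:
            n_unsure += 1
            wrong_unsure += was_wrong
    return n_certain, wrong_certain, n_unsure, wrong_unsure
```

Run separately over the ham and spam predictions, that gives the four
rows of each block above.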