[Spambayes] Central limit
Tim Peters
tim.one@comcast.net
Mon, 30 Sep 2002 23:47:38 -0400
[Tim]
> ...
> I made up a combination of "look at ratios" and "different cutoffs
> for different n" by iteratively staring at the errors and making
> stuff up.
It now appears that the "different cutoffs for different n" was just an
accident based on the specific errors I stared at.
Recall that the "certainty heuristic" was of the form:
ratio = max(abs(zham / zspam), abs(zspam / zham))
certain = ratio > cutoff
and then I went on to choose different cutoffs depending on n (n is the
number of "extreme words" found in the msg, with a maximum of 50).
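As a sketch of what that heuristic amounts to (the function name and
signature here are made up for illustration; the real code lives in the
classifier, and I'm ignoring the degenerate case of a zero z-score):

```python
def is_certain(zham, zspam, cutoff):
    """Certainty heuristic: how lopsided are the two z-scores?

    zham and zspam are the z-scores from the log-central-limit code;
    cutoff is whatever threshold we pick (possibly varying with n).
    Assumes neither z-score is exactly zero.
    """
    ratio = max(abs(zham / zspam), abs(zspam / zham))
    return ratio > cutoff
```

The max() makes the test symmetric:  it doesn't matter which of the two
z-scores is smaller, only how far apart they are.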
Here's an exhaustive account of all the times the log-central-limit code was
wrong (meaning that abs(zham) < abs(zspam) but the msg was really spam, or
that abs(zspam) < abs(zham) but the msg was really ham). This is segregated
by n (the number of extreme words). For each n, a list of all ratios in the
"but I was wrong" cases is given. The number in square brackets is the
number of predictions made with this specific value of n. The number in
curly braces is the percentage of incorrect predictions.  So, for example,
we made predictions on 35 msgs with 7 extreme words (those are very short
msgs!).  Two of those predictions were wrong (5.71% of 35); in one of those
cases ratio was 1.31, and in the other ratio was 1.72.
3: [36] {0.00%}
4: [21] {0.00%}
5: [14] {0.00%}
6: [22] {0.00%}
7: [35] {5.71%} 1.31 1.72
8: [42] {4.76%} 1.01 1.33
9: [72] {5.56%} 1.00 1.04 1.14 1.28
10: [123] {0.00%}
11: [129] {1.55%} 1.07 1.09
12: [123] {1.63%} 1.05 1.09
13: [131] {0.00%}
14: [169] {0.59%} 1.11
15: [180] {1.11%} 1.18 1.73
16: [232] {1.29%} 1.12 1.12 1.43
17: [315] {1.27%} 1.06 1.06 1.27 1.48
18: [344] {1.16%} 1.28 1.35 1.50 1.60
19: [333] {1.20%} 1.03 1.24 1.75 1.78
20: [375] {0.53%} 1.10 1.12
21: [448] {0.45%} 1.09 2.54
22: [492] {0.00%}
23: [535] {0.56%} 1.38 1.72 2.20
24: [604] {0.50%} 1.03 1.17 1.66
25: [638] {0.63%} 1.04 1.55 1.64 1.85
26: [594] {0.51%} 1.06 1.07 1.13
27: [676] {0.74%} 1.02 1.03 1.06 1.26 1.35
28: [789] {0.00%}
29: [811] {0.49%} 1.03 1.18 1.41 2.24
30: [763] {0.39%} 1.04 1.04 2.08
31: [805] {0.12%} 1.44
32: [787] {0.13%} 1.19
33: [763] {0.26%} 1.10 1.36
34: [764] {0.13%} 1.04
35: [822] {0.12%} 1.03
36: [796] {0.00%}
37: [819] {0.00%}
38: [947] {0.11%} 1.08
39: [907] {0.00%}
40: [873] {0.00%}
41: [877] {0.11%} 1.21
42: [1016] {0.00%}
43: [1005] {0.00%}
44: [1016] {0.00%}
45: [1003] {0.30%} 1.07 1.10 1.27
46: [1068] {0.09%} 1.24
47: [1019] {0.00%}
48: [1026] {0.10%} 1.15
49: [1056] {0.28%} 1.09 1.10 1.24
50: [63585] {0.07%} 1.02 1.02 1.02 1.03 1.03 1.04 1.04 1.04 1.05
1.05 1.05 1.06 1.06 1.08 1.09 1.09 1.09 1.10
1.10 1.11 1.11 1.12 1.13 1.14 1.17 1.17 1.18
1.18 1.18 1.19 1.19 1.19 1.20 1.21 1.25 1.27
1.27 1.29 1.30 1.30 1.40 1.44 1.48 1.52 1.56
1.63
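The breakdown above is just a tally loop.  A sketch of the bookkeeping,
where the (n, ratio, was_wrong) record format is made up for illustration
(the real driver doesn't store records this way):

```python
from collections import defaultdict

def tally_by_n(records):
    """Group prediction records by n (number of extreme words) and
    report, per n, the total count, the error percentage, and the
    sorted ratios of the wrong predictions.

    records: iterable of (n, ratio, was_wrong) tuples -- a made-up
    format for illustration.
    """
    total = defaultdict(int)
    wrong_ratios = defaultdict(list)
    for n, ratio, was_wrong in records:
        total[n] += 1
        if was_wrong:
            wrong_ratios[n].append(ratio)
    lines = []
    for n in sorted(total):
        pct = 100.0 * len(wrong_ratios[n]) / total[n]
        line = "%d: [%d] {%.2f%%}" % (n, total[n], pct)
        if wrong_ratios[n]:
            line += " " + " ".join("%.2f" % r for r in sorted(wrong_ratios[n]))
        lines.append(line)
    return lines
```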
Several things to note:
1. The error rate is generally lower the more words we've got to
work with.
2. There are notable exceptions to that, but error rates are so low
that a single message makes a large difference in error rate.
3. There doesn't appear to be any correlation between n and the
maximum ratio "that works" for that n.
4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8.
If we were willing to accept half of 1 percent of 1 percent as
an acceptable error rate for "certainty", a fixed cutoff of 1.8
would have caused 5 false negatives (sorry, you can't tell whether
they're f-p or f-n from the above) in the region of certainty,
and no false positives there:
[overall results with a fixed ratio cutoff of 1.8]
for all ham
45000 total
certain 44830 99.622% (|zham| smaller and ratio > 1.8)
wrong 0 0.000%
unsure 170 0.378% (|zham| smaller and ratio <= 1.8)
wrong 37 21.765%
for all spam
45000 total
certain 44563 99.029% (|zspam| smaller and ratio > 1.8)
wrong 5 0.011%
unsure 437 0.971% (|zspam| smaller and ratio <= 1.8)
wrong 79 18.078%
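The certain/unsure accounting for a fixed cutoff is equally simple.  A
sketch, again with a made-up (ratio, was_wrong) record format:

```python
def score_cutoff(records, cutoff):
    """For a fixed ratio cutoff, split predictions into "certain"
    (ratio > cutoff) vs "unsure" (ratio <= cutoff), and count how
    many predictions in each bucket were wrong.

    records: iterable of (ratio, was_wrong) pairs -- a made-up
    format for illustration.
    Returns (n_certain, wrong_certain, n_unsure, wrong_unsure).
    """
    n_certain = wrong_certain = n_unsure = wrong_unsure = 0
    for ratio, was_wrong in records:
        if ratio > cutoff:
            n_certain += 1
            wrong_certain += was_wrong  # bool counts as 0 or 1
        else:
            n_unsure += 1
            wrong_unsure += was_wrong
    return n_certain, wrong_certain, n_unsure, wrong_unsure
```

Run separately over the ham and spam predictions, that gives the four
rows of each block above.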