[Spambayes] Proposing to remove 4 combining schemes
Rob W. W. Hooft
rob@hooft.net
Thu Oct 17 16:25:54 2002
I wrote about the huge certainties in chi2 combining:
>>You can downscale things a bit by reducing the final S,H-score in
>>chi_squared combining before calling chi2Q. Maybe take the sqrt or
>>something similar.
>
Tim wrote:
>
> Not really attractive; sqrt would be far too gross a distortion, btw (e.g.,
> it would change a score of 0.5 to 0.0 -- the mean is 2*n and the sdev
> 2*sqrt(n)).
I tried it anyway. Here are some results:
Normal:
-> <stat> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> <stat> min 0; median 1.36141e-11; max 100
-> <stat> fivepctlo 0; fivepcthi 0.144228
* = 253 items
0.0 15415 *************************************************************
0.5 84 *
1.0 54 *
1.5 30 *
2.0 30 *
2.5 17 *
3.0 19 *
3.5 19 *
4.0 12 *
-> <stat> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> <stat> min 6.85475e-09; median 100; max 100
-> <stat> fivepctlo 96.8278; fivepcthi 100
* = 87 items
95.5 46 *
96.0 17 *
96.5 14 *
97.0 16 *
97.5 21 *
98.0 38 *
98.5 35 *
99.0 92 **
99.5 5300 *************************************************************
-> best cost for all runs: $102.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.495 & 0.96
-> fp 3; fn 14; unsure ham 40; unsure spam 253
-> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.34%
==================
Dividing the log-products and n by 2:
-> <stat> Ham scores for all runs: 16000 items; mean 0.76; sdev 5.07
-> <stat> min 0; median 1.19013e-05; max 99.9998
-> <stat> fivepctlo 0; fivepcthi 1.54439
* = 242 items
0.0 14736 *************************************************************
0.5 316 **
1.0 134 *
1.5 103 *
2.0 74 *
2.5 60 *
3.0 37 *
3.5 35 *
4.0 34 *
-> <stat> Spam scores for all runs: 5800 items; mean 98.71; sdev 5.97
-> <stat> min 0.000221093; median 100; max 100
-> <stat> fivepctlo 92.9253; fivepcthi 100
* = 83 items
95.5 27 *
96.0 21 *
96.5 35 *
97.0 38 *
97.5 40 *
98.0 59 *
98.5 82 *
99.0 122 **
99.5 5005 *************************************************************
-> best cost for all runs: $104.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.49 & 0.92
-> fp 3; fn 14; unsure ham 43; unsure spam 259
-> fp rate 0.0187%; fn rate 0.241%; unsure rate 1.39%
=============================================
Dividing the log-products and n by 4:
-> <stat> Ham scores for all runs: 16000 items; mean 1.32; sdev 5.49
-> <stat> min 0; median 0.0140483; max 99.9378
-> <stat> fivepctlo 1.11022e-14; fivepcthi 6.09162
* = 206 items
0.0 12557 *************************************************************
0.5 880 *****
1.0 511 ***
1.5 298 **
2.0 223 **
2.5 176 *
3.0 135 *
3.5 113 *
4.0 91 *
-> <stat> min 0.0626454; median 99.9953; max 100
-> <stat> fivepctlo 87.8576; fivepcthi 100
* = 71 items
95.5 38 *
96.0 54 *
96.5 55 *
97.0 59 *
97.5 70 *
98.0 150 ***
98.5 142 **
99.0 280 ****
99.5 4331 *************************************************************
-> best cost for all runs: $108.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.48 & 0.855
-> fp 4; fn 13; unsure ham 46; unsure spam 230
-> fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
-> largest ham & spam cutoffs 0.485 & 0.855
-> fp 4; fn 14; unsure ham 42; unsure spam 229
-> fp rate 0.025%; fn rate 0.241%; unsure rate 1.24%
As I expected, this significantly broadens the extremes at very little
cost: the best cost rises from $102.60 to $104.40 (/2) and $108.20 (/4).
Statistically, this downweights all clues, thereby compensating for a
"standard" correlation between clues. It may be functionally equivalent
to raising the value of s.
This is the /4 code for reference:
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.38
diff -u -r1.38 classifier.py
--- classifier.py 14 Oct 2002 02:20:35 -0000 1.38
+++ classifier.py 17 Oct 2002 15:24:55 -0000
@@ -516,7 +516,10 @@
S = ln(S) + Sexp * LN2
H = ln(H) + Hexp * LN2
- n = len(clues)
+ S = S/4.0
+ H = H/4.0
+
+ n = len(clues)//4
if n:
S = 1.0 - chi2Q(-2.0 * S, 2*n)
H = 1.0 - chi2Q(-2.0 * H, 2*n)
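
For anyone who wants to try other divisors outside the test harness, the
patched step can be sketched as a standalone function. This is only a
sketch under assumptions: chi_combine and its divisor parameter are
hypothetical names, not part of classifier.py; the logs are summed
directly instead of using the S/Sexp and H/Hexp bookkeeping of the real
code; chi2Q is rewritten here from the closed form for an even number of
degrees of freedom; and the final (S - H + 1) / 2 step is assumed to be
how chi-squared combining forms its score.

```python
import math


def chi2Q(x2, v):
    """Return P(chisq >= x2) for v degrees of freedom; v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)


def chi_combine(probs, divisor=1):
    """Combine per-clue spam probabilities (each strictly between 0 and 1).

    divisor is a hypothetical knob: the log-products and the clue count
    are both divided by it, as in the /2 and /4 experiments.
    """
    n = len(probs) // divisor
    if not n:
        return 0.5  # too few clues to say anything
    S = sum(math.log(1.0 - p) for p in probs) / divisor
    H = sum(math.log(p) for p in probs) / divisor
    S = 1.0 - chi2Q(-2.0 * S, 2 * n)
    H = 1.0 - chi2Q(-2.0 * H, 2 * n)
    return (S - H + 1.0) / 2.0


if __name__ == "__main__":
    clues = [0.9] * 8  # eight moderately spammy clues
    for d in (1, 2, 4):
        print(d, chi_combine(clues, d))
```

With eight clues at 0.9, the score stays on the spam side for all three
divisors but moves away from 1.0 as the divisor grows, which is the
broadening seen in the histograms above.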
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/