[Spambayes] defaults vs. chi-square
Tim Peters
tim.one@comcast.net
Tue Oct 29 04:55:27 2002
[Tim claims to have fixed the "plain text follows a base64 section"
decoding glitch]
Just FYI, this had minor good effects on my c.l.py test (10-fold CV):
filename:       cv     tcap
ham:spam:  20000:14000
           20000:14000
fp total:        2        2
fp %:         0.01     0.01
fn total:        0        0
fn %:         0.00     0.00
unsure t:      103       97
unsure %:     0.30     0.29
real cost:  $40.60   $39.40
best cost:  $27.00   $26.80
h mean:       0.28     0.26
h sdev:       2.99     2.89
s mean:      99.94    99.94
s sdev:       1.41     1.44
mean diff:   99.66    99.68
k:           22.65    23.02
Hmm!  That "after" run there also had

    replace_nonascii_chars: True

different.  Sorry about that; it's not worth it (to me) to separate those
out.
The percentiles for this large-training test have gotten very interesting:
-> <stat> Ham scores for all runs: 20000 items; mean 0.26; sdev 2.89
-> <stat> min 0; median 6.37101e-011; max 100
-> <stat> percentiles: 5% 0; 25% 2.22045e-014; 75% 8.15779e-007; 95% 0.0358985
-> <stat> Spam scores for all runs: 14000 items; mean 99.94; sdev 1.44
-> <stat> min 29.8279; median 100; max 100
-> <stat> percentiles: 5% 100; 25% 100; 75% 100; 95% 100
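That extreme clustering (ham medians around 1e-10, spam percentiles pinned at
100) is what chi-squared combining tends to produce.  For readers following
along, here's a minimal sketch of the scheme (modeled on the approach in
Spambayes' chi2.py; function names and the plain series evaluation are mine --
the real code works harder to avoid underflow):

```python
import math

def chi2q(x2, v):
    # Survival function of the chi-squared distribution for EVEN degrees
    # of freedom v: the probability a chi-squared variate exceeds x2,
    # computed via the closed-form series for even v.
    assert v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_squared_score(probs):
    # Fisher-style combining: treat the per-word spam probabilities
    # (each strictly between 0 and 1) as p-values.  S is near 1 when the
    # probs look uniformly spammy, H near 1 when uniformly hammy; the
    # final score lands near 0 for ham, near 1 for spam, and near 0.5
    # when the evidence is mixed -- hence the "unsure" middle ground.
    n = len(probs)
    s = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    h = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0
```

With uniformly extreme inputs the score saturates quickly, which is why the
ham and spam distributions above hug 0 and 100 so tightly.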
Histogram analysis still suggests it would be cheaper to let some FN go
through:
-> best cost for all runs: $26.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.5 & 0.775
-> fp 2; fn 3; unsure ham 11; unsure spam 8
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0559%
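For reference, those dollar figures fall straight out of the counts and the
stated per-item costs; a trivial check (function name mine):

```python
def cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    # Total dollar cost of a run: $10 per false positive, $1 per false
    # negative, $0.20 per unsure message (ham or spam), as stated above.
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

best = cost(fp=2, fn=3, unsure=11 + 8)   # the $26.80 "best cost" line
real = cost(fp=2, fn=0, unsure=97)       # the $39.40 "real cost" line
```

Letting 3 FN through trades $3 of FN cost for a $15.60 drop in unsure cost,
which is why the best-cost cutoffs sit where they do.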