[Spambayes] defaults vs. chi-square

Tim Peters tim.one@comcast.net
Tue Oct 29 04:55:27 2002


[Tim claims to have fixed the "plain text follows a base64 section"
 decoding glitch]

Just FYI, this had minor good effects on my c.l.py test (10-fold CV):

filename:       cv    tcap
ham:spam:  20000:14000
                   20000:14000
fp total:        2       2
fp %:         0.01    0.01
fn total:        0       0
fn %:         0.00    0.00
unsure t:      103      97
unsure %:     0.30    0.29
real cost:  $40.60  $39.40
best cost:  $27.00  $26.80
h mean:       0.28    0.26
h sdev:       2.99    2.89
s mean:      99.94   99.94
s sdev:       1.41    1.44
mean diff:   99.66   99.68
k:           22.65   23.02
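The "real cost" and "best cost" rows follow from the standard Spambayes cost model quoted in the histogram output later in this message ($10 per false positive, $1 per false negative, $0.20 per unsure). A minimal sketch of that computation (the function name and defaults here are illustrative, not Spambayes's actual API):

```python
def real_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Total dollar cost of a run under the per-message costs
    reported in the histogram analysis output."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# The "real cost" rows in the table above:
real_cost(2, 0, 103)  # cv run:   2*$10 + 0*$1 + 103*$0.20 = $40.60
real_cost(2, 0, 97)   # tcap run: 2*$10 + 0*$1 +  97*$0.20 = $39.40
```

With FNs free of charge relative to FPs, the unsure pile dominates the bill, which is why shaving 6 unsures in the tcap run moves the cost more than anything else.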

Hmm!  That "after" run also had one other setting different:

    replace_nonascii_chars: True

Sorry about that; it's not worth it (to me) to separate those effects
out.

The percentiles for this large-training test have gotten very interesting:

-> <stat> Ham scores for all runs: 20000 items; mean 0.26; sdev 2.89
-> <stat> min 0; median 6.37101e-011; max 100
-> <stat> percentiles: 5% 0; 25% 2.22045e-014; 75% 8.15779e-007; 95% 0.0358985

-> <stat> Spam scores for all runs: 14000 items; mean 99.94; sdev 1.44
-> <stat> min 29.8279; median 100; max 100
-> <stat> percentiles: 5% 100; 25% 100; 75% 100; 95% 100

Histogram analysis still suggests it would be cheaper to let some FN go
through:

-> best cost for all runs: $26.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.5 & 0.775
->     fp 2; fn 3; unsure ham 11; unsure spam 8
->     fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0559%
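The ham & spam cutoffs partition scores into three buckets: at or below the ham cutoff is ham, at or above the spam cutoff is spam, and everything in between is Unsure. A sketch under assumptions (scores on a 0–1 scale to match the reported cutoffs, and inclusive boundary handling, which is my guess rather than the exact Spambayes rule):

```python
def classify(score, ham_cutoff=0.5, spam_cutoff=0.775):
    # Boundary treatment here is an assumption, not Spambayes's
    # exact rule; only the three-way split is taken from the output.
    if score <= ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

def cost(fp, fn, unsure):
    # Per-message costs as quoted in the histogram output above.
    return 10.0 * fp + 1.0 * fn + 0.20 * unsure

# The reported optimum: fp 2; fn 3; unsure ham 11 + unsure spam 8.
cost(2, 3, 11 + 8)  # = $26.80, the "best cost" line above
```

The "best cost" search just sweeps candidate cutoff pairs, counts fp/fn/unsure at each, and keeps the pair minimizing this cost, which is how it can conclude that admitting 3 FNs beats paying $0.20 each on a larger unsure pile.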