[Spambayes] Total cost analysis

Tim Peters tim.one@comcast.net
Mon, 14 Oct 2002 13:34:39 -0400


In order to ease "middle ground" testing, I redid the automatic histogram
analysis to do total-cost minimization similar to that done by Rob's
cvcost.py.

Here's highly atypical sample output.  It's from a tiny run so that you can
see by eyeball what it means:

-> <stat> Ham scores for this pair: 10 items; mean 1.04; sdev 1.21
-> <stat> min 0.000428085; median 0.45401; max 3.12227
* = 1 items
 0.0 5 *****
 0.5 2 **
 1.0 0
 1.5 0
 2.0 1 *
 2.5 0
 3.0 2 **
 3.5 0
...

-> <stat> Spam scores for this pair: 10 items; mean 100.00; sdev 0.00
-> <stat> min 100; median 100; max 100
* = 1 items
...
99.0  0
99.5 10 **********
-> best cost $0.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 18721 cutoff pairs
-> smallest ham & spam cutoffs 0.035 & 0.035
->     fp 0; fn 0; unsure ham 0; unsure spam 0
->     fp rate 0%; fn rate 0%
-> largest ham & spam cutoffs 0.995 & 0.995
->     fp 0; fn 0; unsure ham 0; unsure spam 0
->     fp rate 0%; fn rate 0%

This is trivial because no "middle ground" is needed here:  calling
everything >= 0.035 spam works exactly as well as calling everything >=
0.995 spam, and there are no mistakes or unsures in either case.

Less trivial, because the ham scores slobber all over the range:

-> <stat> Ham scores for all runs: 100 items; mean 8.09; sdev 17.25
-> <stat> min 3.24153e-007; median 0.846144; max 97.6463
* = 1 items
 0.0 44 ********************************************
 0.5  8 ********
 1.0  5 *****
 1.5  5 *****
 2.0  4 ****
 2.5  0
 3.0  5 *****
 3.5  2 **
 4.0  1 *
 4.5  0
 5.0  1 *
 5.5  1 *
 6.0  0
 6.5  2 **
 7.0  2 **
 7.5  1 *
 8.0  0
 8.5  0
 9.0  0
 9.5  0
10.0  1 *
10.5  0
11.0  0
11.5  0
12.0  0
12.5  0
13.0  0
13.5  0
14.0  1 *
14.5  0
15.0  0
15.5  1 *
16.0  1 *
16.5  0
17.0  0
17.5  0
18.0  0
18.5  0
19.0  0
19.5  0
20.0  0
20.5  1 *
21.0  2 **
21.5  0
22.0  1 *
22.5  0
23.0  0
23.5  0
24.0  0
24.5  0
25.0  0
25.5  0
26.0  0
26.5  0
27.0  0
27.5  0
28.0  1 *
28.5  0
29.0  0
29.5  0
30.0  1 *
30.5  1 *
31.0  0
31.5  0
32.0  1 *
32.5  0
33.0  0
33.5  0
34.0  0
34.5  0
35.0  0
35.5  0
36.0  0
36.5  0
37.0  0
37.5  0
38.0  0
38.5  0
39.0  0
39.5  0
40.0  0
40.5  0
41.0  0
41.5  0
42.0  0
42.5  0
43.0  0
43.5  0
44.0  1 *
44.5  0
45.0  0
45.5  0
46.0  1 *
46.5  0
47.0  0
47.5  0
48.0  0
48.5  0
49.0  0
49.5  0
50.0  0
50.5  0
51.0  0
51.5  0
52.0  0
52.5  0
53.0  0
53.5  0
54.0  0
54.5  0
55.0  0
55.5  0
56.0  0
56.5  0
57.0  0
57.5  0
58.0  0
58.5  0
59.0  1 *
59.5  1 *
60.0  0
60.5  0
61.0  0
61.5  1 *
62.0  0
62.5  0
63.0  0
63.5  0
64.0  0
64.5  0
65.0  0
65.5  0
66.0  0
66.5  0
67.0  0
67.5  0
68.0  0
68.5  0
69.0  0
69.5  0
70.0  0
70.5  0
71.0  1 *
71.5  0
72.0  0
72.5  0
73.0  0
73.5  0
74.0  0
74.5  0
75.0  0
75.5  0
76.0  0
76.5  0
77.0  0
77.5  0
78.0  0
78.5  0
79.0  0
79.5  0
80.0  0
80.5  0
81.0  0
81.5  0
82.0  0
82.5  0
83.0  0
83.5  0
84.0  0
84.5  0
85.0  0
85.5  0
86.0  0
86.5  0
87.0  0
87.5  0
88.0  0
88.5  0
89.0  0
89.5  0
90.0  0
90.5  0
91.0  0
91.5  0
92.0  0
92.5  0
93.0  0
93.5  0
94.0  0
94.5  0
95.0  0
95.5  0
96.0  0
96.5  0
97.0  0
97.5  1 *
98.0  0
98.5  0
99.0  0
99.5  0

-> <stat> Spam scores for all runs: 100 items; mean 99.87; sdev 0.71
-> <stat> min 94.9387; median 100; max 100
* = 2 items
...
94.0  0
94.5  1 *
95.0  0
95.5  0
96.0  1 *
96.5  1 *
97.0  0
97.5  0
98.0  0
98.5  0
99.0  1 *
99.5 96 ************************************************
-> best cost $0.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 141 cutoff pairs
-> smallest ham & spam cutoffs 0.715 & 0.98
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%
-> largest ham & spam cutoffs 0.945 & 0.99
->     fp 0; fn 0; unsure ham 1; unsure spam 3
->     fp rate 0%; fn rate 0%

There is a middle ground here:  saying something is "unsure" if

    0.715 <= score < 0.98

works exactly as well as

    0.945 <= score < 0.99

and there are 141-2 = 139 other cutoff pairs from the histogram boundaries
that also achieve cost $0.80 (== 4 msgs in the middle ground, and no errors
outside the middle ground).
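The search over cutoff pairs can be sketched in standalone Python (illustrative code only -- the real work is done by TestDriver.printhist(), and the names here are made up): try every pair of bucket boundaries (hamc, spamc) with hamc <= spamc, score the three kinds of mistakes by their weights, and keep every pair that achieves the minimum.

```python
# Standalone sketch of the cutoff search described above -- not the
# actual TestDriver code.  Scores are floats in [0.0, 1.0].
def best_cutoffs(ham_scores, spam_scores, nbuckets=200,
                 fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    boundaries = [i / nbuckets for i in range(nbuckets + 1)]
    best, winners = None, []
    for i, hamc in enumerate(boundaries):
        for spamc in boundaries[i:]:
            # A ham scoring >= spamc is a false positive; a ham in
            # [hamc, spamc) lands in the middle ground.
            fp = sum(1 for s in ham_scores if s >= spamc)
            unsure = sum(1 for s in ham_scores if hamc <= s < spamc)
            # A spam scoring < hamc is a false negative; a spam in
            # [hamc, spamc) also lands in the middle ground.
            fn = sum(1 for s in spam_scores if s < hamc)
            unsure += sum(1 for s in spam_scores if hamc <= s < spamc)
            cost = fp * fp_weight + fn * fn_weight + unsure * unsure_weight
            if best is None or cost < best:
                best, winners = cost, [(hamc, spamc)]
            elif cost == best:
                winners.append((hamc, spamc))
    return best, winners
```

Brute force over all the boundary pairs is cheap at nbuckets=200 (about 20,000 pairs), which is why reporting every minimizing pair -- 141 of them in the run above -- is practical.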

The default nbuckets has been boosted to 200, although
TestDriver.printhist() (which does this display and computation) can be
passed any number of buckets "after the fact", provided you saved the
histogram objects as pickles.
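The "after the fact" re-display presumably works by re-binning the raw scores carried in the pickled histogram objects; generically, re-binning looks like this (illustrative code, not the actual Hist class):

```python
# Re-bin raw scores (each in [0.0, 1.0]) into any number of buckets,
# the way a saved histogram can be re-displayed at a new granularity.
def rebucket(scores, nbuckets):
    counts = [0] * nbuckets
    for s in scores:
        # A score of exactly 1.0 goes in the last bucket.
        i = min(int(s * nbuckets), nbuckets - 1)
        counts[i] += 1
    return counts
```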

There are four new options to support this:

"""
# After the display of a ham+spam histogram pair, you can get a listing of
# all the cutoff values (coinciding with histogram bucket boundaries) that
# minimize
#
#      best_cutoff_fp_weight * (# false positives) +
#      best_cutoff_fn_weight * (# false negatives) +
#      best_cutoff_unsure_weight * (# unsure msgs)
#
# This displays two cutoffs:  hamc and spamc, where
#
#     0.0 <= hamc <= spamc <= 1.0
#
# The idea is that if something scores < hamc, it's called ham; if
# something scores >= spamc, it's called spam; and everything else is
# called "I'm not sure" -- the middle ground.
#
# Note that cvcost.py does a similar analysis.
#
# Note:  You may wish to increase nbuckets, to give this scheme more
# cutoff values to analyze.
compute_best_cutoffs_from_histograms: True
best_cutoff_fp_weight:     10.00
best_cutoff_fn_weight:      1.00
best_cutoff_unsure_weight:  0.20
"""

Note that the default values match cvcost.py's defaults.