[Spambayes] Chi True results
Brad Clements
bkc@murkworks.com
Sat, 12 Oct 2002 15:07:50 -0400
I ran this twice: first to get the recommended spam cutoff, then a second time with
the recommended cutoff in the .ini.
Then I compared the result against the tim_combine_true test I ran previously.
In this message: the .ini, cmp.py results, and histograms from the chi true run.
[Tokenizer]
mine_received_headers: True
[Classifier]
use_central_limit = False
use_central_limit2 = False
use_central_limit3 = False
use_tim_combining: False
use_chi_squared_combining: True
[TestDriver]
spam_cutoff: 0.98
show_false_negatives: True
show_false_positives: True
nbuckets: 200
best_cutoff_fp_weight: 10
show_spam_lo: 0.4
show_spam_hi: 0.80
show_ham_lo = 0.40
show_ham_hi = 0.80
show_charlimit: 10000
save_trained_pickles: True
save_histogram_pickles: True
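
For readers unfamiliar with what use_chi_squared_combining turns on, here is a minimal sketch of Fisher-style chi-squared combining of per-token spam probabilities. The function names (chi2q, chi_combine) and the sample inputs are made up for illustration; this shows the general technique under those assumptions and is not a copy of the project's classifier code.

```python
import math

def chi2q(x2, v):
    """Return P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    into a single score in [0, 1]. Hypothetical helper for illustration."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)   # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)    # evidence for ham
    return (s - h + 1.0) / 2.0

print(chi_combine([0.99] * 10))   # strongly spammy clues
print(chi_combine([0.01] * 10))   # strongly hammy clues
print(chi_combine([0.99, 0.01]))  # conflicting clues land near the middle
```

A scheme like this pushes most messages to the extremes and leaves conflicting evidence near 0.5, which would be consistent with the histograms further down, where almost all ham sits in the 0.0 bucket and almost all spam in the 99.5 bucket.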
results/timcombinetrues.txt -> results/chitrues.txt
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
false positive percentages
1.077 0.154 won -85.70%
0.769 0.231 won -69.96%
0.769 0.077 won -89.99%
0.923 0.154 won -83.32%
0.769 0.154 won -79.97%
0.538 0.077 won -85.69%
0.538 0.077 won -85.69%
0.692 0.000 won -100.00%
0.769 0.231 won -69.96%
0.692 0.000 won -100.00%
won 10 times
tied 0 times
lost 0 times
total unique fp went from 98 to 15 won -84.69%
mean fp % went from 0.753846153846 to 0.115384615385 won -84.69%
false negative percentages
0.154 0.846 lost +449.35%
0.154 1.231 lost +699.35%
0.231 1.154 lost +399.57%
0.077 0.615 lost +698.70%
0.000 0.923 lost +(was 0)
0.231 1.308 lost +466.23%
0.231 0.692 lost +199.57%
0.077 1.077 lost +1298.70%
0.154 1.231 lost +699.35%
0.231 1.231 lost +432.90%
won 0 times
tied 0 times
lost 10 times
total unique fn went from 20 to 134 lost +570.00%
mean fn % went from 0.153846153846 to 1.03076923077 lost +570.00%
ham mean ham sdev
12.23 1.40 -88.55% 9.02 8.67 -3.88%
12.04 1.12 -90.70% 8.57 8.09 -5.60%
12.08 1.12 -90.73% 8.44 8.02 -4.98%
12.21 1.26 -89.68% 8.65 8.62 -0.35%
11.98 1.06 -91.15% 8.40 8.03 -4.40%
12.20 1.01 -91.72% 8.16 6.87 -15.81%
11.69 0.85 -92.73% 7.80 6.57 -15.77%
11.61 0.96 -91.73% 7.91 7.06 -10.75%
11.63 1.15 -90.11% 8.31 8.38 +0.84%
11.60 1.01 -91.29% 7.94 7.62 -4.03%
ham mean and sdev for all runs
11.93 1.09 -90.86% 8.33 7.83 -6.00%
spam mean spam sdev
90.31 99.74 +10.44% 7.59 3.59 -52.70%
90.59 99.67 +10.02% 7.68 4.17 -45.70%
90.72 99.68 +9.88% 7.40 4.12 -44.32%
90.91 99.83 +9.81% 7.16 2.68 -62.57%
90.54 99.84 +10.27% 6.93 2.20 -68.25%
90.68 99.66 +9.90% 7.23 4.29 -40.66%
90.49 99.67 +10.14% 7.25 4.68 -35.45%
90.61 99.79 +10.13% 7.29 2.98 -59.12%
90.93 99.75 +9.70% 7.21 3.24 -55.06%
90.40 99.54 +10.11% 7.80 5.07 -35.00%
spam mean and sdev for all runs
90.62 99.72 +10.04% 7.36 3.80 -48.37%
ham/spam mean difference: 78.69 98.63 +19.94
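
The fp and fn summary lines above can be checked by hand. The sketch below assumes each of the 10 runs tested 1300 hams and 1300 spams (13000 of each in total, matching the histogram headers) and that the "won"/"lost" figure is the relative change from the first column to the second; the helper names are hypothetical.

```python
def pct_rate(errors, total):
    """Error count expressed as a percentage of the messages tested."""
    return 100.0 * errors / total

def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

n_tested = 10 * 1300  # 13000 hams (and 13000 spams) across all runs

# False positives: 98 -> 15
print(pct_rate(98, n_tested), pct_rate(15, n_tested), pct_change(98, 15))
# False negatives: 20 -> 134
print(pct_rate(20, n_tested), pct_rate(134, n_tested), pct_change(20, 134))
```

Under those assumptions the figures agree with the report: fp falls from about 0.754% to about 0.115% (-84.69%), while fn rises from about 0.154% to about 1.031% (+570.00%).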
--
histogram from chi: true
-> <stat> Ham scores for all runs: 13000 items; mean 1.09; sdev 7.83
-> <stat> min -2.66454e-13; median 2.85882e-12; max 100
* = 204 items
0.0 12433 *************************************************************
0.5 71 *
1.0 43 *
1.5 33 *
2.0 14 *
2.5 15 *
3.0 12 *
3.5 5 *
4.0 14 *
4.5 11 *
5.0 6 *
5.5 9 *
6.0 9 *
6.5 5 *
7.0 6 *
7.5 3 *
8.0 7 *
8.5 2 *
9.0 5 *
9.5 5 *
10.0 5 *
10.5 5 *
11.0 3 *
11.5 4 *
12.0 7 *
12.5 2 *
13.0 3 *
13.5 2 *
14.0 3 *
14.5 4 *
15.0 3 *
15.5 3 *
16.0 0
16.5 3 *
17.0 2 *
17.5 1 *
18.0 0
18.5 5 *
19.0 3 *
19.5 1 *
20.0 1 *
20.5 3 *
21.0 0
21.5 1 *
22.0 1 *
22.5 2 *
23.0 1 *
23.5 2 *
24.0 2 *
24.5 0
25.0 0
25.5 3 *
26.0 2 *
26.5 2 *
27.0 1 *
27.5 1 *
28.0 2 *
28.5 3 *
29.0 2 *
29.5 2 *
30.0 1 *
30.5 3 *
31.0 1 *
31.5 1 *
32.0 4 *
32.5 2 *
33.0 2 *
33.5 3 *
34.0 1 *
34.5 3 *
35.0 1 *
35.5 3 *
36.0 5 *
36.5 4 *
37.0 0
37.5 3 *
38.0 1 *
38.5 1 *
39.0 0
39.5 2 *
40.0 2 *
40.5 3 *
41.0 2 *
41.5 1 *
42.0 1 *
42.5 3 *
43.0 2 *
43.5 1 *
44.0 2 *
44.5 3 *
45.0 3 *
45.5 5 *
46.0 1 *
46.5 3 *
47.0 1 *
47.5 5 *
48.0 1 *
48.5 3 *
49.0 9 *
49.5 11 *
50.0 8 *
50.5 1 *
51.0 3 *
51.5 1 *
52.0 7 *
52.5 3 *
53.0 2 *
53.5 1 *
54.0 0
54.5 1 *
55.0 2 *
55.5 0
56.0 3 *
56.5 0
57.0 0
57.5 1 *
58.0 2 *
58.5 0
59.0 0
59.5 1 *
60.0 1 *
60.5 1 *
61.0 0
61.5 0
62.0 0
62.5 0
63.0 2 *
63.5 0
64.0 0
64.5 0
65.0 0
65.5 1 *
66.0 0
66.5 0
67.0 0
67.5 0
68.0 0
68.5 2 *
69.0 0
69.5 1 *
70.0 1 *
70.5 0
71.0 0
71.5 1 *
72.0 0
72.5 1 *
73.0 0
73.5 0
74.0 1 *
74.5 0
75.0 0
75.5 0
76.0 2 *
76.5 0
77.0 0
77.5 0
78.0 0
78.5 1 *
79.0 0
79.5 0
80.0 1 *
80.5 1 *
81.0 1 *
81.5 0
82.0 2 *
82.5 0
83.0 0
83.5 1 *
84.0 0
84.5 3 *
85.0 0
85.5 1 *
86.0 1 *
86.5 0
87.0 1 *
87.5 1 *
88.0 2 *
88.5 1 *
89.0 0
89.5 0
90.0 2 *
90.5 0
91.0 0
91.5 1 *
92.0 0
92.5 1 *
93.0 1 *
93.5 0
94.0 2 *
94.5 1 *
95.0 1 *
95.5 2 *
96.0 1 *
96.5 2 *
97.0 1 *
97.5 2 *
98.0 0
98.5 0
99.0 3 *
99.5 12 * thanks for joining paypal, ETrade news, HP Symposium, Registration ack from Cingular,
EDN renewal, X10 newsletter (argh!), FAFSA US Dept Education renewal :-(,
United Connection, Network Computing Renewal, Infotel Distributing
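
As a reading aid for the histogram rows, here is a small sketch of how a score might map to one of the 200 buckets (nbuckets: 200, so each bucket is 0.5 wide on the 0-100 scale) and how a count might turn into a bar of stars. The helper names are hypothetical, and the rounding-up rule is an assumption inferred from the listing, where every non-empty bucket shows at least one star; it is not taken from the test driver's source.

```python
def bucket_index(score, nbuckets=200):
    """Map a score on the 0-100 scale to a bucket; a score of exactly 100
    is assumed to fall into the last bucket."""
    i = int(score * nbuckets / 100.0)
    return min(max(i, 0), nbuckets - 1)

def bucket_label(index, nbuckets=200):
    """Lower bound of a bucket, as printed in the left column."""
    return index * 100.0 / nbuckets

def stars(count, items_per_star):
    """Number of '*' characters for a bucket, rounding up (assumed)."""
    return -(-count // items_per_star)

print(bucket_label(bucket_index(100.0)))  # last bucket, labelled 99.5
print(stars(12433, 204))                  # the tallest ham bucket
print(stars(71, 204), stars(0, 204))      # small and empty buckets
```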
-> <stat> Spam scores for all runs: 13000 items; mean 99.72; sdev 3.80
This histogram seems broken; I have 4 or 5 spams with prob < 0.05:
> Survey on Software Reuse Views and Activity
> You are invited to participate in my Dissertation research on the topic of
> Software Reuse.
(naw)
VoIP solutions for providers
HP Enterprise Technical Symposium (oops, this should be ham; I guess I got sick of
getting these)
-> <stat> min 0.000127988; median 100; max 100
* = 210 items
0.0 1 * ***New SAP Opportunities*** Client interviewing now!!
0.5 1 * Certified IT professional with over 6 years of Experience on Design
and Coding.
1.0 0
1.5 1 * Senior Consultant with Experience on JD Edwards, ONE WORLD, XE, CNC,
AS/400 is available
2.0 0
2.5 1 * Fax / Copier Sales / service call 2078787
3.0 1 * Development Services on Telecom/Datacom Protocols
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 1 * Certified IT professional with over 6 years of Experience on Design
and Coding.
7.5 0
8.0 0
8.5 1 *
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 1 * Use the Session Scheduler to personalize your training (hp, probably mis-classified, guess I did get sick of them)
16.5 1 * VoIP solutions for providers
17.0 0
17.5 0
18.0 0
18.5 0
19.0 0
19.5 0
20.0 0
20.5 1 *
21.0 0
21.5 0
22.0 2 *
22.5 0
23.0 0
23.5 0
24.0 0
24.5 1 *
25.0 0
25.5 0
26.0 0
26.5 0
27.0 0
27.5 0
28.0 0
28.5 0
29.0 0
29.5 0
30.0 0
30.5 0
31.0 1 *
31.5 0
32.0 0
32.5 0
33.0 0
33.5 0
34.0 0
34.5 0
35.0 0
35.5 0
36.0 0
36.5 0
37.0 0
37.5 0
38.0 0
38.5 1 *
39.0 0
39.5 0
40.0 0
40.5 0
41.0 0
41.5 0
42.0 0
42.5 0
43.0 0
43.5 0
44.0 1 *
44.5 2 *
45.0 0
45.5 0
46.0 0
46.5 0
47.0 0
47.5 0
48.0 0
48.5 1 *
49.0 0
49.5 1 *
50.0 9 *
50.5 0
51.0 2 *
51.5 0
52.0 1 *
52.5 0
53.0 1 *
53.5 0
54.0 0
54.5 0
55.0 0
55.5 2 *
56.0 1 *
56.5 0
57.0 1 *
57.5 0
58.0 0
58.5 0
59.0 0
59.5 0
60.0 0
60.5 0
61.0 0
61.5 0
62.0 0
62.5 2 *
63.0 0
63.5 0
64.0 0
64.5 2 *
65.0 0
65.5 1 *
66.0 1 *
66.5 0
67.0 0
67.5 0
68.0 0
68.5 1 *
69.0 0
69.5 0
70.0 0
70.5 0
71.0 0
71.5 0
72.0 0
72.5 1 *
73.0 0
73.5 1 *
74.0 0
74.5 0
75.0 0
75.5 0
76.0 5 *
76.5 0
77.0 1 *
77.5 2 *
78.0 2 *
78.5 1 *
79.0 2 *
79.5 2 *
80.0 1 *
80.5 0
81.0 1 *
81.5 0
82.0 1 *
82.5 1 *
83.0 2 *
83.5 1 *
84.0 3 *
84.5 0
85.0 1 *
85.5 1 *
86.0 2 *
86.5 1 *
87.0 0
87.5 0
88.0 2 *
88.5 0
89.0 1 *
89.5 1 *
90.0 2 *
90.5 5 *
91.0 0
91.5 4 *
92.0 3 *
92.5 2 *
93.0 1 *
93.5 3 *
94.0 2 *
94.5 5 *
95.0 3 *
95.5 4 *
96.0 5 *
96.5 6 *
97.0 5 *
97.5 4 *
98.0 10 *
98.5 16 *
99.0 33 *
99.5 12807 *************************************************************
-> best cutoff for all runs: 0.98
-> with weighted total 10*15 fp + 134 fn = 284
-> fp rate 0.115% fn rate 1.03%
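
The weighted total above is simply best_cutoff_fp_weight times the false positives plus the false negatives. A minimal sketch of picking a cutoff that way follows; the function names and the toy scores are invented for illustration, and treating "score >= cutoff" as spam is an assumption rather than something confirmed by the output.

```python
def weighted_cost(fp, fn, fp_weight=10):
    """Cost of a cutoff: each false positive counts fp_weight times."""
    return fp_weight * fp + fn

def best_cutoff(ham_scores, spam_scores, cutoffs, fp_weight=10):
    """Return (cutoff, cost, fp, fn) for the cheapest candidate cutoff."""
    best = None
    for c in cutoffs:
        fp = sum(1 for s in ham_scores if s >= c)   # ham called spam
        fn = sum(1 for s in spam_scores if s < c)   # spam called ham
        cost = weighted_cost(fp, fn, fp_weight)
        if best is None or cost < best[1]:
            best = (c, cost, fp, fn)
    return best

# The figures reported for this run: 10*15 fp + 134 fn
print(weighted_cost(15, 134))

# Toy scores on a 0-1 scale, purely illustrative
print(best_cutoff([0.0, 0.1, 0.6], [0.5, 0.99, 1.0], [0.5, 0.98]))
```

With a weight of 10 on false positives, a high cutoff such as 0.98 can win even though it lets more spam through, which is the trade visible in the fp and fn tables earlier in this message.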
saving ham histogram pickle to class_hamhist.pik
saving spam histogram pickle to class_spamhist.pik
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements