[Spambayes] First result from Gary Robinson's ideas
Neale Pickett
neale@woozle.org
18 Sep 2002 15:43:33 -0700
So then, Neale Pickett <neale@woozle.org> is all like:
> total unique fn went from 12 to 1000 lost +8233.33%
> mean fn % went from 1.2 to 100.0 lost +8233.33%
Please disregard those results. They say that every single message in
my spam corpus got tagged as ham with this change. Investigating, I
found that I had neglected to remove a line calculating prob after
inserting Tim's new code, so everything was getting a probability of
0.5. On the positive side, my FP rate went to 0! ;)
So here's another run with the *right* code change.
The first run, run1, is using Tim's original classifier code. The
second is using the following modification Tim proposed to implement
Gary's first suggestion:
      prob_product = inverse_prob_product = 1.0
+     P = Q = 1.0
+     num_clues = 0
      for distance, prob, word, record in nbest:
          if prob is None:    # it's one of the dummies nbest started with
              continue
          if record is not None:  # else wordinfo doesn't know about it
              record.killcount += 1
          if evidence:
              clues.append((word, prob))
-         prob_product *= prob
-         inverse_prob_product *= 1.0 - prob
+         num_clues += 1
+         P *= 1.0 - prob
+         Q *= prob
+
+     if num_clues:
+         P = 1.0 - P**(1./num_clues)
+         Q = 1.0 - Q**(1./num_clues)
+         prob = (P-Q)/(P+Q)   # in -1 .. 1
+         prob = 0.5 + prob/2  # shift to 0 .. 1
+     else:
+         prob = 0.5
-     prob = prob_product / (prob_product + inverse_prob_product)
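For anyone who wants to play with the combining step in isolation, here's a
minimal sketch of the two schemes as standalone functions (the names
graham_combine and robinson_combine are mine, not from the classifier; the
arithmetic follows the diff above):

```python
def graham_combine(probs):
    """Original combining: product of probs vs. product of complements."""
    P = Q = 1.0
    for p in probs:
        P *= p
        Q *= 1.0 - p
    return P / (P + Q)

def robinson_combine(probs):
    """Gary's combining: geometric means of complements and probs,
    mapped from -1..1 into 0..1."""
    n = len(probs)
    if not n:
        return 0.5              # no clues -> no opinion
    P = Q = 1.0
    for p in probs:
        P *= 1.0 - p
        Q *= p
    P = 1.0 - P ** (1.0 / n)    # nth root tames the extreme products
    Q = 1.0 - Q ** (1.0 / n)
    S = (P - Q) / (P + Q)       # in -1 .. 1
    return 0.5 + S / 2          # shift to 0 .. 1
```

The nth root is what produces the "shades of gray" below: a single strong
clue can no longer drag the whole product to 0 or 1, so mixed evidence lands
in the middle instead of at the extremes.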
Additionally, I changed the "spam cutoff" from 0.9 to 0.5. Comparing
the results before (run1) and after (run2), I get:
"""
run1s -> run2s
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
false positive percentages
2.000 2.000 tied
1.500 1.500 tied
2.000 2.000 tied
1.000 1.500 lost +50.00%
0.500 0.500 tied
won 0 times
tied 4 times
lost 1 times
total unique fp went from 14 to 15 lost +7.14%
mean fp % went from 1.4 to 1.5 lost +7.14%
false negative percentages
1.500 1.000 won -33.33%
1.000 1.000 tied
1.500 1.500 tied
1.500 1.500 tied
1.000 1.000 tied
won 1 times
tied 4 times
lost 0 times
total unique fn went from 13 to 12 won -7.69%
mean fn % went from 1.3 to 1.2 won -7.69%
"""
So false positives basically stayed the same; in the one case where they
got worse, it was only by one message, which I would imagine is within
the margin of error, but I Am Not A Statistician :).
But, as Tim said earlier, what's really interesting is the distribution
of scores across all runs. The first run, without Gary's modification,
gives me the following distributions:
"""
Ham distribution for all runs:
* = 17 items
0.00 984 **********************************************************
2.50 1 *
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 1 *
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 14 *
Spam distribution for all runs:
* = 17 items
0.00 11 *
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 1 *
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 1 *
85.00 0
87.50 0
90.00 0
92.50 2 *
95.00 3 *
97.50 982 **********************************************************
"""
Your typical Grahamian black-or-white picture, with little middle
ground. With Gary's idea, however, come many more shades of gray:
"""
Ham distribution for all runs:
* = 12 items
0.00 681 *********************************************************
2.50 62 ******
5.00 18 **
7.50 10 *
10.00 14 **
12.50 33 ***
15.00 40 ****
17.50 28 ***
20.00 14 **
22.50 22 **
25.00 5 *
27.50 11 *
30.00 13 **
32.50 10 *
35.00 6 *
37.50 5 *
40.00 3 *
42.50 5 *
45.00 5 *
47.50 0
50.00 1 *
52.50 4 *
55.00 1 *
57.50 1 *
60.00 2 *
62.50 1 *
65.00 0
67.50 1 *
70.00 2 *
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 2 *
Spam distribution for all runs:
* = 14 items
0.00 1 *
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 4 *
35.00 0
37.50 1 *
40.00 1 *
42.50 1 *
45.00 3 *
47.50 0
50.00 4 *
52.50 4 *
55.00 2 *
57.50 9 *
60.00 13 *
62.50 9 *
65.00 12 *
67.50 14 *
70.00 9 *
72.50 15 **
75.00 8 *
77.50 7 *
80.00 11 *
82.50 12 *
85.00 9 *
87.50 0
90.00 1 *
92.50 2 *
95.00 13 *
97.50 835 ************************************************************
"""
So from my perspective (and again, IANAS) it looks like the algorithm
has gained some humility and is admitting when it's not sure about
stuff. I can't say this change is a clear win for my minuscule data
set, but it *does* appear to make the probability more meaningful.
Almost like the difference between linear space and log space.
Neale