[Spambayes] HAMBIAS
Tim Peters
tim.one@comcast.net
Sat, 14 Sep 2002 02:21:15 -0400
HAMBIAS=2.0 is the last deliberate bias in my rework of Graham's scoring
algorithm. Since HAMBIAS > 1.0 artificially lowers spam probabilities, we
can expect it to cut the f-p rate at the expense of boosting the f-n rate.
Now that I can run 10-fold validation on my relatively large datasets, I
thought it might be interesting to try a pair of runs at HAMBIAS=2.0 and at
HAMBIAS=1.0. It was <wink>.
Here are the options for the "before" run:
"""
[TestDriver]
save_trained_pickles = False
show_histograms = True
show_ham_lo = 1.0
show_best_discriminators = 50
show_spam_lo = 1.0
show_ham_hi = 0.0
show_false_positives = True
pickle_basename = class
show_false_negatives = True
nbuckets = 40
show_charlimit = 100000
show_spam_hi = 0.0
[Classifier]
spambias = 1.0
min_spamprob = 0.01
unknown_spamprob = 0.5
hambias = 2.0
max_discriminators = 16
max_spamprob = 0.99
[Tokenizer]
safe_headers = abuse-reports-to
date
errors-to
from
importance
in-reply-to
message-id
mime-version
organization
received
reply-to
return-path
subject
to
user-agent
x-abuse-info
x-complaints-to
x-face
mine_received_headers = False
retain_pure_html_tags = False
count_all_header_lines = False
"""
Note that retain_pure_html_tags is False now: with enough training data,
the c.l.py corpus no longer gets significant benefit from retaining HTML
tags in pure HTML msgs.
The "after" run was identical except for HAMBIAS=1.0. Here's the outcome;
each of the 10 runs had these counts:

-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
(that's a huge amount of training data!)
false positive percentages
0.000 0.100 lost +(was 0)
0.000 0.050 lost +(was 0)
0.000 0.050 lost +(was 0)
0.000 0.050 lost +(was 0)
0.050 0.050 tied
0.000 0.100 lost +(was 0)
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.100 0.150 lost +50.00%
won 0 times
tied 4 times
lost 6 times
total unique fp went from 3 to 11 lost +266.67%
mean fp % went from 0.015 to 0.055 lost +266.67%
false negative percentages
0.218 0.145 won -33.49%
0.436 0.145 won -66.74%
0.073 0.000 won -100.00%
0.218 0.073 won -66.51%
0.218 0.073 won -66.51%
0.291 0.000 won -100.00%
0.291 0.218 won -25.09%
0.218 0.145 won -33.49%
0.291 0.000 won -100.00%
0.073 0.000 won -100.00%
won 10 times
tied 0 times
lost 0 times
total unique fn went from 32 to 11 won -65.63%
mean fn % went from 0.232727272727 to 0.0800000000002 won -65.62%
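For the record, the "won/lost" percentages in those tables are just the relative change from the "before" rate to the "after" rate. A trivial sketch of that computation (not the actual TestDriver code):

```python
# How the "lost +266.67%" / "won -65.63%" figures are derived:
# percent change from the "before" value to the "after" value.

def relative_change(before, after):
    """Relative change, in percent, from before to after."""
    return 100.0 * (after - before) / before

# mean fp % went from 0.015 to 0.055:
print(round(relative_change(0.015, 0.055), 2))  # +266.67
# total unique fn went from 32 to 11:
print(round(relative_change(32, 11), 2))        # -65.62 (i.e. -65.625)
```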
The rates are too low to measure reliably, but check it out anyway: the f-n
and f-p rates are very close after (vs a factor of about 15 apart before),
and there are 11 mistakes of each kind after (vs 3 fp and 32 fn before).
The obvious conjecture is that, given enough training data, removing the
last deliberate bias in fact yields results favoring neither ham nor spam at
the expense of the other.
That's comforting, at least from a useless theoretical POV <wink>.