[Spambayes] HAMBIAS

Tim Peters tim.one@comcast.net
Sat, 14 Sep 2002 02:21:15 -0400


HAMBIAS=2.0 is the last deliberate bias in my rework of Graham's scoring
algorithm.  Since HAMBIAS > 1.0 artificially lowers spam probabilities, we
can expect it to cut the f-p rate at the expense of boosting the f-n rate.
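
To make that concrete, here's a minimal sketch (my own reconstruction,
not the classifier source) of where the HAMBIAS multiplier enters
Graham-style per-word probabilities:

    def word_spamprob(hamcount, spamcount, nham, nspam,
                      hambias=2.0, spambias=1.0,
                      min_spamprob=0.01, max_spamprob=0.99):
        # Each ham occurrence is counted hambias times, so
        # hambias > 1.0 inflates the ham evidence and drags the
        # word's spam probability down.
        hamratio = min(1.0, hambias * hamcount / nham)
        spamratio = min(1.0, spambias * spamcount / nspam)
        prob = spamratio / (hamratio + spamratio)
        # Clamp to the configured extremes.
        return max(min_spamprob, min(max_spamprob, prob))

    # A word seen at the same rate in ham and spam scores 1/3
    # under HAMBIAS=2.0 instead of the unbiased 0.5:
    print(word_spamprob(10, 10, 1000, 1000))               # ~0.333
    print(word_spamprob(10, 10, 1000, 1000, hambias=1.0))  # 0.5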

Now that I can run 10-fold validation on my relatively large datasets, I
thought it might be interesting to try a pair of runs at HAMBIAS=2.0 and at
HAMBIAS=1.0.  It was <wink>.
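
For anyone who hasn't seen the setup:  10-fold validation splits each
corpus into 10 sets, then each run trains a fresh classifier on 9 of
them and predicts the held-out tenth.  A rough sketch of just the
splitting (not the real driver; train_and_score here is a made-up
stand-in for whatever trains and scores):

    def ten_fold(hams, spams, train_and_score, nfolds=10):
        ham_sets = [hams[i::nfolds] for i in range(nfolds)]
        spam_sets = [spams[i::nfolds] for i in range(nfolds)]
        for i in range(nfolds):
            train_hams = [m for j, s in enumerate(ham_sets)
                            if j != i for m in s]
            train_spams = [m for j, s in enumerate(spam_sets)
                             if j != i for m in s]
            # With 20000 hams total, each fold trains on 18000 and
            # tests the other 2000, matching the <stat> line below.
            train_and_score(train_hams, train_spams,
                            ham_sets[i], spam_sets[i])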

Here are the options for the "before" run:

"""
[TestDriver]
save_trained_pickles = False
show_histograms = True
show_ham_lo = 1.0
show_best_discriminators = 50
show_spam_lo = 1.0
show_ham_hi = 0.0
show_false_positives = True
pickle_basename = class
show_false_negatives = True
nbuckets = 40
show_charlimit = 100000
show_spam_hi = 0.0

[Classifier]
spambias = 1.0
min_spamprob = 0.01
unknown_spamprob = 0.5
hambias = 2.0
max_discriminators = 16
max_spamprob = 0.99

[Tokenizer]
safe_headers = abuse-reports-to
        date
        errors-to
        from
        importance
        in-reply-to
        message-id
        mime-version
        organization
        received
        reply-to
        return-path
        subject
        to
        user-agent
        x-abuse-info
        x-complaints-to
        x-face
mine_received_headers = False
retain_pure_html_tags = False
count_all_header_lines = False
"""

Note that retain_pure_html_tags is False now:  with enough training data,
the c.l.py corpus no longer gets significant benefit from retaining HTML
tags in pure HTML msgs.

The "after" run was identical except for HAMBIAS=1.0.  The outcome, with the
10 runs each having these counts:

-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams

(that's a huge amount of training data!)

false positive percentages
    0.000  0.100  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.050  0.050  tied
    0.000  0.100  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.100  0.150  lost   +50.00%

won   0 times
tied  4 times
lost  6 times

total unique fp went from 3 to 11 lost  +266.67%
mean fp % went from 0.015 to 0.055 lost  +266.67%

false negative percentages
    0.218  0.145  won    -33.49%
    0.436  0.145  won    -66.74%
    0.073  0.000  won   -100.00%
    0.218  0.073  won    -66.51%
    0.218  0.073  won    -66.51%
    0.291  0.000  won   -100.00%
    0.291  0.218  won    -25.09%
    0.218  0.145  won    -33.49%
    0.291  0.000  won   -100.00%
    0.073  0.000  won   -100.00%

won  10 times
tied  0 times
lost  0 times

total unique fn went from 32 to 11 won    -65.63%
mean fn % went from 0.232727272727 to 0.0800000000002 won    -65.62%
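
For reference, the comparison lines above are simple paired bookkeeping:
each "won/tied/lost" is a per-fold before-vs-after comparison, and the
percentages are 100*(after - before)/before, so unique fp going from 3
to 11 reports +266.67%.  A sketch of the same arithmetic (not the
comparison script itself):

    def compare(before, after):
        # before/after are paired per-fold error rates (%).
        won = sum(a < b for b, a in zip(before, after))
        tied = sum(a == b for b, a in zip(before, after))
        lost = sum(a > b for b, a in zip(before, after))
        mb = sum(before) / len(before)
        ma = sum(after) / len(after)
        print("won %d tied %d lost %d" % (won, tied, lost))
        print("mean went from %g to %g (%+.2f%%)"
              % (mb, ma, 100.0 * (ma - mb) / mb))

    # e.g. the first three f-n rows above:
    compare([0.218, 0.436, 0.073], [0.145, 0.145, 0.000])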

The rates are too low to measure reliably, but check it out anyway:  the f-n
and f-p rates are very close after (mean 0.080 vs 0.055, where before they
were a factor of about 15 apart: 0.233 vs 0.015), and there are 11 mistakes
of each kind after (vs 3 fp and 32 fn before).

The obvious conjecture is that, given enough training data, removing the
last deliberate bias in fact yields results favoring neither ham nor spam at
the expense of the other.

That's comforting, at least from a useless theoretical POV <wink>.