[Spambayes] Moving closer to Gary's ideal

Tim Peters tim.one@comcast.net
Sat, 21 Sep 2002 07:16:37 -0400


Update.  My test data looks clean again.  Added 251 new spam, and purged a
ham that's been hiding in the spam forever.  Now using 20,000 ham and 14,000
spam.

There were two cut'n'paste typos in the code that affected results.  Fixing
those increased the separation between the ham and spam means over what I
reported last time.

The options used here are as reported last time:

"""
[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500

[TestDriver]
spam_cutoff: 0.50
"""

This implements everything we talked about, except for the disputed ranking
step.  All biases are gone, and no limits are placed on the probabilities.
For the probability adjustment step (Robinson's f(w)), I'm using a=1 and
x=0.5 (I can play with these, but doubt they're the most useful things to
poke at; and we moved to using 0.5 for the "unknown word probability" under
Graham's scheme long ago).
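
To be concrete about what that adjustment does:  it pulls each word's raw
spamprob toward the unknown-word prior x, with a acting as the strength of
the pull, so a word seen in only a message or two can't claim an extreme
probability all by itself.  A minimal sketch (not the classifier code; n is
taken here to be the number of messages the word appeared in):

"""
# Robinson's adjusted word probability:  f(w) = (a*x + n*p) / (a + n).
# p is the raw per-word spamprob, n the number of messages containing
# the word, x the unknown-word prior, a the strength of that prior.
def adjusted_spamprob(p, n, a=1.0, x=0.5):
    return (a * x + n * p) / (a + n)

# A word seen in a single spam is tempered from 1.0 down to 0.75 ...
print(adjusted_spamprob(p=1.0, n=1))     # 0.75
# ... while one seen in 50 spams keeps nearly all its raw strength.
print(adjusted_spamprob(p=1.0, n=50))    # ~0.990
"""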

Here's a before-and-after 10-fold cross-validation run, where "before" is the
default (our highly tweaked Graham scheme).  Each run trained on 18,000 hams
& 12,600 spams, then predicted 2,000 disjoint hams & 1,400 disjoint spams (a
sketch of the fold split follows the numbers below).  As will be clear soon,
the results aren't as bizarre as they look:

false positive percentages
    0.000  0.250  lost  +(was 0)
    0.000  0.200  lost  +(was 0)
    0.000  0.250  lost  +(was 0)
    0.000  0.100  lost  +(was 0)
    0.050  0.450  lost  +800.00%
    0.000  0.250  lost  +(was 0)
    0.000  0.150  lost  +(was 0)
    0.050  0.300  lost  +500.00%
    0.000  0.200  lost  +(was 0)
    0.100  0.350  lost  +250.00%

won   0 times
tied  0 times
lost 10 times

total unique fp went from 4 to 50 lost  +1150.00%
mean fp % went from 0.02 to 0.25 lost  +1150.00%

false negative percentages
    0.214  0.000  won   -100.00%
    0.286  0.000  won   -100.00%
    0.000  0.000  tied
    0.143  0.000  won   -100.00%
    0.143  0.000  won   -100.00%
    0.286  0.000  won   -100.00%
    0.143  0.071  won    -50.35%
    0.143  0.000  won   -100.00%
    0.286  0.000  won   -100.00%
    0.071  0.000  won   -100.00%

won   9 times
tied  1 times
lost  0 times

total unique fn went from 24 to 1 won    -95.83%
mean fn % went from 0.171428571428 to 0.00714285714286 won    -95.83%
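
For anyone reproducing the setup, the ten folds are just disjoint tenths of
the 20,000 ham and 14,000 spam:  train on the other nine tenths, predict the
held-out tenth.  A rough sketch of the split (not the actual test driver;
ham and spam are assumed to be lists of messages):

"""
# 20,000 ham / 14,000 spam become 10 disjoint slices of 2,000 / 1,400;
# each run trains on the other 9 slices (18,000 / 12,600) and predicts
# the held-out slice.
def ten_fold_pairs(ham, spam, nfolds=10):
    hsize, ssize = len(ham) // nfolds, len(spam) // nfolds
    for i in range(nfolds):
        test_ham = ham[i*hsize : (i+1)*hsize]
        test_spam = spam[i*ssize : (i+1)*ssize]
        train_ham = ham[:i*hsize] + ham[(i+1)*hsize:]
        train_spam = spam[:i*ssize] + spam[(i+1)*ssize:]
        yield (train_ham, train_spam), (test_ham, test_spam)
"""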

So this test was a disaster for the false positive rate and a huge win for
the false negative rate.  This is because 0.50 is too low a cutoff now:

Ham distribution for all runs:
* = 86 items
  0.00    0
  2.50    0
  5.00    0
  7.50    0
 10.00    0
 12.50    0
 15.00    0
 17.50    0
 20.00   26 *
 22.50  155 **
 25.00  627 ********
 27.50 1859 **********************
 30.00 3780 ********************************************
 32.50 5108 ************************************************************
 35.00 4264 **************************************************
 37.50 2450 *****************************
 40.00 1056 *************
 42.50  395 *****
 45.00  178 ***
 47.50   52 *
 50.00   30 *
 52.50   13 *
 55.00    4 *
 57.50    1 *
 60.00    1 *
 62.50    1 *
 65.00    0
 67.50    0
 70.00    0
 72.50    0
 75.00    0
 77.50    0
 80.00    0
 82.50    0
 85.00    0
 87.50    0
 90.00    0
 92.50    0
 95.00    0
 97.50    0

Spam distribution for all runs:
* = 50 items
  0.00    0
  2.50    0
  5.00    0
  7.50    0
 10.00    0
 12.50    0
 15.00    0
 17.50    0
 20.00    0
 22.50    0
 25.00    0
 27.50    0
 30.00    0
 32.50    0
 35.00    0
 37.50    0
 40.00    0
 42.50    1 *
 45.00    0
 47.50    0
 50.00    3 *
 52.50    6 *
 55.00   17 *
 57.50   40 *
 60.00   76 **
 62.50  171 ****
 65.00  394 ********
 67.50  710 ***************
 70.00 1247 *************************
 72.50 2358 ************************************************
 75.00 2986 ************************************************************
 77.50 2659 ******************************************************
 80.00 1957 ****************************************
 82.50 1069 **********************
 85.00  192 ****
 87.50   31 *
 90.00   61 **
 92.50   22 *
 95.00    0
 97.50    0

So if the cutoff were boosted to 0.575, we'd lose 30+13+4 = 47 fp, and gain
3+6+17 = 26 fn, for a grand total of 3 fp and 27 fn.  That would leave it
essentially indistinguishable from the "before" run, but it would get there
without artificial biases or limits.  Good show!  Against that, I have no
idea how to *predict* where to put the cutoff, and that simply wasn't an
issue before
(0.90 "just worked", and on this particular large test only 5 of the 20,000
ham scored above 0.10, while only 24 of the 14,000 spam scored below 0.90).
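
If eyeballing the histograms isn't satisfying, one crude way to pick the
cutoff mechanically is to scan every observed score and minimize a weighted
error count.  This is a hypothetical helper, not part of the test driver;
ham_scores/spam_scores and the 10x fp penalty are just placeholders:

"""
# Hypothetical:  choose spam_cutoff from already-scored messages by
# minimizing fp_weight*fp + fn (an fp hurts more than an fn).
def best_cutoff(ham_scores, spam_scores, fp_weight=10.0):
    best = None
    for cutoff in sorted(set(ham_scores) | set(spam_scores)):
        fp = sum(1 for s in ham_scores if s >= cutoff)   # ham called spam
        fn = sum(1 for s in spam_scores if s < cutoff)   # spam called ham
        cost = fp_weight * fp + fn
        if best is None or cost < best[0]:
            best = (cost, cutoff, fp, fn)
    return best[1:]   # (cutoff, fp, fn)

# For the run above, this should land right around the 0.575 eyeballed
# from the histograms.
"""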

The highest-scoring ham was again the fellow who added a one-line comment to
a quote of an entire Nigerian scam msg.  The second-highest was again the
lady looking for a Python course in the UK, damned by her employer's much
longer obnoxious sig.  These are all familiar.  Something we haven't seen
for a few weeks is back, though:  conference announcements have
systematically reappeared among the high-scoring ham.  Under the current
scheme, tokenization that preserves case also hates those, and so does
tokenization via word bigrams, and ditto via character 5-grams.  The
tokenization we're using now (case-folded unigrams, but preserving
punctuation) didn't hate them under the Graham scheme; indeed, that's the
only tokenization scheme I've tried that *doesn't* hate them.  They appear
to benefit from Graham's low max_discriminators:  the really spammish
"visit our website for more information!" stuff doesn't show up until near
the end, and by then enough 0.01 clues have been seen that the end doesn't
matter.

The lowest-scoring spam is embarrassing:  it has

    Subject: HOW TO BECOME A MILLIONAIRE IN WEEKS!!

and the body consists solely of a uuencoded text file, which we don't
decipher.  That's the only spam to score below 0.5.  The tokenizer doesn't
give the inferencer much to go on here.  You'd think, e.g., that MILLIONAIRE
in the subject line is a spam clue; and it is, but it's *so* blatant that
few spam actually do that, leaving

    prob('subject:MILLIONAIRE') = 0.666667

Two other words in the subject were actually stronger clues:

    prob('subject:WEEKS') = 0.833333
    prob('subject:HOW') = 0.867101
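
For what it's worth, combining just those three clues under Gary's
geometric-mean scheme should already push the score well above 0.5; it's
everything else in the message that drags it back below 0.5.  A
back-of-the-envelope sketch of the combining (just the formula, not the
classifier code):

"""
# Geometric-mean combining:
#   P = 1 - (prod(1-p_i)) ** (1/n)    spamminess
#   Q = 1 - (prod(p_i)) ** (1/n)      hamminess
#   score = (1 + (P - Q) / (P + Q)) / 2,  in [0.0, 1.0]
def robinson_score(probs):
    n = len(probs)
    prod_p = prod_not_p = 1.0
    for p in probs:
        prod_p *= p
        prod_not_p *= 1.0 - p
    P = 1.0 - prod_not_p ** (1.0 / n)
    Q = 1.0 - prod_p ** (1.0 / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

# The three subject clues by themselves:
print(robinson_score([0.666667, 0.833333, 0.867101]))   # ~0.79
"""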


Everyone who tests this (please do!  it looks very promising, although my
data only supports that it's not a regression -- I *expect* it will do
better for some of you):  pay attention to your score histograms and figure
out the best value for spam_cutoff from them.  That would be a good number
to report.  I'd also appreciate it if you played with max_discriminators
here, and/or with some other gimmick aiming to keep nearly-neutral words out
of the scoring; if, e.g., the presence of "the" (a canonical <wink> spamprob
0.5 word) moves a score closer to 0.5, that's really not helping things (as
far as I can see).  Note that if you fiddle with both, they're most likely
not independent, so be sure to keep looking at the histograms (they reveal a
hell of a lot more than the raw error rates do).
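
One such gimmick, purely as a sketch of the kind of thing I mean (the
function and the 0.1 threshold are made up, and it's not what the
classifier does now):  throw away any clue whose spamprob is within some
distance of 0.5 before combining, then keep at most max_discriminators of
the strongest survivors.

"""
# Hypothetical clue filter:  drop near-neutral words before combining.
# word_probs maps token -> spamprob; keep only clues at least min_dist
# away from the do-nothing 0.5, strongest first.
def significant_clues(word_probs, min_dist=0.1, max_discriminators=1500):
    strong = [(abs(p - 0.5), p, w) for w, p in word_probs.items()
              if abs(p - 0.5) >= min_dist]
    strong.sort(reverse=True)
    return [(w, p) for dist, p, w in strong[:max_discriminators]]

# A canonical 0.5 word like "the" never reaches the combining step, so
# it can't drag a score back toward 0.5.
"""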