[Spambayes] There Can Be Only One
Tim Peters
tim.one@comcast.net
Thu, 26 Sep 2002 17:34:49 -0400
[Greg Ward]
> ...
> Here are the tokenization options used for both runs:
>
> [Tokenizer]
> basic_header_tokenize: True
> basic_header_skip: received envelope-to delivered-to delivery-date
> mine_received_headers: True
>
> And here are the additional options for my Robinson run:
>
> [Classifier]
> use_robinson_combining: True
> use_robinson_probability: True
> robinson_probability_x: 0.5
> robinson_probability_a: 0.5
Note that everyone(?) has reported better results with "a" smaller than this;
e.g., I used 0.225 (or something like that -- read the archives <wink>) for
the last run I reported. The value of this parameter matters, and this group
grope is the first systematic attempt at tuning it.
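In case it isn't clear what "a" is actually doing, here's the shape of
Robinson's smoothing in sketch form (this is not the real classifier code;
I'm reading robinson_probability_a as the background strength s and
robinson_probability_x as the guess x used for words we have little or no
evidence about):

    def robinson_prob(spam_count, ham_count, nspam, nham, s=0.225, x=0.5):
        """Robinson-style smoothed spam probability for one word (sketch).

        s is the background strength ("a" here); x is the probability
        assumed for a word we've never seen ("x" here).
        """
        hamratio = ham_count / max(nham, 1)
        spamratio = spam_count / max(nspam, 1)
        # raw per-word estimate from the training counts
        p = spamratio / (hamratio + spamratio) if (hamratio + spamratio) else x
        n = spam_count + ham_count  # how much evidence we have for this word
        # shrink the raw estimate toward x: little evidence -> close to x,
        # lots of evidence -> close to p; a bigger s shrinks harder
        return (s * x + n * p) / (s + n)

A smaller "a" lets rare-but-strong words pull away from 0.5 sooner, which is
consistent with the better results people have reported around 0.225.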
> max_discriminators: 150
> robinson_minimum_prob_strength: 0.1
>
> [TestDriver]
> spam_cutoff: 0.525
>
> ...i.e. nothing special there. Turns out that I don't benefit from
> spam_cutoff < 0.5 after all, as I initially thought. Oh well.
>
> Comparison stats, Graham vs. Robinson:
>
> """
> run1.logs.txt -> run2.logs.txt
> [...]
> false positive percentages
> 0.000 0.500 lost +(was 0)
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.500 1.500 lost +200.00%
> 0.000 0.000 tied
> 1.000 1.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.500 0.500 tied
> 1.000 2.500 lost +150.00%
>
> won 0 times
> tied 7 times
> lost 3 times
>
> total unique fp went from 6 to 12 lost +100.00%
> mean fp % went from 0.3 to 0.6 lost +100.00%
>
> false negative percentages
> 2.000 1.000 won -50.00%
> 2.000 1.500 won -25.00%
> 1.000 0.500 won -50.00%
> 1.500 1.000 won -33.33%
> 1.000 0.500 won -50.00%
> 0.000 0.500 lost +(was 0)
> 1.000 1.000 tied
> 1.500 1.000 won -33.33%
> 1.000 1.000 tied
> 1.000 1.000 tied
>
> won 6 times
> tied 3 times
> lost 1 times
>
> total unique fn went from 24 to 18 won -25.00%
> mean fn % went from 1.2 to 0.9 won -25.00%
>
> ham mean ham sdev
Note that means and sdevs aren't interesting when comparing Graham to
Robinson -- the two schemes are wildly different in their basic combining
approach.
> ...
> Bottom line: Graham is better at avoiding FPs, Robinson is better at
> avoiding FNs.
This has everything to do with your spam_cutoff choice in the Robinson run.
Changing that necessarily improves one error rate at the expense of harming
the other. So if you *want* fewer f-p and more f-n, raise spam_cutoff. The
histograms can guide you without needing to rerun the test; increasing
nbuckets helps when there's significant overlap (changing nbuckets doesn't
change any scores, it just allows finer-grained analysis of what would
happen if you were to change spam_cutoff). See below.
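To spell out the arithmetic the histogram analysis is doing, here's a toy
sketch (made-up names, not the test driver's code):

    def error_counts(ham_scores, spam_scores, cutoff):
        """False positives/negatives implied by a given spam_cutoff.

        Scores are in [0.0, 1.0].  A ham scoring >= cutoff is a false
        positive; a spam scoring < cutoff is a false negative.
        """
        fp = sum(1 for s in ham_scores if s >= cutoff)
        fn = sum(1 for s in spam_scores if s < cutoff)
        return fp, fn

Raising the cutoff can only shrink fp and grow fn (and vice versa), and the
histogram buckets tell you exactly how many messages each move picks up or
drops -- which is all I'm doing in the annotations below.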
> Great. At least there's an obvious middle ground with Robinson.
Indeed there is, and it's a good one. Unfortunately, exactly where it lies
seems quite corpus-dependent.
>
> and now Robinson (not nearly as pretty as Tim's, but still quite
> tolerable):
The last histos I posted covered 20,000 ham and 14,000 spam, and look much
more like normal distributions because of that. READ MY NOTES inside this
quote of your histograms:
> -> <stat> Ham scores for all runs: 2000 items; mean 20.62; sdev 9.21
> * = 5 items
> 0.00 0
> 2.50 24 *****
> 5.00 79 ****************
> 7.50 125 *************************
> 10.00 175 ***********************************
> 12.50 162 *********************************
> 15.00 199 ****************************************
> 17.50 245 *************************************************
> 20.00 239 ************************************************
> 22.50 219 ********************************************
> 25.00 147 ******************************
> 27.50 114 ***********************
> 30.00 85 *****************
> 32.50 52 ***********
> 35.00 42 *********
> 37.50 27 ******
> 40.00 15 ***
> 42.50 11 ***
> 45.00 12 ***
> 47.50 9 **
> 50.00 7 **
------------------------- 0.525 is your spam_cutoff
> 52.50 5 *
------------------------- if you raised it to 0.55, you'd lose 5 f-p
> 55.00 2 *
------------------------- and if you raised it to 0.575, 5+2 = 7 f-p
> 57.50 4 *
------------------------- and if you raised it to 0.6, 5+2+4 = 11 f-p
at which point you'd have only 1 f-p left
> 60.00 0
> 62.50 1 *
> 65.00 0
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 0
> 77.50 0
> 80.00 0
> 82.50 0
> 85.00 0
> 87.50 0
> 90.00 0
> 92.50 0
> 95.00 0
> 97.50 0
>
> -> <stat> Spam scores for all runs: 2000 items; mean 81.60; sdev 11.11
> * = 4 items
> 0.00 0
> 2.50 0
> 5.00 0
> 7.50 0
> 10.00 0
> 12.50 0
> 15.00 0
> 17.50 0
> 20.00 0
> 22.50 0
> 25.00 0
> 27.50 0
> 30.00 0
> 32.50 0
> 35.00 0
> 37.50 0
> 40.00 2 *
> 42.50 4 *
> 45.00 3 *
> 47.50 5 **
> 50.00 4 *
------------------------------------- your spam_cutoff
> 52.50 8 **
------------------------------------- raised to 0.55, you'd gain 8 f-n
> 55.00 15 ****
> 57.50 23 ******
> 60.00 31 ********
> 62.50 49 *************
> 65.00 78 ********************
> 67.50 114 *****************************
> 70.00 137 ***********************************
> 72.50 139 ***********************************
> 75.00 114 *****************************
> 77.50 112 ****************************
> 80.00 117 ******************************
> 82.50 139 ***********************************
> 85.00 162 *****************************************
> 87.50 190 ************************************************
> 90.00 213 ******************************************************
> 92.50 140 ***********************************
> 95.00 88 **********************
> 97.50 113 *****************************
> -> best cutoff for all runs: 0.525
> -> with 12 fp + 18 fn = 30 mistakes
>
> Chew on that, stat-boy!
I did <wink>. If you raised your spam_cutoff to 0.55, your error rates
would be almost identical across the two methods, except that the Robinson
scheme's f-p and f-n almost all live in a narrow band. You're almost
certainly not using the best value for robinson_probability_a. Note that
the default histogram analysis minimizes fp+fn. If you hate fp more than
you hate fn (and you've said elsewhere that you do), figure out exactly how
much more you hate it <wink>. Call that h. Then, if you set the
best_cutoff_fp_weight option to h, the histogram analysis will minimize

    h * fp + fn

instead. Increasing nbuckets will help it make a finer-grained decision.
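In sketch form, that search is nothing fancier than this (made-up names
again -- the real knob is just best_cutoff_fp_weight in your options file):

    def best_cutoff(ham_scores, spam_scores, h=1.0, nbuckets=200):
        """Pick the cutoff minimizing h*fp + fn (sketch).

        h is how much more a false positive hurts than a false negative;
        more buckets means more candidate cutoffs, hence a finer answer.
        """
        best = None
        for i in range(nbuckets + 1):
            cutoff = i / nbuckets
            fp = sum(1 for s in ham_scores if s >= cutoff)
            fn = sum(1 for s in spam_scores if s < cutoff)
            cost = h * fp + fn
            if best is None or cost < best[0]:
                best = (cost, cutoff, fp, fn)
        return best  # (cost, cutoff, fp, fn)

With h = 1 that reduces to minimizing fp + fn, which is what the "best
cutoff for all runs" line in your output is reporting.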
It would be good to try again with a better value for robinson_probability_a
(although part of this exercise is for you to get your data to tell you
which value works best for it, not to have someone else guess).
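If you'd rather let brute force do the guessing, a sweep along these lines
is all it takes (run_one_cv is a stand-in for however you drive a full
cross-validation run with a given option value -- it's not a real spambayes
function):

    # Hypothetical sweep over candidate values of robinson_probability_a.
    def sweep_a(run_one_cv, candidates=(0.1, 0.225, 0.35, 0.5)):
        results = []
        for a in candidates:
            fp, fn = run_one_cv(robinson_probability_a=a)  # your own driver
            results.append((fp + fn, a, fp, fn))
        results.sort()  # fewest total mistakes first
        return results

Just remember that whatever value wins on your corpus is telling you about
your corpus, not necessarily about anyone else's.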