[Spambayes] There Can Be Only One
Tim Peters
tim.one@comcast.net
Thu, 26 Sep 2002 17:34:49 -0400
[Greg Ward]
> ...
> Here are the tokenization options used for both runs:
>
> [Tokenizer]
> basic_header_tokenize: True
> basic_header_skip: received envelope-to delivered-to delivery-date
> mine_received_headers: True
>
> And here are the additional options for my Robinson run:
>
> [Classifier]
> use_robinson_combining: True
> use_robinson_probability: True
> robinson_probability_x: 0.5
> robinson_probability_a: 0.5
Note that everyone(?) has reported better results with "a" smaller than this;
e.g., I used 0.225 (or something like that -- read the archives <wink>) for
the last run I reported. The value of this parameter matters, and this group
grope is the first systematic attempt at tuning it.
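In case it isn't clear what "a" is actually doing, here's the shape of
Robinson's smoothing in sketch form (this is not the real classifier code;
I'm reading robinson_probability_a as the background strength s and
robinson_probability_x as the guess x used for words we have little or no
evidence about):

    def robinson_prob(spam_count, ham_count, nspam, nham, s=0.225, x=0.5):
        """Robinson-style smoothed spam probability for one word (sketch).

        s is the background strength ("a" here); x is the probability
        assumed for a word we've never seen ("x" here).
        """
        hamratio = ham_count / max(nham, 1)
        spamratio = spam_count / max(nspam, 1)
        # raw per-word estimate from the training counts
        p = spamratio / (hamratio + spamratio) if (hamratio + spamratio) else x
        n = spam_count + ham_count  # how much evidence we have for this word
        # shrink the raw estimate toward x: little evidence -> close to x,
        # lots of evidence -> close to p; a bigger s shrinks harder
        return (s * x + n * p) / (s + n)

A smaller "a" lets rare-but-strong words pull away from 0.5 sooner, which is
consistent with the better results people have reported around 0.225.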
> max_discriminators: 150
> robinson_minimum_prob_strength: 0.1
>
> [TestDriver]
> spam_cutoff: 0.525
>
> ...i.e. nothing special there. Turns out that I don't benefit from
> spam_cutoff < 0.5 after all, as I initially thought. Oh well.
>
> Comparison stats, Graham vs. Robinson:
>
> """
> run1.logs.txt -> run2.logs.txt
> [...]
> false positive percentages
> 0.000 0.500 lost +(was 0)
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.500 1.500 lost +200.00%
> 0.000 0.000 tied
> 1.000 1.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.500 0.500 tied
> 1.000 2.500 lost +150.00%
>
> won 0 times
> tied 7 times
> lost 3 times
>
> total unique fp went from 6 to 12 lost +100.00%
> mean fp % went from 0.3 to 0.6 lost +100.00%
>
> false negative percentages
> 2.000 1.000 won -50.00%
> 2.000 1.500 won -25.00%
> 1.000 0.500 won -50.00%
> 1.500 1.000 won -33.33%
> 1.000 0.500 won -50.00%
> 0.000 0.500 lost +(was 0)
> 1.000 1.000 tied
> 1.500 1.000 won -33.33%
> 1.000 1.000 tied
> 1.000 1.000 tied
>
> won 6 times
> tied 3 times
> lost 1 times
>
> total unique fn went from 24 to 18 won -25.00%
> mean fn % went from 1.2 to 0.9 won -25.00%
>
> ham mean ham sdev
Note that means and sdevs aren't interesting when comparing Graham to
Robinson -- the two schemes are wildly different in their basic combining
approach.
> ...
> Bottom line: Graham is better at avoiding FPs, Robinson is better at
> avoiding FNs.
This has everything to do with your spam_cutoff choice in the Robinson run.
Changing that necessarily improves one error rate at the expense of harming
the other. So if you *want* fewer f-p and more f-n, raise spam_cutoff. The
histograms can guide you without needing to rerun the test; increasing
nbuckets helps when there's significant overlap (changing nbuckets doesn't
change any scores, it just allows finer-grained analysis of what would
happen if you were to change spam_cutoff). See below.
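To spell out the arithmetic the histogram analysis is doing, here's a toy
sketch (made-up names, not the test driver's code):

    def error_counts(ham_scores, spam_scores, cutoff):
        """False positives/negatives implied by a given spam_cutoff.

        Scores are in [0.0, 1.0].  A ham scoring >= cutoff is a false
        positive; a spam scoring < cutoff is a false negative.
        """
        fp = sum(1 for s in ham_scores if s >= cutoff)
        fn = sum(1 for s in spam_scores if s < cutoff)
        return fp, fn

Raising the cutoff can only shrink fp and grow fn (and vice versa), and the
histogram buckets tell you exactly how many messages each move picks up or
drops -- which is all I'm doing in the annotations below.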
> Great. At least there's an obvious middle ground with Robinson.
Indeed there is, and it's a good one. Unfortunately, exactly where it lies
seems quite corpus-dependent.
>
> and now Robinson (not nearly as pretty as Tim's, but still quite
> tolerable):
The last histos I posted covered 20,000 ham and 14,000 spam, and look much
more like normal distributions because of that. READ MY NOTES inside this
quote of your histograms:
> -> <stat> Ham scores for all runs: 2000 items; mean 20.62; sdev 9.21
> * = 5 items
> 0.00 0
> 2.50 24 *****
> 5.00 79 ****************
> 7.50 125 *************************
> 10.00 175 ***********************************
> 12.50 162 *********************************
> 15.00 199 ****************************************
> 17.50 245 *************************************************
> 20.00 239 ************************************************
> 22.50 219 ********************************************
> 25.00 147 ******************************
> 27.50 114 ***********************
> 30.00 85 *****************
> 32.50 52 ***********
> 35.00 42 *********
> 37.50 27 ******
> 40.00 15 ***
> 42.50 11 ***
> 45.00 12 ***
> 47.50 9 **
> 50.00 7 **
------------------------- 0.525 is your spam_cutoff
> 52.50 5 *
------------------------- if you raised it to 0.55, you'd lose 5 f-p
> 55.00 2 *
------------------------- and if you raised it to 0.575, 5+2 = 7 f-p
> 57.50 4 *
------------------------- and if you raised it to 0.6, 5+2+4 = 11 f-p
at which point you'd have only 1 f-p left
> 60.00 0
> 62.50 1 *
> 65.00 0
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 0
> 77.50 0
> 80.00 0
> 82.50 0
> 85.00 0
> 87.50 0
> 90.00 0
> 92.50 0
> 95.00 0
> 97.50 0
>
> -> <stat> Spam scores for all runs: 2000 items; mean 81.60; sdev 11.11
> * = 4 items
> 0.00 0
> 2.50 0
> 5.00 0
> 7.50 0
> 10.00 0
> 12.50 0
> 15.00 0
> 17.50 0
> 20.00 0
> 22.50 0
> 25.00 0
> 27.50 0
> 30.00 0
> 32.50 0
> 35.00 0
> 37.50 0
> 40.00 2 *
> 42.50 4 *
> 45.00 3 *
> 47.50 5 **
> 50.00 4 *
------------------------------------- your spam_cutoff
> 52.50 8 **
------------------------------------- raised to 0.55, you'd gain 8 f-n
> 55.00 15 ****
> 57.50 23 ******
> 60.00 31 ********
> 62.50 49 *************
> 65.00 78 ********************
> 67.50 114 *****************************
> 70.00 137 ***********************************
> 72.50 139 ***********************************
> 75.00 114 *****************************
> 77.50 112 ****************************
> 80.00 117 ******************************
> 82.50 139 ***********************************
> 85.00 162 *****************************************
> 87.50 190 ************************************************
> 90.00 213 ******************************************************
> 92.50 140 ***********************************
> 95.00 88 **********************
> 97.50 113 *****************************
> -> best cutoff for all runs: 0.525
> -> with 12 fp + 18 fn = 30 mistakes
>
> Chew on that, stat-boy!
I did <wink>. If you raised your spam_cutoff to 0.55, your error rates
would be almost identical across the two methods, except that the Robinson
scheme's f-p and f-n almost all live in a narrow band. You're almost
certainly not using the best value for robinson_probability_a. Note that
the default histogram analysis minimizes fp+fn. If you hate fp more than
you hate fn (and you've said elsewhere that you do), figure out exactly how
much more you hate it <wink>. Call that h. Then, if you set the
best_cutoff_fp_weight option to h, the histogram analysis will minimize

    h * fp + fn

instead. Increasing nbuckets will help it make a finer-grained decision.
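In sketch form, that search is nothing fancier than this (made-up names
again -- the real knob is just best_cutoff_fp_weight in your options file):

    def best_cutoff(ham_scores, spam_scores, h=1.0, nbuckets=200):
        """Pick the cutoff minimizing h*fp + fn (sketch).

        h is how much more a false positive hurts than a false negative;
        more buckets means more candidate cutoffs, hence a finer answer.
        """
        best = None
        for i in range(nbuckets + 1):
            cutoff = i / nbuckets
            fp = sum(1 for s in ham_scores if s >= cutoff)
            fn = sum(1 for s in spam_scores if s < cutoff)
            cost = h * fp + fn
            if best is None or cost < best[0]:
                best = (cost, cutoff, fp, fn)
        return best  # (cost, cutoff, fp, fn)

With h = 1 that reduces to minimizing fp + fn, which is what the "best
cutoff for all runs" line in your output is reporting.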
It would be good to try again with a better value for robinson_probability_a
(although part of this exercise is for you to get your data to tell you
which value works best for it, not to have someone else guess).
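If you'd rather let brute force do the guessing, a sweep along these lines
is all it takes (run_one_cv is a stand-in for however you drive a full
cross-validation run with a given option value -- it's not a real spambayes
function):

    # Hypothetical sweep over candidate values of robinson_probability_a.
    def sweep_a(run_one_cv, candidates=(0.1, 0.225, 0.35, 0.5)):
        results = []
        for a in candidates:
            fp, fn = run_one_cv(robinson_probability_a=a)  # your own driver
            results.append((fp + fn, a, fp, fn))
        results.sort()  # fewest total mistakes first
        return results

Just remember that whatever value wins on your corpus is telling you about
your corpus, not necessarily about anyone else's.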