[Spambayes] There Can Be Only One
Tim Peters
tim.one@comcast.net
Wed, 25 Sep 2002 22:01:10 -0400
[Guido, tries Gary's scheme with max_discriminators at 150, and then at 15]
> ...
> Ah, and here are the results for md=15 (left == md=150, right == md=15):
>
> false positive percentages
> 0.500 0.000 won -100.00%
> 0.000 0.000 tied
> 0.000 0.500 lost +(was 0)
> 0.000 1.000 lost +(was 0)
> 0.500 1.000 lost +100.00%
> 0.500 0.500 tied
> 0.000 0.500 lost +(was 0)
> 0.500 1.500 lost +200.00%
> 0.000 0.000 tied
> 1.000 1.000 tied
>
> won 1 times
> tied 4 times
> lost 5 times
>
> total unique fp went from 6 to 12 lost +100.00%
> mean fp % went from 0.3 to 0.6 lost +100.00%
>
> false negative percentages
> 0.500 0.000 won -100.00%
> 0.500 0.500 tied
> 0.500 0.500 tied
> 1.000 0.500 won -50.00%
> 1.000 0.000 won -100.00%
> 1.000 1.000 tied
> 1.000 0.500 won -50.00%
> 0.000 0.000 tied
> 0.500 0.500 tied
> 2.000 1.500 won -25.00%
>
> won 5 times
> tied 5 times
> lost 0 times
>
> total unique fn went from 16 to 10 won -37.50%
> mean fn % went from 0.8 to 0.5 won -37.50%
>
> The histograms look totally different here though!
That part isn't surprising -- this is making it look at only two handfuls of
*the* most extreme words in a msg, just like the Graham scheme does. "Only
extremes in, only extremes out" applies here too, although not as viciously
under Gary's combining scheme as under Graham's.
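
To make "only extremes in, only extremes out" concrete, here is a minimal Python sketch. It is not the project's actual classifier code: the function names are hypothetical, the word probabilities are made up, and the Gary-style formula is the geometric-mean form as I understand it (the exact variant used in the runs above is an assumption). It keeps the max_discriminators probabilities farthest from 0.5 and combines them both ways.

```python
import math

def most_extreme(probs, max_discriminators):
    """Keep the max_discriminators probabilities farthest from 0.5.

    Hypothetical helper; assumes every probability is strictly
    between 0 and 1.
    """
    ranked = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:max_discriminators]

def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    prod_p = math.prod(probs)
    prod_q = math.prod(1.0 - p for p in probs)
    return prod_p / (prod_p + prod_q)

def gary_combine(probs):
    """Gary-style combining via geometric means (assumed form)."""
    n = len(probs)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0

# Made-up word probabilities for one message: mostly bland,
# with a few very strong clues.
probs = [0.99, 0.02, 0.6, 0.5, 0.45, 0.97]
extremes = most_extreme(probs, 3)

graham_score = graham_combine(extremes)
gary_score = gary_combine(extremes)
print(extremes, graham_score, gary_score)
```

Fed the same three extreme clues, the Graham-style product lands near 1 while the geometric-mean form stays much closer to the middle, which is one way to read "not as viciously" above.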
> -> <stat> Ham scores for all runs: 2000 items; mean 11.01; sample sdev 15.30
> * = 21 items
> 0.00 1201 **********************************************************
> 2.50 53 ***
> 5.00 9 *
> 7.50 7 *
> 10.00 2 *
> 12.50 29 **
> 15.00 87 *****
> 17.50 100 *****
> 20.00 67 ****
> 22.50 48 ***
> 25.00 54 ***
> 27.50 54 ***
> 30.00 41 **
> 32.50 49 ***
> 35.00 46 ***
> 37.50 29 **
> 40.00 23 **
> 42.50 30 **
> 45.00 19 *
> 47.50 16 *
> 50.00 9 *
> 52.50 7 *
> 55.00 2 *
> 57.50 6 *
> 60.00 5 *
> 62.50 1 *
> 65.00 1 *
> 67.50 1 *
> 70.00 0
> 72.50 0
> 75.00 0
> 77.50 2 *
> 80.00 0
> 82.50 0
> 85.00 1 *
> 87.50 0
> 90.00 0
> 92.50 0
> 95.00 1 *
> 97.50 0
>
> -> <stat> Spam scores for all runs: 2000 items; mean 95.00; sample sdev 7.86
> * = 21 items
> [...]
> 47.50 1 *
> 50.00 1 *
> 52.50 1 *
> 55.00 3 *
> 57.50 4 *
> 60.00 8 *
> 62.50 9 *
> 65.00 11 *
> 67.50 15 *
> 70.00 13 *
> 72.50 15 *
> 75.00 23 **
> 77.50 51 ***
> 80.00 65 ****
> 82.50 41 **
> 85.00 2 *
> 87.50 9 *
> 90.00 12 *
> 92.50 66 ****
> 95.00 425 *********************
> 97.50 1225 ***********************************************************
>
> Note the hams scoring all the way in the 90s.
We see that under Graham's scheme too (else there would never be false
positives there -- we use spam_cutoff 0.90 there); extremes in, etc.
> There are no spams here!
I didn't catch the meaning. You had 1728 (12+66+425+1225) spam "scoring all
the way in the 90s", so "here" must refer to something else?
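
The tally can be checked mechanically. A small sketch, with the bucket counts copied from the tail of the quoted spam histogram (the helper name is hypothetical):

```python
# (bucket lower bound, count) pairs from the quoted spam histogram,
# 85.00 and up.
spam_buckets = [
    (85.0, 2),
    (87.5, 9),
    (90.0, 12),
    (92.5, 66),
    (95.0, 425),
    (97.5, 1225),
]

def count_at_or_above(buckets, floor):
    """Sum the counts of all buckets whose lower bound is >= floor."""
    return sum(count for lower, count in buckets if lower >= floor)

in_the_90s = count_at_or_above(spam_buckets, 90.0)
print(in_the_90s)  # 1728
```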