[Spambayes] There Can Be Only One
Tim Peters
tim.one@comcast.net
Wed, 25 Sep 2002 17:22:06 -0400
[Skip Montanaro]
> And now, coming in from faaaar out in right field we have Skip, the weirdo
> with the wacky results:
Speaking of which, are you going to follow up on
http://mail.python.org/pipermail/spambayes/2002-September/000211.html
?
> grahams -> fws
> -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
> ...
>
> false positive percentages
> 0.000 0.000 tied
> 0.000 0.500 lost +(was 0)
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.500 0.500 tied
> 0.000 0.000 tied
> 0.500 0.000 won -100.00%
> 0.000 0.000 tied
>
> won 1 times
> tied 8 times
> lost 1 times
>
> total unique fp went from 2 to 2 tied
> mean fp % went from 0.1 to 0.1 tied
So a 0.1% f-p rate is very low with this little training data.
> false negative percentages
> 9.000 10.500 lost +16.67%
> 11.000 14.000 lost +27.27%
> 10.000 12.000 lost +20.00%
> 7.000 8.500 lost +21.43%
> 14.000 19.500 lost +39.29%
> 12.000 12.000 tied
> 13.000 17.000 lost +30.77%
> 9.000 12.500 lost +38.89%
> 9.000 13.000 lost +44.44%
> 9.500 12.000 lost +26.32%
>
> won 0 times
> tied 1 times
> lost 9 times
>
> total unique fn went from 207 to 262 lost +26.57%
> mean fn % went from 10.35 to 13.1 lost +26.57%
And this is a supernaturally high increase. When the f-p rate is
supernaturally low, and the f-n rate supernaturally high, the system is too
willing to call everything ham. Your histograms show that you have 56 false
negatives living in the 0.525-0.550 bucket, so dropping spam_cutoff to 0.525
would get your f-n rate back near where it was. Yours is the only report so
far where dropping spam_cutoff would help; others have reported that
increasing it would help; I suggested the value that was optimal for my run.
Also read the parent msg in this thread for other things you should try (the
f(w) scheme hasn't been tuned -- tuning it for your data is part of the job
here).
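Just to make the arithmetic above concrete, here's a back-of-the-envelope
sketch (the function name and cutoff values are illustrative, not tester
code) of how many false negatives a lower spam_cutoff would reclaim: a spam
scoring at or above the new cutoff but below the old one flips from false
negative to correctly classified.

```python
def fn_recovered(spam_scores, old_cutoff, new_cutoff):
    """Count spams that are false negatives at old_cutoff but would be
    correctly called spam at the lower new_cutoff."""
    return sum(1 for s in spam_scores if new_cutoff <= s < old_cutoff)

# E.g., the 56 spams in the 0.525-0.550 bucket would all be recovered
# by dropping the cutoff from 0.550 to 0.525.
```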
If you cvs up, the histogram pairs are now followed by an account of the
spam_cutoff rates that minimize fp+fn. You can set another new option to
make that analysis look more kindly upon false negatives than false
positives (or vice versa). I recommend that you also set nbuckets to 100 or
200 (the default is 40), as you've got much denser overlap than other people
are seeing, and a change of 0.025 in spam_cutoff for you in the overlap
range is going to reclassify 50-100 messages. More buckets would allow
finer-grained analysis.
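The cutoff analysis described above amounts to a brute-force search over
candidate cutoffs for the one minimizing fp + fn, optionally weighting false
negatives differently from false positives. A minimal sketch of that idea
(this is not the actual tester code; names and the >= convention are
assumptions):

```python
def best_cutoff(ham_scores, spam_scores, fn_weight=1.0):
    """Return (cutoff, cost) minimizing #fp + fn_weight * #fn.

    A message scoring >= cutoff is called spam, so a ham at or above
    the cutoff is a false positive, and a spam below it is a false
    negative.  fn_weight > 1 makes the analysis look less kindly on
    false negatives; fn_weight < 1 less kindly on false positives.
    """
    best = None
    for cutoff in sorted(set(ham_scores) | set(spam_scores)):
        fp = sum(1 for s in ham_scores if s >= cutoff)
        fn = sum(1 for s in spam_scores if s < cutoff)
        cost = fp + fn_weight * fn
        if best is None or cost < best[1]:
            best = (cutoff, cost)
    return best
```

With denser overlap between the ham and spam score distributions, more
candidate cutoffs fall in the contested region, which is why finer-grained
histograms (more buckets) matter for this kind of analysis.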
> ...
> Here are the overall graphs from the f(w) run:
>
> -> <stat> Ham scores for all runs: 2000 items; mean 22.67;
> sample sdev 7.57
> * = 5 items
> 0.00 5 *
> 2.50 6 **
> 5.00 16 ****
> 7.50 48 **********
> 10.00 80 ****************
> 12.50 121 *************************
> 15.00 195 ***************************************
> 17.50 277 ********************************************************
> 20.00 278 ********************************************************
> 22.50 260 ****************************************************
> 25.00 257 ****************************************************
> 27.50 168 **********************************
> 30.00 125 *************************
> 32.50 55 ***********
> 35.00 40 ********
> 37.50 20 ****
> 40.00 18 ****
> 42.50 12 ***
> 45.00 7 **
> 47.50 4 *
> 50.00 1 *
> 52.50 5 *
> 55.00 0
> 57.50 0
> 60.00 2 *
> 62.50 0
> 65.00 0
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 0
> 77.50 0
> 80.00 0
> 82.50 0
> 85.00 0
> 87.50 0
> 90.00 0
> 92.50 0
> 95.00 0
> 97.50 0
>
> -> <stat> Spam scores for all runs: 2000 items; mean 69.27;
> sample sdev 12.38
> * = 3 items
> 0.00 0
> 2.50 0
> 5.00 0
> 7.50 0
> 10.00 0
> 12.50 0
> 15.00 0
> 17.50 0
> 20.00 1 *
> 22.50 0
> 25.00 1 *
> 27.50 2 *
> 30.00 1 *
> 32.50 6 **
> 35.00 4 **
> 37.50 10 ****
> 40.00 23 ********
> 42.50 20 *******
> 45.00 37 *************
> 47.50 41 **************
> 50.00 60 ********************
> 52.50 56 *******************
> 55.00 70 ************************
> 57.50 107 ************************************
> 60.00 124 ******************************************
> 62.50 121 *****************************************
> 65.00 151 ***************************************************
> 67.50 163 *******************************************************
> 70.00 166 ********************************************************
> 72.50 169 *********************************************************
> 75.00 139 ***********************************************
> 77.50 126 ******************************************
> 80.00 102 **********************************
> 82.50 95 ********************************
> 85.00 79 ***************************
> 87.50 53 ******************
> 90.00 46 ****************
> 92.50 23 ********
> 95.00 4 **
> 97.50 0
>
> It looks to me like the problem (whatever it is) with my data is that it
> causes the standard deviation to get quite large, though the mean
> spam score seems to be a bit lower than other stuff I've seen. What would
> cause that? More variability in my spam?
Possibly, although that's hard to swallow since spam seems so much alike in
so many ways. Why would the spam you get be more variable than the spam
everyone else gets?
Another possibility is that your ham contains popular spam words. For
example, if you have significant ham that talks about diet and exercise,
Nigeria, or the size of assorted sexual organs, that would spread out your
spam scores. As mentioned at the top, you should follow up on getting an
experienced eyeball to look at this. You have an amazing percentage of
low-scoring spam!
Just for fun, try setting robinson_minimum_prob_strength to 0.4.
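For readers unfamiliar with that option: robinson_minimum_prob_strength
discards words whose spam probability sits too close to the neutral 0.5, so
only strongly ham- or spam-flavored words contribute to the score. A rough
illustrative sketch of the filtering (not the actual Spambayes code):

```python
def significant(word_probs, strength=0.4):
    """Keep only (word, spamprob) pairs whose probability is at least
    `strength` away from the neutral 0.5; the rest are ignored when
    scoring a message.  strength=0.4 is the value suggested above."""
    return [(w, p) for w, p in word_probs if abs(p - 0.5) >= strength]
```

With strength 0.4, only words scoring below 0.1 or above 0.9 survive, which
can help when middling-probability words are dragging spam scores toward the
overlap region.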