[Spambayes] There Can Be Only One
Tim Peters
tim.one@comcast.net
Wed, 25 Sep 2002 17:22:06 -0400
[Skip Montanaro]
> And now, coming in from faaaar out in right field we have Skip, the weirdo
> with the wacky results:
Speaking of which, are you going to follow up on
http://mail.python.org/pipermail/spambayes/2002-September/000211.html
?
> grahams -> fws
> -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
> ...
>
> false positive percentages
> 0.000 0.000 tied
> 0.000 0.500 lost +(was 0)
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.500 0.500 tied
> 0.000 0.000 tied
> 0.500 0.000 won -100.00%
> 0.000 0.000 tied
>
> won 1 times
> tied 8 times
> lost 1 times
>
> total unique fp went from 2 to 2 tied
> mean fp % went from 0.1 to 0.1 tied
So a 0.1% f-p rate is very low with this little training data.
> false negative percentages
> 9.000 10.500 lost +16.67%
> 11.000 14.000 lost +27.27%
> 10.000 12.000 lost +20.00%
> 7.000 8.500 lost +21.43%
> 14.000 19.500 lost +39.29%
> 12.000 12.000 tied
> 13.000 17.000 lost +30.77%
> 9.000 12.500 lost +38.89%
> 9.000 13.000 lost +44.44%
> 9.500 12.000 lost +26.32%
>
> won 0 times
> tied 1 times
> lost 9 times
>
> total unique fn went from 207 to 262 lost +26.57%
> mean fn % went from 10.35 to 13.1 lost +26.57%
And this is a supernaturally high increase. When the f-p rate is
supernaturally low, and the f-n rate supernaturally high, the system is too
willing to call everything ham. Your histograms show that you have 56 false
negatives living in the 0.525-0.550 bucket, so dropping spam_cutoff to 0.525
would get your f-n rate back near where it was. Yours is the only report so
far where dropping spam_cutoff would help; others have reported that
increasing it would help; I suggested the value that was optimal for my run.
Also read the parent msg in this thread for other things you should try (the
f(w) scheme hasn't been tuned -- tuning it for your data is part of the job
here).
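Just to make the arithmetic above concrete, here's a back-of-the-envelope
sketch (the function name and cutoff values are illustrative, not tester
code) of how many false negatives a lower spam_cutoff would reclaim: a spam
scoring at or above the new cutoff but below the old one flips from false
negative to correctly classified.

```python
def fn_recovered(spam_scores, old_cutoff, new_cutoff):
    """Count spams that are false negatives at old_cutoff but would be
    correctly called spam at the lower new_cutoff."""
    return sum(1 for s in spam_scores if new_cutoff <= s < old_cutoff)

# E.g., the 56 spams in the 0.525-0.550 bucket would all be recovered
# by dropping the cutoff from 0.550 to 0.525.
```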
If you cvs up, the histogram pairs are now followed by an account of the
spam_cutoff rates that minimize fp+fn. You can set another new option to
make that analysis look more kindly upon false negatives than false
positives (or vice versa). I recommend that you also set nbuckets to 100 or
200 (the default is 40), as you've got much denser overlap than other people
are seeing, and a change of 0.025 in spam_cutoff for you in the overlap
range is going to reclassify 50-100 messages. More buckets would allow
finer-grained analysis.
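The cutoff analysis described above amounts to a brute-force search over
candidate cutoffs for the one minimizing fp + fn, optionally weighting false
negatives differently from false positives. A minimal sketch of that idea
(this is not the actual tester code; names and the >= convention are
assumptions):

```python
def best_cutoff(ham_scores, spam_scores, fn_weight=1.0):
    """Return (cutoff, cost) minimizing #fp + fn_weight * #fn.

    A message scoring >= cutoff is called spam, so a ham at or above
    the cutoff is a false positive, and a spam below it is a false
    negative.  fn_weight > 1 makes the analysis look less kindly on
    false negatives; fn_weight < 1 less kindly on false positives.
    """
    best = None
    for cutoff in sorted(set(ham_scores) | set(spam_scores)):
        fp = sum(1 for s in ham_scores if s >= cutoff)
        fn = sum(1 for s in spam_scores if s < cutoff)
        cost = fp + fn_weight * fn
        if best is None or cost < best[1]:
            best = (cutoff, cost)
    return best
```

With denser overlap between the ham and spam score distributions, more
candidate cutoffs fall in the contested region, which is why finer-grained
histograms (more buckets) matter for this kind of analysis.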
> ...
> Here are the overall graphs from the f(w) run:
>
> -> <stat> Ham scores for all runs: 2000 items; mean 22.67;
> sample sdev 7.57
> * = 5 items
> 0.00 5 *
> 2.50 6 **
> 5.00 16 ****
> 7.50 48 **********
> 10.00 80 ****************
> 12.50 121 *************************
> 15.00 195 ***************************************
> 17.50 277 ********************************************************
> 20.00 278 ********************************************************
> 22.50 260 ****************************************************
> 25.00 257 ****************************************************
> 27.50 168 **********************************
> 30.00 125 *************************
> 32.50 55 ***********
> 35.00 40 ********
> 37.50 20 ****
> 40.00 18 ****
> 42.50 12 ***
> 45.00 7 **
> 47.50 4 *
> 50.00 1 *
> 52.50 5 *
> 55.00 0
> 57.50 0
> 60.00 2 *
> 62.50 0
> 65.00 0
> 67.50 0
> 70.00 0
> 72.50 0
> 75.00 0
> 77.50 0
> 80.00 0
> 82.50 0
> 85.00 0
> 87.50 0
> 90.00 0
> 92.50 0
> 95.00 0
> 97.50 0
>
> -> <stat> Spam scores for all runs: 2000 items; mean 69.27;
> sample sdev 12.38
> * = 3 items
> 0.00 0
> 2.50 0
> 5.00 0
> 7.50 0
> 10.00 0
> 12.50 0
> 15.00 0
> 17.50 0
> 20.00 1 *
> 22.50 0
> 25.00 1 *
> 27.50 2 *
> 30.00 1 *
> 32.50 6 **
> 35.00 4 **
> 37.50 10 ****
> 40.00 23 ********
> 42.50 20 *******
> 45.00 37 *************
> 47.50 41 **************
> 50.00 60 ********************
> 52.50 56 *******************
> 55.00 70 ************************
> 57.50 107 ************************************
> 60.00 124 ******************************************
> 62.50 121 *****************************************
> 65.00 151 ***************************************************
> 67.50 163 *******************************************************
> 70.00 166 ********************************************************
> 72.50 169 *********************************************************
> 75.00 139 ***********************************************
> 77.50 126 ******************************************
> 80.00 102 **********************************
> 82.50 95 ********************************
> 85.00 79 ***************************
> 87.50 53 ******************
> 90.00 46 ****************
> 92.50 23 ********
> 95.00 4 **
> 97.50 0
>
> It looks to me like the problem (whatever it is) with my data is that it
> causes the standard deviation to get quite large, though the mean
> spam score seems to be a bit lower than other stuff I've seen. What would
> cause that? More variability in my spam?
Possibly, although that's hard to swallow since spam seems so much alike in
so many ways. Why would the spam you get be more variable than the spam
everyone else gets?
Another possibility is that your ham contains popular spam words. For
example, if you have significant ham that talks about diet and exercise,
Nigeria, or the size of assorted sexual organs, that would spread out your
spam scores. As mentioned at the top, you should follow up on getting an
experienced eyeball to look at this. You have an amazing percentage of
low-scoring spam!
Just for fun, try setting robinson_minimum_prob_strength to 0.4.
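For readers unfamiliar with that option: robinson_minimum_prob_strength
discards words whose spam probability sits too close to the neutral 0.5, so
only strongly ham- or spam-flavored words contribute to the score. A rough
illustrative sketch of the filtering (not the actual Spambayes code):

```python
def significant(word_probs, strength=0.4):
    """Keep only (word, spamprob) pairs whose probability is at least
    `strength` away from the neutral 0.5; the rest are ignored when
    scoring a message.  strength=0.4 is the value suggested above."""
    return [(w, p) for w, p in word_probs if abs(p - 0.5) >= strength]
```

With strength 0.4, only words scoring below 0.1 or above 0.9 survive, which
can help when middling-probability words are dragging spam scores toward the
overlap region.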