[Spambayes] Moving closer to Gary's ideal

Tim Peters tim.one@comcast.net
Sun, 22 Sep 2002 00:14:30 -0400


[Guido van Rossum, on
    [TestDriver]
    spam_cutoff: 0.575
]

> It's too soon for me to say for sure, but from a previous run that
> very same cutoff also looks like it would be a winner for my corpus!

You can tell for sure just by looking at the score histograms and counting
the dots <wink>; there's no need to change spam_cutoff and then rerun the
test (spam_cutoff has no effect on the scores computed); I've walked through
that process in slow motion several times on the list now.
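
For example, here's a minimal sketch of that process (not the actual test
driver code; it assumes you've kept the final ham and spam score lists
from a run, on a 0.0-1.0 scale, and that a message counts as spam when
its score is >= the cutoff):

    def error_counts(ham_scores, spam_scores, cutoff):
        # spam_cutoff doesn't change the scores, so any candidate
        # cutoff can be evaluated against the scores saved from a
        # single run.
        false_pos = len([s for s in ham_scores if s >= cutoff])
        false_neg = len([s for s in spam_scores if s < cutoff])
        return false_pos, false_neg

    for cutoff in (0.50, 0.525, 0.55, 0.575, 0.60):
        fp, fn = error_counts(ham_scores, spam_scores, cutoff)
        print("%.3f  fp=%d  fn=%d" % (cutoff, fp, fn))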

> Neil's corpus likes a similar value (0.56).

Until he used the new robinson_minimum_prob_strength option, and then the
best value appeared to creep up to 0.60 for him.

> What does this mean?

I don't know what this number means.  Paul said his outputs were
probabilities, but that wasn't so.  Gary has been suitably careful not to
claim anything about what his outputs "mean", beyond that

    the product of the probabilities is monotonic with the Fisher
    inverse chi-square combined probability technique from meta-analysis

The Gary-scores certainly don't act like probabilities either, although
we're seeing over and over that once you know the best cutoff point, a
significant number of the false positives and false negatives score within a
small distance of it.
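
For the curious, here's a rough sketch of the combining technique Gary is
referring to (an illustration of Fisher's method, not code from our
classifier; the names are mine):

    import math

    def chi2q(x2, df):
        # Survival function of the chi-square distribution for an even
        # number of degrees of freedom:  the probability that a
        # chi-square variable with df degrees of freedom exceeds x2.
        assert df % 2 == 0
        m = x2 / 2.0
        term = math.exp(-m)
        total = term
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def fisher_combine(probs):
        # Fisher's method:  -2 * sum(ln p) follows a chi-square
        # distribution with 2*len(probs) degrees of freedom when the
        # p's are independent and uniform, and it's monotonic with the
        # product of the p's.
        x2 = -2.0 * sum([math.log(p) for p in probs])
        return chi2q(x2, 2 * len(probs))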

> That the spam bell curve is narrower than the ham bell curve?

Become friends with the score histograms!  They were useless with the
Graham-like scheme, because the ham histogram approximated a solid bar at
0.0, and the spam histogram a solid bar at 1.0.  Under Gary's scheme, they
look normally distributed, but with relatively long lopsided tails dribbling
toward each other.  Here's the non-zero slice of my ham histogram:

 20.00   26 *
 22.50  155 **
 25.00  627 ********
 27.50 1859 **********************
 30.00 3780 ********************************************
 32.50 5108 ************************************************************
 35.00 4264 **************************************************
 37.50 2450 *****************************
 40.00 1056 *************
 42.50  395 *****
 45.00  178 ***
 47.50   52 *
 50.00   30 *
 52.50   13 *
 55.00    4 *
 57.50    1 *
 60.00    1 *
 62.50    1 *

By inspection, the mean is clearly close to 35.  For the spam,

 42.50    1 *
 45.00    0
 47.50    0
 50.00    3 *
 52.50    6 *
 55.00   17 *
 57.50   40 *
 60.00   76 **
 62.50  171 ****
 65.00  394 ********
 67.50  710 ***************
 70.00 1247 *************************
 72.50 2358 ************************************************
 75.00 2986 ************************************************************
 77.50 2659 ******************************************************
 80.00 1957 ****************************************
 82.50 1069 **********************
 85.00  192 ****
 87.50   31 *
 90.00   61 **
 92.50   22 *

The mean is more like 77 there.  The ham distribution actually looks tighter
than the spam distribution, but the mean of the spam is about 25 points
above 50 while the mean of the ham is only about 15 points below 50.  I
expect it's this lopsidedness (wrt the midpoint) that makes a cutoff above
50 more suitable.
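
If you want to produce the same kind of display from your own scores, a
quick sketch along these lines will do it (the bucket width and star
scaling are roughly what the runs above happen to use):

    def print_histogram(scores, bucket_width=2.5, max_stars=60):
        # scores are assumed to be on a 0-100 scale (i.e. score * 100);
        # every non-empty bucket gets at least one star, and the
        # fullest bucket gets max_stars of them.
        buckets = {}
        for s in scores:
            lo = int(s / bucket_width) * bucket_width
            buckets[lo] = buckets.get(lo, 0) + 1
        biggest = max(buckets.values())
        for lo in sorted(buckets.keys()):
            n = buckets[lo]
            stars = max(1, int(round(n * float(max_stars) / biggest)))
            print("%6.2f %5d %s" % (lo, n, "*" * stars))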

An observed effect of setting robinson_minimum_prob_strength is to increase
the separation of the ham and spam means:  the ham mean gets lower and the
spam mean gets higher.  This is what I expected, since, unlike in Graham's
scheme, scoring words with neutral probability in Gary's scheme
drags a score closer to 0.5.  Now "drags" sounds pejorative, because that's
the way I feel about it -- I see no value in scoring neutral words at all in
this task.  Gary disagrees, but allows that it's more of a "purist" issue
than a pragmatic one.  However, something we agree 100% on is that measuring
the effects of *principled* changes gets much harder if pragmatic hacks
muddy the mathematical basis of a scheme.  If Gary's scheme proves to be as
good as, but no better than, our current scheme, I'd still switch to it for
this reason:  it has far fewer "mystery knobs" to confuse the underlying
issues.
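
To make the option concrete, here's roughly the effect it has during
scoring (a sketch, not the classifier code verbatim; word_probs stands
for whatever spamprobs have been looked up for a message's words, and the
0.1 here is just an example value):

    def interesting_probs(word_probs, min_strength=0.1):
        # With robinson_minimum_prob_strength set, words whose spamprob
        # is within min_strength of the neutral 0.5 are simply skipped;
        # only the remaining probabilities feed the combining step.
        return [p for p in word_probs if abs(p - 0.5) >= min_strength]

With a strength of 0.0 everything counts; raising it throws out the
near-neutral words that would otherwise drag the combined score toward
0.5.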

> (Hm, have you computed mean and standard deviation?)

Nope.  What would you do with them if I did (they're easy enough to compute
and display if there's a point to it)?  You can get an excellent feel for
them by looking at the histograms (which reveal far more than a pair of
(mean, sdev) numbers anyway).
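
For the record, if anyone does want the numbers, it's a couple of lines
over the raw score lists; something like:

    import math

    def mean_sdev(scores):
        # Plain population mean and standard deviation of a score list.
        n = len(scores)
        mean = sum(scores) / float(n)
        var = sum([(s - mean) ** 2 for s in scores]) / float(n)
        return mean, math.sqrt(var)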