[Spambayes] Moving closer to Gary's ideal

Tim Peters tim.one@comcast.net
Sun, 22 Sep 2002 05:15:51 -0400


[Guido]
> ...
> Ham distribution for all runs:
> * = 268 items
>   5.00     2 *
>   7.50     0
>  10.00     4 *
>  12.50    64 *
>  15.00   262 *
>  17.50   748 ***
>  20.00  1734 *******
>  22.50  3665 **************
>  25.00  7662 *****************************
>  27.50 12597 ************************************************
>  30.00 16039 ************************************************************
>  32.50 14821 ********************************************************
>  35.00 10542 ****************************************
>  37.50  6441 *************************
>  40.00  3593 **************
>  42.50  2073 ********
>  45.00  1177 *****
>  47.50   717 ***
>  50.00   421 **
>  52.50   197 *
>  55.00    96 *
>  57.50    60 *
>  60.00    16 *
>  62.50    10 *
>  65.00     3 *
>  67.50     1 *
>  70.00     1 *
>  72.50     2 *
>  75.00     8 *
>
> So to match the 40 fp's from Graham's scheme, I'd need to set the
> cutoff to 0.60; that would give me 41 fp's here (16+10+3+1+1+2+8).

Right!  BTW, that's a huge test run -- certainly the largest to date.
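
For anyone following along, the arithmetic there is just summing the
histogram buckets at and above the candidate cutoff.  A tiny sketch (not
part of the test driver; the bucket dict is hand-copied from the tail of
the ham histogram above):

    # Each key is a bucket's lower edge (as a percentage); each value is the
    # number of hams that landed in that bucket.
    tail = {60.0: 16, 62.5: 10, 65.0: 3, 67.5: 1, 70.0: 1, 72.5: 2, 75.0: 8}

    def fps_at_cutoff(buckets, cutoff):
        # Count every ham in a bucket whose lower edge is at or above the
        # cutoff -- to bucket resolution, those are the ones called spam.
        return sum(count for edge, count in buckets.items() if edge >= cutoff)

    print(fps_at_cutoff(tail, 60.0))   # 41, matching the count above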

> (If a message scores *exactly* the cutoff, is it spam or ham?)

From Options.py:

    # A message is considered spam iff it scores greater than spam_cutoff.

Since I wrote that, you can trust that "greater than" does not mean "greater
than or equal to" <wink>.

Especially under Gary's scheme, the odds of scoring exactly at a cutoff
point are vanishingly small.  It *used* to be that a msg the email package
couldn't parse scored exactly 0.50, but we fall back to the raw text now.
If the raw text is empty, it will still get exactly 0.50.
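
In code terms, the quoted comment boils down to a strict comparison.  A
minimal sketch (not the actual classifier code; 0.60 is just the cutoff
you're considering, not the project default):

    def is_spam(score, spam_cutoff=0.60):
        # Strictly greater than:  a score exactly at the cutoff is ham.
        return score > spam_cutoff

    assert not is_spam(0.50)   # an empty, unparseable msg scores exactly 0.50
    assert not is_spam(0.60)   # exactly at the cutoff -> ham
    assert is_spam(0.61)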

> Spam distribution for all runs:
> * = 124 items
>  37.50    1 *
>  40.00    0
>  42.50    0
>  45.00    4 *
>  47.50   12 *
>  50.00   22 *
>  52.50   20 *
>  55.00   70 *
>  57.50  153 **
>  60.00  346 ***
>  62.50  664 ******
>  65.00 1334 ***********
>  67.50 2503 *********************
>  70.00 4303 ***********************************
>  72.50 7136 **********************************************************
>  75.00 7385 ************************************************************
>  77.50 4918 ****************************************
>  80.00 2481 *********************
>  82.50  513 *****
>  85.00  116 *
>  87.50   73 *
>  90.00   14 *
>
> So with a cutoff of 0.60, this would give me 1+4+12+22+20+70+153 = 282
> fn's.  That's still considerably worse than Graham's 204.

Indeed it is.  Note that I spent an enormous amount of time tweaking an
enormous number of things to get supernatural results out of Graham's
scheme, though, and none of those things have been revisited under Gary's
scheme yet.  The first thing I thought to try is something that didn't make
sense under Graham:  leaving the close-to-0.5 guys out of the scoring, and
that's been reported to alleviate the very problem you're seeing (a higher
f-n rate).
There may well be a better way to get there, but we won't know until people
*try* other things.  In particular, the treatment of rare corpus words is
very different under the default "a" parameter of 1.0 in Gary's scheme than
it is under the default all-Graham scheme, and as explained recently in a
different msg, we already know for certain that the treatment of rare words
has a significant effect on error rates.
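
To make those two ideas concrete, here's a rough sketch of what I mean
(this is not the checked-in code, and the parameter names -- a, x,
min_dist -- are illustrative rather than the project's actual option names):

    def adjusted_prob(word_prob, word_count, a=1.0, x=0.5):
        # Gary-style smoothing:  pull a word's raw spamprob toward the
        # unknown-word prior x, with strength a.  With a == 1.0, a word seen
        # only once lands halfway between its raw probability and 0.5, which
        # is why rare words behave so differently here than under the
        # all-Graham defaults.
        return (a * x + word_count * word_prob) / (a + word_count)

    def scoring_probs(word_probs, min_dist=0.1):
        # "Leaving close-to-0.5 guys out of the scoring":  drop words whose
        # probability says nearly nothing either way.
        return [p for p in word_probs if abs(p - 0.5) >= min_dist]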

> I'm going to have to look at the fp's and fn's to see if there are
> real spams hiding in the ham, and vice versa.

If there aren't, you'll be the first tester ever not to discover some.  For
example, I've found 3 hams in BruceG's spam collection so far, and I believe
you're using that too (but much more of it).

BTW, it's my belief that this all works *best* if the ratio of ham to spam
trained on matches your real-life inbox ratio.  But it's a peculiar property
of the cross-fold validation framework that the ratio of ham to spam
predicted on exactly matches the ratio of ham to spam trained on, so
imbalance in ham-vs-spam numbers can't be blamed for poor *test* results
(although *both* your error rates are under 1%, and I doubt any project save
ours would call that "poor" <wink>).
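
A toy illustration of why that's so (not the test driver itself):  every
fold holds 1/k of the ham and 1/k of the spam, so each run trains on
(k-1)/k of both and predicts on 1/k of both, and the ratio comes out the
same either way.

    def fold_ratios(n_ham, n_spam, k):
        # Assumes both counts divide evenly by k, just to keep the point clear.
        train_ratio = (n_ham * (k - 1) / k) / (n_spam * (k - 1) / k)
        test_ratio = (n_ham / k) / (n_spam / k)
        return train_ratio, test_ratio

    print(fold_ratios(4000, 2750, 10))   # the two ratios are identical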

> I did notice that many fp's were very spammish automated postings
> that I have specifically signed up for, like our building's
> announcements, product newsletters, and so on.  I haven't looked at
> the fn's.

I expect these are your moral equivalents to the conference announcements in
my c.l.py ham, except worse.  However, I expect you have more cause for
optimism about those:  you (like me) are running a crippled version of the
algorithm because of your mixed-source corpora.  The headers we're ignoring
are bound to have strong clues about the *senders* of the spammish stuff
you've signed up for.

> I was interested in which of the two bell curves was "fatter".  The
> sdev tells me this.

Enough already <wink>.  Next time you do a cvs up, you'll get a nice
printout of total data points, mean, and sdev with each histogram.
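
In case you want the same numbers by hand, the summary line is nothing
fancier than this (a sketch, not the Histogram class itself):

    import math

    def summarize(scores):
        # Total data points, mean, and sample standard deviation.
        n = len(scores)
        mean = sum(scores) / n
        sdev = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
        return n, mean, sdev

    n, mean, sdev = summarize([0.31, 0.28, 0.35, 0.62, 0.29])
    print("%d items; mean %.3f; sdev %.3f" % (n, mean, sdev))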

> But I'm not sure that the relative fatness of the curves is a good
> measure

It depends on what you're trying to figure out, of course.

> -- it's the overlapping tails.

For "middle ground" purposes, absolutely.  Wherever they overlap, changing
spam_cutoff necessarily hurts one of the error rates and necessarily helps
the other.
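
If you want to see the whole trade-off instead of a single point, something
like this does it (a sketch; the score lists here are made up):

    def error_counts(ham_scores, spam_scores, cutoff):
        fp = sum(1 for s in ham_scores if s > cutoff)    # ham called spam
        fn = sum(1 for s in spam_scores if s <= cutoff)  # spam called ham
        return fp, fn

    ham = [0.30, 0.45, 0.62]
    spam = [0.48, 0.58, 0.75]
    for cutoff in (0.50, 0.55, 0.60, 0.65):
        print(cutoff, error_counts(ham, spam, cutoff))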

> I suppose there's a statistical measure for how "normal" a tail is,
> but I'm not sure that's relevant given that we can easily see the
> overlap in the histograms.

It would be nicer to display them side-by-side, but you'd have to learn how
to use more of your available screen area then <wink>.