[Spambayes] RE: spam detection via probability - actual results!

Tim Peters tim.one@comcast.net
Thu, 19 Sep 2002 23:40:27 -0400


I have an interesting result from something that shouldn't have been tried
(ahem -- but I had to take a nap, and wanted to run something while I
snoozed that took no more setup than a one-line change from a tired brain).

In the lingo of our codebase, this compares

[Classifier]
use_robinson_probability: True
[TestDriver]
spam_cutoff: 0.50

to

[Classifier]
use_robinson_probability: True
max_discriminators: 150
[TestDriver]
spam_cutoff: 0.50

That is, it's Gary's combining scheme but looking at the top 150
discriminators (MAX_DISCRIMINATORS in classifier.py, 16 by default).  It's
limited to 150 to guarantee that nothing can silently underflow to 0.0 in
the combining code exactly as it is; it would be more interesting to rework
the code so that no bound were needed (and Gary has special reasons for
wanting us to try that, but it requires other changes first to give it a
trial under the intended preconditions).
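
For reference, here's a bare-bones sketch of the combining math as I
understand Gary's scheme (a toy version for illustration -- don't mistake
it for the real classifier.py code).  With every spamprob clamped into
[0.01, 0.99], a product of 150 factors can't sink below 0.01**150 ==
1e-300, which is comfortably above the point where a double silently
becomes 0.0 (around 5e-324) -- that's why 150 is a safe clamp:

    def robinson_score(probs):
        # probs: spamprobs of the top-n discriminators, each assumed
        # already clamped into [0.01, 0.99]; assumes n >= 1.
        n = len(probs)
        P = Q = 1.0
        for p in probs:
            P *= 1.0 - p            # running product of (1 - prob)
            Q *= p                  # running product of prob
        P = 1.0 - P ** (1.0 / n)    # near 1 when the clues look spammy
        Q = 1.0 - Q ** (1.0 / n)    # near 1 when the clues look hammy
        # (P - Q) / (P + Q) lies in [-1, 1]; rescale into [0, 1].
        return (1.0 + (P - Q) / (P + Q)) / 2.0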

Boosting MAX_DISCRIMINATORS was a disaster under Graham's combining scheme.
Neil Schemenauer bit the logarithm bullet on Yet Another Combining Scheme of
his own devising, and recently reported results much better than that
disaster (but still not as good as the all-default code clamped to 16).

The results here were surprising in several respects (note that this means I
was surprised, not necessarily that they're in any way surprising -- this is
quite possibly a statement more about me than the results <wink>).

First the summary results.  Boosting to 150 didn't hurt a bit, although the
deeper we get into this, the more surprising it is that it didn't hurt:

"""
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
   [ditto 19 times]

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.100  0.100  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 5 to 5 tied
mean fp % went from 0.025 to 0.025 tied

false negative percentages
    0.218  0.145  won    -33.49%
    0.364  0.145  won    -60.16%
    0.000  0.000  tied
    0.218  0.218  tied
    0.218  0.218  tied
    0.291  0.291  tied
    0.218  0.218  tied
    0.145  0.218  lost   +50.34%
    0.291  0.291  tied
    0.073  0.000  won   -100.00%

won   3 times
tied  6 times
lost  1 times

total unique fn went from 28 to 24 won    -14.29%
mean fn % went from 0.203636363636 to 0.174545454546 won    -14.29%
"""

The real surprise is in the score histograms, which look much more like
normal distributions now, and the means of the ham and spam distributions
are much closer together now.  It's the combination of those two things
with the fact that the results *didn't* go to hell that's a surprise to me,
taken altogether:

Ham distribution for all runs:
* = 87 items
  0.00   81 *
  2.50   52 *
  5.00   67 *
  7.50  112 **
 10.00  165 **
 12.50  257 ***
 15.00  424 *****
 17.50  934 ***********
 20.00 2358 ****************************
 22.50 4705 *******************************************************
 25.00 5166 ************************************************************
 27.50 3238 **************************************
 30.00 1515 ******************
 32.50  546 *******
 35.00  221 ***
 37.50   91 **
 40.00   38 *
 42.50   14 *
 45.00    7 *
 47.50    4 *
 50.00    1 *
 52.50    0
 55.00    3 *
 57.50    0
 60.00    0
 62.50    1 *

 65.00    0
 67.50    0
 70.00    0
 72.50    0
 75.00    0
 77.50    0
 80.00    0
 82.50    0
 85.00    0
 87.50    0
 90.00    0
 92.50    0
 95.00    0
 97.50    0

That's another surprise:  the highest-scoring ham is the fellow who added a
useless one-line comment to a quote of an entire Nigerian scam message.
That this scores *only* 0.625 is frightening.  Or encouraging, depending on
how you look at it <wink>.  It's getting actual benefit from containing
seemingly vanilla words like 'because' -- but so does a *real* Nigerian scam
now.  The few low-prob clues that this was sent by a real person are giving
it a *lot* of benefit:

prob('jeez,') = 0.01
prob('email addr:engineer.com') = 0.01
prob('header:X-Complaints-To:1') = 0.01
prob('wrote') = 0.01
prob('header:Organization:1') = 0.01
prob('flynn') = 0.01
prob('header:Errors-To:1') = 0.0194136

Spam distribution for all runs:
* = 37 items
  0.00    0
  2.50    0
  5.00    0
  7.50    0
 10.00    0
 12.50    0
 15.00    0
 17.50    0
 20.00    1 *
 22.50    0
 25.00    0
 27.50    0
 30.00    0
 32.50    0
 35.00    0
 37.50    0

Note that only one spam scored below 0.40.

 40.00    3 *
 42.50    5 *
 45.00    8 *
 47.50    7 *
 50.00   15 *
 52.50   32 *
 55.00   68 **
 57.50  127 ****
 60.00  291 ********
 62.50  542 ***************
 65.00 1035 ****************************
 67.50 1910 ****************************************************
 70.00 2180 ***********************************************************
 72.50 1926 *****************************************************
 75.00 1730 ***********************************************
 77.50 1278 ***********************************
 80.00  651 ******************
 82.50  302 *********
 85.00  349 **********
 87.50  189 ******
 90.00  140 ****
 92.50  158 *****
 95.00  238 *******
 97.50  565 ****************

Something without an explanation:  Gary had a report from someone else who
tried his combining scheme without bounding the number of words.  It did
substantially worse than when clamping to 15.  I know 150 isn't unbounded,
but I'm guessing there's not a heck of a lot of difference between 150 and
infinity here.  One side effect of boosting to 150 is that the list of top
discriminators becomes pretty much worthless; here are the top 10:

        'you' 16195 0.5
        'and' 17498 0.5
        'header:Errors-To:1' 18051 0.0200348
        'x-mailer:none' 19979 0.389823
        'the' 22084 0.5
        'header:Message-ID:1' 24311 0.364374
        'header:From:1' 26354 0.496583
        'header:Date:1' 26402 0.472034
        'header:To:1' 26463 0.489375
        'header:Subject:1' 26464 0.495639

IOW, in 26,464 message scorings, the mere presence of a Subject line was
one of the 150 strongest discriminators.  This tells me that lots of times
we don't even have 150 distinct words to look at.  Lots of times we were
even reduced to looking at the 100% neutral "the", "and" and "you".

It's possible that the fellow who generated Gary's other result in this
direction wasn't aware of the potential underflow problems, and so got lots
of nonsense scores (the code is such that underflowing to 0 won't raise a
later exception; underflow can happen in either or both of P and Q; if they
both underflow, the score coming out in the end would be 0.5).
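
Here's that failure mode made concrete (same toy version as in the sketch
above, not the real code):  push enough extreme clues through the raw
products, both silently hit 0.0, and the final score pins at a meaningless
0.5 with no exception anywhere:

    probs = [0.01, 0.99] * 200  # 400 clues at the two clamped extremes
    P = Q = 1.0
    for p in probs:
        P *= 1.0 - p
        Q *= p
    print(P, Q)                 # 0.0 0.0 -- both underflowed, no error
    n = len(probs)
    P = 1.0 - P ** (1.0 / n)    # 1.0 - 0.0**(1/n) == 1.0 exactly
    Q = 1.0 - Q ** (1.0 / n)    # likewise exactly 1.0
    print((1.0 + (P - Q) / (P + Q)) / 2.0)   # 0.5 -- pure nonsense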


Now a question for Gary (hope you're still here <wink>), to help me
understand what's needed to do this right:  what, exactly, does it mean to
require that the spam probabilities be uniformly distributed?

Concrete and relevant example:  suppose I were to take the spamprobs exactly
as they are now, and merely round them to two significant decimal digits.
Then there would be exactly 99 distinct spamprobs in the system, uniformly
distributed in .01 through .99.  Is that all it takes to meet the formal
precondition?
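
In code, that proposal is just this (a hypothetical helper, purely to pin
down what I mean by the rounding):

    def coarsen(p):
        # Round a spamprob to two decimal digits, clamped to [0.01, 0.99],
        # leaving only the 99 evenly spaced values 0.01, 0.02, ..., 0.99.
        return min(0.99, max(0.01, round(p, 2)))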

If I normalize the existing probabilities instead based on rank (which I'm
happy to do), I have tens of thousands of words all with spamprob .01 now,
and also with spamprob .99 now.  Based on rank, then, assigning all ties to
a probability based on the median rank in an all-equal range would *still*
end up giving tens of thousands of words the same probabilities in the end.
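
Here's a sketch of the rank-based business I have in mind (a hypothetical
helper, not anything in the codebase); note how a run of tied words still
collapses to a single shared probability:

    def rank_normalize(spamprobs):
        # spamprobs: dict mapping word -> current spamprob.  Returns
        # word -> prob in (0, 1) based on rank, with every word in an
        # all-equal run assigned the run's median rank.
        ranked = sorted(spamprobs.items(), key=lambda pair: pair[1])
        n = len(ranked)
        result = {}
        i = 0
        while i < n:
            j = i
            while j < n and ranked[j][1] == ranked[i][1]:
                j += 1                       # find the run of equal probs
            median_rank = (i + j - 1) / 2.0  # median rank of the tied run
            prob = (median_rank + 1.0) / (n + 1.0)
            for k in range(i, j):
                result[ranked[k][0]] = prob
            i = j
        return result

So 10,000 words tied at .01 would end up sharing one probability near the
bottom of the scale -- which is exactly the "*still* the same
probabilities" worry above.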

So if simply rounding to two digits wouldn't satisfy the intended meaning of
"uniformly distributed", I suspect that doing the rank-based business
wouldn't actually satisfy it either.

Maybe the bottom-line question here is whether, to give "all words" a fair
trial, I also need to change the way *initial* spamprob values are
computed.