[Spambayes] There Can Be Only One

Greg Ward gward@python.net
Thu, 26 Sep 2002 09:54:38 -0400


On 25 September 2002, I said:
> On 25 September 2002, Tim Peters said:
> Yes, I've been running tests all afternoon and evening.  Vague,
> hand-wavey results:
> 
>   * my histograms are not terribly normal -- not as weird as Guido's,
>     but not nearly as nice as Tim's
>   * I think my peaks are better separated though -- there's a pretty
>     wide range for spam_cutoff
>   * I'm one of the few who seems to win by setting spam_cutoff < 0.5

OK, here's some more detail.  First, my corpus: I had to cobble together
a few sources of email in order to make it into the Big Leagues, i.e. have
2000 spam to play with.  Here are my sources:

  * python.org Sept 2002 harvest:
      1895 spam
      3821 normal ham
      1662 dsn
      5483 total ham

  * spam destined for gward@python.net, but detected by SpamAssassin
    and set aside by my .procmailrc on starship.python.net, from
    2002-02-11 to 2002-08-07 (i.e. from when I started using SA
    on starship until I replaced qmail with Exim).  That's 1580
    spams.

    There are two obvious artifacts of this spam collection: lots of
    messages have "To: gward@python.net", and all of them have
    "Received: ... by starship.python.net".  So I counterbalanced it
    with...

  * the contents of all my personal inboxes as of about noon yesterday
    -- 1333 hams.  These have the same two artifacts as the
    gward@python.net spam collection, which seems to have prevented
    eg. "To: gward@python.net" being a very good clue either way.
    But since they were sent on from starship to my various ISP
    accounts (3 or 4 variations over the time that this mail has
    been received), they were covered with "Received" headers
    giving that away.  So I removed all "Received" headers up to
    the "Received: ... by starship.python.net" one (see the sketch
    just after this list), which seems to have done the trick.
    There are a few messages that were sent
    straight to one of my ISP addresses, but not many since I never
    use them publicly.
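
In case anyone wants to do the same surgery on their own collection, the
stripping boils down to something like this -- a rough sketch using the
stdlib email package, not necessarily the exact script I ran:

  def strip_leading_received(msg, boundary_host="starship.python.net"):
      """Drop the Received headers added after the message left
      boundary_host, so onward forwarding to other accounts doesn't
      leave extra clues behind."""
      received = msg.get_all("Received") or []
      keep_from = 0
      for i, hdr in enumerate(received):
          if boundary_host in hdr:
              keep_from = i      # topmost header mentioning starship
              break
      del msg["Received"]        # removes *every* Received header
      for hdr in received[keep_from:]:
          msg["Received"] = hdr  # re-add from the starship hop on down
      return msg

(Caveat: this re-adds the surviving Received headers at the end of the
header block rather than in their original position, which shouldn't
matter for tokenizing purposes.)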

The biggest artifact remaining is that all of the gward-spam was
received by qmail, whereas most of the other mail was received by Exim.
This leaks into the "best discriminators" list:

        'received:0000' 87 0.655303
        'received:HELO' 96 0.725595
        'received:unknown' 96 0.746786

Almost every initial "Received" line in my gward-spam collection looks
like

  Received: from unknown (HELO yahoo.com) (218.232.230.20)
    by starship.python.net with SMTP; 28 Apr 2002 19:12:41 -0000

If Exim had received the same message, it would slap on something like
this instead:

  Received: from [218.232.230.20] (helo=yahoo.com)
          by starship.python.net with smtp (Exim 4.05)
          id 17sNGM-0004Ch-00
          for gward@python.net; Sun 28 Apr 2002 15:12:41 -0400

so the above three clues mean that "qmail receives more spam than ham"
-- because 100% (1580 msgs) of gward-spam is spam received by qmail, and
1077/1333 messages in gward-ham were received by qmail.

The flip side of the coin:

        'received:helo' 133 0.38154
        'received:for' 163 0.393567
        'received:4.05' 166 0.389087
        'received:0400' 167 0.397625
        'received:esmtp' 177 0.163725
        'received:Exim' 183 0.364903

mean "Exim receives a bit more ham than spam" -- unless the protocol is
ESMTP, which is a fairly good ham indicator.  That probably *is* a valid
clue, rather than a qmail-vs-Exim artifact.  Interesting.
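
Just to convince myself those numbers are in the right ballpark, here's a
back-of-the-envelope check: plug the whole-corpus counts from above into
the basic count-ratio probability.  (Illustrative only -- the real values
come from the per-fold training counts and get Robinson's smoothing on
top, but the ballpark is the same.)

  # Rough sanity check on the 'received:HELO' / 'received:unknown' probs.
  spam_with_clue, nspam = 1580, 1895 + 1580   # qmail-received spam / all spam
  ham_with_clue, nham   = 1077, 5483 + 1333   # qmail-received ham / all ham

  spam_ratio = spam_with_clue / float(nspam)  # ~0.45
  ham_ratio  = ham_with_clue / float(nham)    # ~0.16
  prob = spam_ratio / (spam_ratio + ham_ratio)
  print(round(prob, 3))                       # ~0.742

which lands right in the 0.72-0.75 neighbourhood of the clues above.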

Anyways, that little exploration has me wondering just how valid my data
is.  I should probably rerun everything without looking at "Received"
headers at all (except to count them -- for the most part, they stop at
either mail.python.org or starship.python.net, which are the front-line
servers for these two collections).
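
If I do go that route, the only Received-derived evidence left is a count,
which could be boiled down to a single coarse token -- something like this
(the token name is made up, purely to illustrate the idea):

  def received_count_token(msg):
      """Reduce the Received headers to one coarse count bucket instead
      of mining their contents for clues."""
      n = len(msg.get_all("Received") or [])
      return "received-count:%d" % min(n, 5)   # 0-4, with 5 meaning "5 or more"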

Right, on to the results.  First, I'm a little unclear on how everyone
else has been generating the results they've been posting.  I did this:

  [...several test runs...]
  timcv.py -n10 --ham=200 --spam=200 -s54321 > timcv-run5.log
  [...tweak .ini file a couple of times...]
  timcv.py -n10 --ham=200 --spam=200 -s54321 > timcv-run7.log

run5 is Graham, run7 is f(w) with spam_cutoff=0.475 (based on the
results of run6, which I'm not showing here).
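
(For anyone reproducing this: the spam_cutoff tweak goes in the
customization .ini file that the test drivers read.  Something along these
lines -- the section name here is from memory, so check Options.py in your
checkout:

  [TestDriver]
  spam_cutoff: 0.475

plus whatever [Classifier] switch turns on Robinson's f(w) in your
version.)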

Then I ran
  rates.py timcv-run5.log
  rates.py timcv-run7.log

and then
  cmp.py timcv-run5.logs.txt timcv-run7.logs.txt

which gives me this:

"""
timcv-run5.logs.txt -> timcv-run7.logs.txt
[...]

false positive percentages
    0.000  0.500  lost  +(was 0)
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          
    1.000  0.000  won   -100.00%
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.000  won   -100.00%
    1.000  1.000  tied          

won   2 times
tied  7 times
lost  1 times

total unique fp went from 6 to 4 won    -33.33%
mean fp % went from 0.3 to 0.2 won    -33.33%

false negative percentages
    2.000  2.000  tied          
    1.500  2.500  lost   +66.67%
    1.000  2.000  lost  +100.00%
    1.500  2.000  lost   +33.33%
    1.000  0.500  won    -50.00%
    0.500  1.500  lost  +200.00%
    1.000  1.000  tied          
    2.000  1.500  won    -25.00%
    1.000  1.000  tied          
    1.000  1.500  lost   +50.00%

won   2 times
tied  3 times
lost  5 times

total unique fn went from 25 to 31 lost   +24.00%
mean fn % went from 1.25 to 1.55 lost   +24.00%

ham mean                     ham sdev
   0.23   20.68 +8891.30%        3.20    8.87 +177.19%
   0.00   20.99 +(was 0)        0.00    9.40 +(was 0)
   0.00   20.02 +(was 0)        0.00    8.17 +(was 0)
   0.50   21.70 +4240.00%        7.03    9.84  +39.97%
   0.00   20.59 +(was 0)        0.00    8.91 +(was 0)
   0.99   20.70 +1990.91%        9.89   10.06   +1.72%
   0.00   19.02 +(was 0)        0.00    8.47 +(was 0)
   0.00   19.41 +(was 0)        0.00    8.65 +(was 0)
   0.50   18.99 +3698.00%        7.05    7.96  +12.91%
   1.00   21.73 +2073.00%        9.95   10.64   +6.93%

ham mean and sdev for all runs
   0.32   20.38 +6268.75%        5.55    9.18  +65.41%

spam mean                    spam sdev
  98.41   80.22  -18.48%       11.97   11.30   -5.60%
  98.48   79.66  -19.11%       12.15   11.51   -5.27%
  99.00   79.41  -19.79%        9.95   12.24  +23.02%
  98.50   79.70  -19.09%       12.16   11.44   -5.92%
  99.00   79.93  -19.26%        9.95   11.49  +15.48%
  99.50   79.48  -20.12%        7.05   10.94  +55.18%
  99.00   80.18  -19.01%        9.95   11.01  +10.65%
  98.00   78.66  -19.73%       14.00   11.04  -21.14%
  99.00   78.52  -20.69%        9.95   11.55  +16.08%
  99.00   79.80  -19.39%        9.95   11.25  +13.07%

spam mean and sdev for all runs
  98.79   79.56  -19.47%       10.87   11.40   +4.88%

ham/spam mean difference: 98.47 59.18 -39.29
"""

Here are the histograms for run5 (Graham):

-> <stat> Ham scores for all runs: 2000 items; mean 0.32; sdev 5.55
* = 34 items
  0.00 1993 ***********************************************************
  2.50    0 
  5.00    0 
  7.50    0 
 10.00    0 
 12.50    0 
 15.00    0 
 17.50    0 
 20.00    0 
 22.50    0 
 25.00    0 
 27.50    0 
 30.00    0 
 32.50    0 
 35.00    0 
 37.50    0 
 40.00    0 
 42.50    0 
 45.00    1 *
 47.50    0 
 50.00    0 
 52.50    0 
 55.00    0 
 57.50    0 
 60.00    0 
 62.50    0 
 65.00    0 
 67.50    0 
 70.00    0 
 72.50    0 
 75.00    0 
 77.50    0 
 80.00    0 
 82.50    0 
 85.00    0 
 87.50    0 
 90.00    0 
 92.50    0 
 95.00    0 
 97.50    6 *

-> <stat> Spam scores for all runs: 2000 items; mean 98.79; sdev 10.87
* = 33 items
  0.00   23 *
  2.50    0 
  5.00    1 *
  7.50    0 
 10.00    0 
 12.50    0 
 15.00    0 
 17.50    0 
 20.00    0 
 22.50    0 
 25.00    0 
 27.50    0 
 30.00    0 
 32.50    0 
 35.00    0 
 37.50    0 
 40.00    0 
 42.50    0 
 45.00    0 
 47.50    0 
 50.00    0 
 52.50    0 
 55.00    0 
 57.50    0 
 60.00    0 
 62.50    0 
 65.00    0 
 67.50    0 
 70.00    0 
 72.50    0 
 75.00    0 
 77.50    0 
 80.00    0 
 82.50    1 *
 85.00    0 
 87.50    0 
 90.00    1 *
 92.50    0 
 95.00    1 *
 97.50 1973 ************************************************************

and for run7 (Robinson f(w)):

-> <stat> Ham scores for all runs: 2000 items; mean 20.38; sdev 9.18
* = 5 items
  0.00   0 
  2.50  41 *********
  5.00  96 ********************
  7.50 144 *****************************
 10.00 148 ******************************
 12.50 146 ******************************
 15.00 206 ******************************************
 17.50 202 *****************************************
 20.00 253 ***************************************************
 22.50 212 *******************************************
 25.00 173 ***********************************
 27.50 124 *************************
 30.00  77 ****************
 32.50  51 ***********
 35.00  41 *********
 37.50  26 ******
 40.00  13 ***
 42.50  13 ***
 45.00  11 ***
 47.50  11 ***
 50.00   4 *
 52.50   4 *
 55.00   1 *
 57.50   3 *
 60.00   0 
 62.50   0 
 65.00   0 
 67.50   0 
 70.00   0 
 72.50   0 
 75.00   0 
 77.50   0 
 80.00   0 
 82.50   0 
 85.00   0 
 87.50   0 
 90.00   0 
 92.50   0 
 95.00   0 
 97.50   0 

-> <stat> Spam scores for all runs: 2000 items; mean 79.56; sdev 11.40
* = 3 items
  0.00   0 
  2.50   0 
  5.00   0 
  7.50   0 
 10.00   0 
 12.50   0 
 15.00   0 
 17.50   0 
 20.00   0 
 22.50   0 
 25.00   0 
 27.50   0 
 30.00   0 
 32.50   0 
 35.00   0 
 37.50   2 *
 40.00   1 *
 42.50   5 **
 45.00   4 **
 47.50   6 **
 50.00   6 **
 52.50   7 ***
 55.00  18 ******
 57.50  41 **************
 60.00  47 ****************
 62.50  68 ***********************
 65.00 105 ***********************************
 67.50 115 ***************************************
 70.00 160 ******************************************************
 72.50 141 ***********************************************
 75.00 149 **************************************************
 77.50 128 *******************************************
 80.00 118 ****************************************
 82.50 154 ****************************************************
 85.00 159 *****************************************************
 87.50 171 *********************************************************
 90.00 119 ****************************************
 92.50  85 *****************************
 95.00  75 *************************
 97.50 116 ***************************************
-> best cutoff for all runs: 0.5
->     with 12 fp + 18 fn = 30 mistakes

Oops, just noticed the "best cutoff for all runs" thing.  I must have
misinterpreted the run6 output -- picking 0.475 was an eyeball average.
D'ohh.
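
(For the record, that "best cutoff" line is just the cutoff that minimizes
total mistakes over all the scores.  The driver computes it from its
histograms, I believe, but the idea is a plain sweep -- roughly:

  def best_cutoff(ham_scores, spam_scores):
      """Return the cutoff with the fewest fp + fn over these scores.
      Illustrative only -- not the TestDriver code."""
      best_mistakes, best_cut = len(ham_scores) + len(spam_scores) + 1, None
      for cut in sorted(set(ham_scores) | set(spam_scores)):
          fp = sum(1 for s in ham_scores if s >= cut)   # ham called spam
          fn = sum(1 for s in spam_scores if s < cut)   # spam called ham
          if fp + fn < best_mistakes:
              best_mistakes, best_cut = fp + fn, cut
      return best_cut, best_mistakes

which is what the "12 fp + 18 fn = 30 mistakes" line is reporting at 0.5.)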

Off to re-run things, without mine_received and with a better
spam_cutoff.

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
A closed mouth gathers no foot.