[Spambayes] Testing methodology

Tim Peters tim@zope.com
Fri, 13 Sep 2002 21:53:27 -0400


[double heh!  now python.org blacklisted my email address for "sending
 spam", so the second attempt also bounced -- third time's the charm?]

[heh -- python.org rejected this mail because of a spam example
 buried near the bottom; you'll have to dig it out of the attached
 zip file if you want to see it]


[Tim]
> With contempt for compatibility <wink>, I'd like to switch to using k-
> fold cross validation for evaluating test results. ...

I've checked in changes for that now, and I hope I *didn't* lose
compatibility in mboxtest.py.  If I did, it's shallow, provided you know how
every stick of this works <wink>.  If I screwed up there, scream and I'll
help fix it.

Note that the format of the statistics lines printed by Driver() has
changed, and rates.py and cmp.py changed accordingly.  See the checkin
comments for more.

I haven't changed my directories around yet to exploit the new c-v code.
Instead I'm running a 5-fold c-v (timcv.py -n5) on my current setup as an
experiment now.  That means it's training on 16000 ham and about 11000 spam,
then predicting against 4000 disjoint ham and 2750 disjoint spam.  Repeat 5
times.  Note that this should actually run substantially faster than my old
5x5 grid, for reasons explained last time around (and, indeed, 60% of the
full run has completed since I started typing this).
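For concreteness, here's a minimal sketch of the 5-fold splitting scheme just
described (my illustration, not the actual timcv.py code):  each corpus is cut
into 5 disjoint buckets, and on fold i you train on the other 4 buckets and
predict against bucket i.

```python
# Sketch of 5-fold cross-validation splitting (hypothetical helper,
# not timcv.py's real implementation).
def kfold_splits(msgs, k=5):
    """Yield (train, test) pairs; bucket i is the held-out test set on fold i."""
    buckets = [msgs[i::k] for i in range(k)]
    for i in range(k):
        test = buckets[i]
        train = [m for j, b in enumerate(buckets) if j != i for m in b]
        yield train, test

# With 20000 ham total, each fold trains on 16000 and predicts 4000,
# exactly as in the run described above.
ham = list(range(20000))
for train, test in kfold_splits(ham):
    assert len(train) == 16000 and len(test) == 4000
    assert not set(train) & set(test)   # train and test are disjoint
```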

The first thing I note is good news:  despite training on 4x as many msgs,
a pickle of the classifier grows only by a factor of 2.4.  That
means we're getting a lot of words in common.  But it also means we're
getting a lot of unique words, which was thoroughly predictable and is why
GrahamBayes has had a clearjunk() method since its first day (albeit that
I've *still* never tried it <wink>).
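The idea behind that kind of pruning, sketched generically (this is my
illustration of the concept, not a description of what clearjunk() actually
does):  drop tokens whose total count across ham and spam falls below a
cutoff, since hapax words bloat the database without helping much.

```python
# Generic rare-word pruning sketch; the dict shape here is hypothetical.
def prune(wordinfo, min_count=2):
    """Keep only tokens seen at least min_count times in total."""
    return {w: info for w, info in wordinfo.items()
            if info["ham"] + info["spam"] >= min_count}

db = {"python": {"ham": 40, "spam": 1},    # common word: kept
      "x9f3qz": {"ham": 0, "spam": 1}}     # one-shot junk token: dropped
assert "python" in prune(db)
assert "x9f3qz" not in prune(db)
```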

16K ham + 11K spam is a huge amount of training data.  The point of c-v is
that we should be able to get good results with much less.  However, as I'm
watching this run, it occurs to me that if the number of msgs predicted is
N, no scheme can possibly measure an error rate smaller than 1/N.  Grumble.
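Concretely:  with 4000 ham predicted per fold, a single mistake is already
1/4000 = 0.025%, which is exactly the granularity of the fp percentages in
the stats.

```python
# The smallest nonzero error rate a fold can report is one mistake out of N.
n_ham = 4000
print(100.0 / n_ham)   # 0.025 -- the fp% quantum for these runs
```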
This wasn't a problem in speech recognition because there was no chance we'd
get error rates as low as this guy is getting ...

And the run is done!  Here are the pickle sizes, each representing about
27000 msgs, or about 600 bytes/msg:

16469934 class1.pik
16521522 class2.pik
16362901 class3.pik
16289795 class4.pik
16465748 class5.pik

The variance is very low.
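A quick check of those figures (sizes copied from above):

```python
# Verify the ~600 bytes/msg figure and the low spread across pickles.
sizes = [16469934, 16521522, 16362901, 16289795, 16465748]
mean = sum(sizes) / len(sizes)
print(round(mean / 27000))              # about 608 bytes per message
spread = (max(sizes) - min(sizes)) / mean
print(round(100 * spread, 1))           # max-min spread is about 1.4%
```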

Test results:

-> <stat> tested 4000 hams & 2750 spams against 16000 hams & 11002 spams
-> <stat> false positive %: 0.025
-> <stat> false negative %: 0.327272727273
-> <stat> 1 new false positives
-> <stat> 9 new false negatives

-> <stat> tested 4000 hams & 2750 spams against 16000 hams & 11002 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 0.254545454545
-> <stat> 0 new false positives
-> <stat> 7 new false negatives

-> <stat> tested 4000 hams & 2750 spams against 16000 hams & 11002 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 0.181818181818
-> <stat> 0 new false positives
-> <stat> 5 new false negatives

-> <stat> tested 4000 hams & 2751 spams against 16000 hams & 11001 spams
-> <stat> false positive %: 0.025
-> <stat> false negative %: 0.181752090149
-> <stat> 1 new false positives
-> <stat> 5 new false negatives

-> <stat> tested 4000 hams & 2751 spams against 16000 hams & 11001 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 0.436205016358
-> <stat> 0 new false positives
-> <stat> 12 new false negatives

total unique false pos 2
total unique false neg 38

average fp % 0.01
average fn % 0.276318694029

So that's 2 fp out of 20000 ham, and 38 fn out of 13752 spam.  The reason
these are better than "my usual" results is almost certainly just because
there's 4x as much training data on each run.
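The summary lines can be recomputed from the per-fold counts above (assuming,
as the output suggests, that the averages are means of the per-run
percentages rather than pooled counts):

```python
# Per-fold false positive/negative counts and corpus sizes, from the stats above.
fps   = [1, 0, 0, 1, 0]
fns   = [9, 7, 5, 5, 12]
hams  = [4000, 4000, 4000, 4000, 4000]
spams = [2750, 2750, 2750, 2751, 2751]

avg_fp = sum(100.0 * f / h for f, h in zip(fps, hams)) / len(fps)
avg_fn = sum(100.0 * f / s for f, s in zip(fns, spams)) / len(fns)
print(sum(fps), sum(fns))       # 2 38  -- the unique fp/fn totals
print(round(avg_fp, 2))         # 0.01
print(round(avg_fn, 6))         # 0.276319
```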

One of the fps is the fellow who quoted the entire Nigerian scam msg.  The
other is poor Vickie Mills, still trying to find a Python training course in
the UK <wink>, and still getting killed by her employer's obnoxious sig.
That's all of 'em!

At least one of the f-n's is a ham by anyone's definition:  it's a technical
discussion of difficulties with Unicode, sent to the Saskatoon Linux Group
Mailing List.  BruceG may not have wanted to read it, and I know I sure
didn't <wink>, but it's straight nerd talk and not trying to sell anything.
Another f-n is output from a cron job, with subject line

    Subject: Cron <bruce@lorien> run-parts cron/hourly

and indeed looks like email he arranged to send to himself, summarizing
email statistics on a spam collection address!  I'm going to take both of
these out of the spam set.

As always, some of the f-n's remain incredibly lame -- you look at these and
marvel that it catches *any* spam (let alone more than 99.5% of it).  Just one
example to make your day <wink>:  [dig it out of the attached zip file]

Finally, aggregate distribution histograms for this test:

Ham distribution for all runs:
* = 334 items
  0.00 19995 ************************************************************
  2.50     0
  5.00     1 *
  7.50     0
 10.00     0
 12.50     0
 15.00     0
 17.50     0
 20.00     0
 22.50     0
 25.00     0
 27.50     0
 30.00     0
 32.50     0
 35.00     1 *
 37.50     0
 40.00     0
 42.50     0
 45.00     0
 47.50     0
 50.00     0
 52.50     0
 55.00     0
 57.50     0
 60.00     0
 62.50     0
 65.00     0
 67.50     0
 70.00     0
 72.50     0
 75.00     1 *
 77.50     0
 80.00     0
 82.50     0
 85.00     0
 87.50     0
 90.00     0
 92.50     0
 95.00     0
 97.50     2 *

Spam distribution for all runs:
* = 229 items
  0.00    31 *
  2.50     0
  5.00     0
  7.50     1 *
 10.00     0
 12.50     0
 15.00     1 *
 17.50     0
 20.00     0
 22.50     0
 25.00     0
 27.50     0
 30.00     0
 32.50     0
 35.00     0
 37.50     0
 40.00     0
 42.50     0
 45.00     1 *
 47.50     1 *
 50.00     1 *
 52.50     1 *
 55.00     0
 57.50     0
 60.00     0
 62.50     0
 65.00     0
 67.50     0
 70.00     0
 72.50     0
 75.00     0
 77.50     0
 80.00     0
 82.50     0
 85.00     0
 87.50     1 *
 90.00     1 *
 92.50    10 *
 95.00     1 *
 97.50 13702 ************************************************************

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: spam.zip
Type: application/x-zip-compressed
Size: 1974 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20020913/f7f1fee9/spam.bin
