[Spambayes] RE: For the bold

Tim Peters tim.one@comcast.net
Sat, 05 Oct 2002 19:32:11 -0400


[Rob Hooft]
> I am attaching my new version of clpik.py that implements my RMS
> Z-score ideas.

Cool!  I'm going to check this into the project, but under the name
rmspik.py.  People playing along:  you DO NOT need to rerun a test to try
this!  rmspik.py analyzes the binary pickle (clim.pik) left behind by
clgen.py (the central-limit analysis test driver), and very quickly (a
matter of seconds) determines exactly what would have happened had we used
Rob's RMS certainty rules instead.

> Some results I get are listed hereunder. I'm very interested to
> hear what other people get with this!

Here's a use_central_limit2 run with max_discriminators=50, trained on 5000
ham and 5000 spam, then predicting against 7500 of each:

-> <stat> Ham scores for all runs: 7500 items; mean 0.14; sdev 2.72
-> <stat> min 0; median 0; max 100
* = 123 items
  0 7480 *************************************************************
 25   18 *
 50    1 *
 75    1 *

-> <stat> Spam scores for all runs: 7500 items; mean 99.86; sdev 2.85
-> <stat> min 0; median 100; max 100
* = 123 items
  0    2 *
 25    1 *
 50   16 *
 75 7481 *************************************************************

Under rmspik,

Reading clim.pik ...
Nham= 7500
RmsZham= 2.27249107964
Nspam= 7500
RmsZspam= 2.354280998
======================================================================
HAM:
Sure/ok       7325
Unsure/ok     172
Unsure/not ok 3
Sure/not ok   0
Unsure rate = 2.33%
Sure fp rate = 0.00%; Unsure fp rate = 1.71%
======================================================================
SPAM:
FALSE NEGATIVE: zham=-2.39 zspam=-4.93 Data/Spam/Set7/99999.txt SURE!
Sure/ok       7422
Unsure/ok     75
Unsure/not ok 2
Sure/not ok   1
Unsure rate = 1.03%
Sure fn rate = 0.01%; Unsure fn rate = 2.60%
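The percentages in these reports follow directly from the four counts.  A sketch of the arithmetic (function name mine, not rmspik.py's):

```python
def report_rates(sure_ok, unsure_ok, unsure_bad, sure_bad):
    """Recover the rmspik-style percentages from the four raw counts."""
    total = sure_ok + unsure_ok + unsure_bad + sure_bad
    unsure = unsure_ok + unsure_bad
    sure = sure_ok + sure_bad
    unsure_rate = 100.0 * unsure / total      # fraction of msgs needing review
    sure_err = 100.0 * sure_bad / sure        # errors among "sure" decisions
    unsure_err = 100.0 * unsure_bad / unsure  # errors among "unsure" decisions
    return unsure_rate, sure_err, unsure_err

# The HAM block above (7325 / 172 / 3 / 0) gives 2.33%, 0.00%, 1.71%.
```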

So RMS was unsure much more often, and especially unsure about ham.  In the
end RMS had one more false positive (3 versus clim2's 2), but all 3 of its
fell in its region of uncertainty.  Both schemes had 3 false negatives, but
RMS had one fewer in its region of certainty.  The sole f-n it was certain about is also
one clim2 was certain about, and is a spam with a uuencoded body that we
don't decode.  This is a tradeoff in the tokenizer:  it simply doesn't
generate enough clues to nail this one (10 "words" total).  It's especially
embarrassing because the subject line is

    Subject: HOW TO BECOME A MILLIONAIRE IN WEEKS!!

Sheesh <wink>.

BTW, for python.org use, an uncertainty rate over 2% may not fly -- Greg
already gripes about reviewing a trivial number of msgs each day.


Now all over again, but with use_central_limit3; max_discriminators still
50, and same sets of msgs trained on and predicted against:

-> <stat> Ham scores for all runs: 7500 items; mean 0.05; sdev 1.61
-> <stat> min 0; median 0; max 51
* = 123 items
  0 7492 *************************************************************
 25    7 *
 50    1 *
 75    0

-> <stat> Spam scores for all runs: 7500 items; mean 99.63; sdev 4.43
-> <stat> min 0; median 100; max 100
* = 123 items
  0    2 *
 25    5 *
 50   48 *
 75 7445 *************************************************************

The uncertainty rate on ham is jaw-droppingly low there -- only 8 of 7500
hams scored above 0.  It's less sure about spam, but in the end makes the
same "but I was certain" mistakes.

Let's see how rmspik does on it:

Reading clim.pik ...
Nham= 7500
RmsZham= 9.77605846416
Nspam= 7500
RmsZspam= 10.1887670936
======================================================================
HAM:
Sure/ok       7316
Unsure/ok     183
Unsure/not ok 1
Sure/not ok   0
Unsure rate = 2.45%
Sure fp rate = 0.00%; Unsure fp rate = 0.54%
======================================================================
SPAM:
FALSE NEGATIVE: zham=-2.32 zspam=-6.04 Data/Spam/Set7/99999.txt SURE!
Sure/ok       7269
Unsure/ok     225
Unsure/not ok 5
Sure/not ok   1
Unsure rate = 3.07%
Sure fn rate = 0.01%; Unsure fn rate = 2.17%
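For reference, an RMS Z-score is presumably just the quadratic mean of the per-message z-scores; a one-liner, assuming a list of z values:

```python
import math

def rms(zscores):
    """Root-mean-square of a list of z-scores.

    Presumed definition of the RmsZham/RmsZspam figures above.  A population
    of true standard-normal z-scores would give an RMS near 1, so values
    like 9.8 say the clim3 z-scores are nowhere near unit variance.
    """
    return math.sqrt(sum(z * z for z in zscores) / len(zscores))
```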

RMS's uncertainty about spam skyrocketed under this scheme, but it did a
little better on ham (1 fp total versus 3 before).  In return, it has more
false negatives (6 total versus 3 before).