[Spambayes] RE: For the bold
Tim Peters
tim.one@comcast.net
Sat, 05 Oct 2002 19:32:11 -0400
[Rob Hooft]
> I am attaching my new version of clpik.py that implements my RMS
> Z-score ideas.
Cool! I'm going to check this into the project, but under the name
rmspik.py. People playing along: you DO NOT need to rerun a test to try
this! rmspik.py analyzes the binary pickle (clim.pik) left behind by
clgen.py (the central-limit analysis test driver), and very quickly (a
matter of seconds) determines exactly what would have happened had we used
Rob's RMS certainty rules instead.
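For anyone wondering what the RmsZham and RmsZspam figures in the rmspik output stand for: the names suggest a root-mean-square over the per-message z-scores. Assuming that reading, here is a minimal sketch of the arithmetic. The function name rms_z is made up for illustration, and how Rob's rules then use the result to decide "sure" versus "unsure" isn't shown here.

```python
import math

def rms_z(zscores):
    """Root-mean-square of a collection of z-scores (hypothetical helper)."""
    zscores = list(zscores)
    if not zscores:
        raise ValueError("need at least one z-score")
    # Square each z-score, average the squares, take the square root.
    return math.sqrt(sum(z * z for z in zscores) / len(zscores))

# z-scores with unit spread give an RMS of exactly 1 in this toy case.
print(rms_z([1.0, -1.0, 1.0, -1.0]))  # 1.0
```

If the raw z-scores really followed a unit normal, this would come out near 1, so values well above 1 would mean they are spread a good deal wider than that.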
> Some results I get are listed hereunder. I'm very interested to
> hear what other people get with this!
Here's a use_central_limit2 run with max_discriminators=50, trained on 5000
ham and 5000 spam, then predicting against 7500 of each:
-> <stat> Ham scores for all runs: 7500 items; mean 0.14; sdev 2.72
-> <stat> min 0; median 0; max 100
* = 123 items
  0 7480 *************************************************************
 25   18 *
 50    1 *
 75    1 *
-> <stat> Spam scores for all runs: 7500 items; mean 99.86; sdev 2.85
-> <stat> min 0; median 100; max 100
* = 123 items
  0    2 *
 25    1 *
 50   16 *
 75 7481 *************************************************************
Under rmspik,
Reading clim.pik ...
Nham= 7500
RmsZham= 2.27249107964
Nspam= 7500
RmsZspam= 2.354280998
======================================================================
HAM:
Sure/ok 7325
Unsure/ok 172
Unsure/not ok 3
Sure/not ok 0
Unsure rate = 2.33%
Sure fp rate = 0.00%; Unsure fp rate = 1.71%
======================================================================
SPAM:
FALSE NEGATIVE: zham=-2.39 zspam=-4.93 Data/Spam/Set7/99999.txt SURE!
Sure/ok 7422
Unsure/ok 75
Unsure/not ok 2
Sure/not ok 1
Unsure rate = 1.03%
Sure fn rate = 0.01%; Unsure fn rate = 2.60%
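The percentages in that output can be reproduced from the four counts per category. This is a minimal sketch, not code from rmspik.py: the function name and the choice of denominators (all messages for the unsure rate, "sure" messages for the sure error rate, "unsure" messages for the unsure error rate) are assumptions, picked because they match the printed figures.

```python
def rmspik_rates(sure_ok, unsure_ok, unsure_bad, sure_bad):
    """Return (unsure rate, sure error rate, unsure error rate), in percent.

    Hypothetical helper; denominators are inferred from the printed output.
    """
    total = sure_ok + unsure_ok + unsure_bad + sure_bad
    unsure = unsure_ok + unsure_bad
    sure = sure_ok + sure_bad
    unsure_rate = 100.0 * unsure / total
    sure_err = 100.0 * sure_bad / sure if sure else 0.0
    unsure_err = 100.0 * unsure_bad / unsure if unsure else 0.0
    return unsure_rate, sure_err, unsure_err

# Counts copied from the HAM and SPAM sections of the output.
ham = rmspik_rates(7325, 172, 3, 0)
spam = rmspik_rates(7422, 75, 2, 1)
print("ham:  unsure %.2f%%; sure fp %.2f%%; unsure fp %.2f%%" % ham)
print("spam: unsure %.2f%%; sure fn %.2f%%; unsure fn %.2f%%" % spam)
```

Note that the "Unsure fp rate" is a fraction of the unsure messages only, which is why 3 mistakes out of 7500 ham can show up as 1.71%.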
So RMS was unsure much more often, and especially unsure about ham. In the
end RMS had one more false positive (3 versus clim2's 2), but all 3 were in
its region of uncertainty. They both had 3 false negatives, but RMS had one
fewer in its region of certainty. The sole f-n it was certain about is also
one clim2 was certain about, and is a spam with a uuencoded body that we
don't decode. This is a tradeoff in the tokenizer: it simply doesn't
generate enough clues to nail this one (10 "words" total). It's especially
embarrassing because the subject line is
Subject: HOW TO BECOME A MILLIONAIRE IN WEEKS!!
Sheesh <wink>.
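The fp and fn tallies in that comparison can be checked by hand from the histograms and the rmspik counts. A minimal sketch, assuming a clim2 score in a bucket starting at 50 or above counts as "called spam"; that cutoff is an inference, chosen because it reproduces the 2 fp and 3 fn quoted above, and the helper names are made up for illustration.

```python
# Bucket start -> count, copied from the use_central_limit2 histograms.
ham_hist = {0: 7480, 25: 18, 50: 1, 75: 1}
spam_hist = {0: 2, 25: 1, 50: 16, 75: 7481}

CUTOFF = 50  # assumed spam cutoff, not stated in the output itself

def called_spam(hist, cutoff=CUTOFF):
    """Count messages in buckets at or above the cutoff."""
    return sum(n for start, n in hist.items() if start >= cutoff)

def called_ham(hist, cutoff=CUTOFF):
    """Count messages in buckets below the cutoff."""
    return sum(n for start, n in hist.items() if start < cutoff)

clim2_fp = called_spam(ham_hist)  # ham that was called spam
clim2_fn = called_ham(spam_hist)  # spam that was called ham

# rmspik totals: "Unsure/not ok" plus "Sure/not ok" from its output.
rms_fp = 3 + 0
rms_fn = 2 + 1

print("clim2: fp=%d fn=%d" % (clim2_fp, clim2_fn))  # clim2: fp=2 fn=3
print("rms:   fp=%d fn=%d" % (rms_fp, rms_fn))      # rms:   fp=3 fn=3
```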
BTW, for python.org use, an uncertainty rate over 2% may not fly -- Greg
already gripes about reviewing a trivial number of msgs each day.
Now all over again, but with use_central_limit3; max_discriminators still
50, and same sets of msgs trained on and predicted against:
-> <stat> Ham scores for all runs: 7500 items; mean 0.05; sdev 1.61
-> <stat> min 0; median 0; max 51
* = 123 items
  0 7492 *************************************************************
 25    7 *
 50    1 *
 75    0
-> <stat> Spam scores for all runs: 7500 items; mean 99.63; sdev 4.43
-> <stat> min 0; median 100; max 100
* = 123 items
  0    2 *
 25    5 *
 50   48 *
 75 7445 *************************************************************
The uncertainty rate on ham is plain jaw-dropping there: only 8 of the 7500
ham landed outside the 0 bucket. It's less sure about spam, but in the end
makes the same "but I was certain" mistakes.
Let's see how rmspik does on it:
Reading clim.pik ...
Nham= 7500
RmsZham= 9.77605846416
Nspam= 7500
RmsZspam= 10.1887670936
======================================================================
HAM:
Sure/ok 7316
Unsure/ok 183
Unsure/not ok 1
Sure/not ok 0
Unsure rate = 2.45%
Sure fp rate = 0.00%; Unsure fp rate = 0.54%
======================================================================
SPAM:
FALSE NEGATIVE: zham=-2.32 zspam=-6.04 Data/Spam/Set7/99999.txt SURE!
Sure/ok 7269
Unsure/ok 225
Unsure/not ok 5
Sure/not ok 1
Unsure rate = 3.07%
Sure fn rate = 0.01%; Unsure fn rate = 2.17%
RMS's uncertainty about spam skyrocketed under this scheme (3.07% unsure
versus 1.03% before), but it did a little better on ham (1 fp total versus
3 before). In return, it has more fn (6 total vs 3 before).