[Spambayes] For the bold

Tim Peters tim.one@comcast.net
Fri, 04 Oct 2002 15:11:11 -0400


I checked in enough stuff so that bold experimenters can play with the
central-limit schemes, but not yet enough so that I (and Rob, bless his
heart) can get a detailed picture of what's going on under the covers
(patience, please).

You CANNOT use a cross-validation test with these schemes.  So don't use
timcv or mboxtest.  timtest is fine, or any other grid driver (are there
any?).  I believe I'll need to whip up a custom driver for deeper analysis
to make progress.

You CANNOT meaningfully compare error rates between a cross-validation
driver and a grid driver.  Don't even think about it.  If you want to do
comparisons with a central-limit scheme, use a grid driver for both.

A sample .ini file:

"""
[Classifier]
use_central_limit2: True
max_discriminators: 50
zscore_ratio_cutoff: 1.9

[TestDriver]
spam_cutoff: 0.50
nbuckets: 4
"""

Note that, for now, every message gets one of just 4 distinct scores when a
central-limit scheme is in use:

0.00  -- certain it's ham
0.49  -- guesses ham but is unsure
0.51  -- guesses spam but is unsure
1.00  -- certain it's spam

That's the reason for setting nbuckets to 4:  more than that won't do you a
lick of good, as there are only 4 possible scores.  spam_cutoff must also be
exactly 0.50, and for the same reason; the "best cutoff" histogram analysis
is still displayed, but is meaningless.
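
To make the discreteness concrete, here's a guess at the shape of the
decision (illustrative only -- not the real spamprob() logic): a
central-limit scheme makes a ham-vs-spam call, separately decides whether
it's sure, and that pair collapses to exactly one of the four scores above:

"""
# Illustration only: the names and structure are assumptions, not the
# classifier's actual code.
def clt_score(guesses_spam, is_certain):
    if guesses_spam:
        return 1.00 if is_certain else 0.51
    else:
        return 0.00 if is_certain else 0.49
"""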

Nothing is known about how max_discriminators affects this.  Play!

Nothing is known about how use_central_limit (as opposed to
use_central_limit2) works with this.  Play!

When one of the central-limit schemes is in use, the list of (word, prob)
clues returned by spamprob() now has two made-up entries at the start, in
this order:

('*zham*', zham), ('*zspam*', zspam)

These are the ham and spam zscores.  So, for example, a listing of a false
positive now begins like so:

Data/Ham/Set2/143733.txt
prob = 0.51
prob('*zham*') = -65.9011
prob('*zspam*') = -53.3419
prob('header:Errors-To:1') = 0.0266272
prob('subject:: ') = 0.0266272
prob('python') = 0.0412844
...
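
If you want the zscores programmatically, they're easy to peel off the
front of the clue list.  A small sketch, assuming `clues` is the (word,
prob) list described above (how you get `clues` out of spamprob() depends
on the classifier's signature, so check classifier.py):

"""
def split_zscores(clues):
    # The two made-up entries come first, in the documented order.
    zham  = clues[0][1]   # ('*zham*', zham)
    zspam = clues[1][1]   # ('*zspam*', zspam)
    return zham, zspam, clues[2:]   # the real (word, prob) clues follow
"""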

Here's something remarkable.  I just tried this, with the .ini file given
above, like so:

    timtest.py -n5 --s=10 --h=10 -s123

In other words, this does 5**2-5 = 20 runs, training the classifier each
time on *just* 10 random ham and 10 random spam, and then predicting against
10 disjoint random ham and 10 disjoint random spam.
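
The run count is just the grid pairing: with -n5 there are 5 sets, the
driver trains on each set and predicts against every other set, and that's
5*4 = 20 ordered pairs.  A toy sketch of the counting (not timtest's actual
loop):

"""
def grid_pairs(nsets):
    # Every ordered (train_set, predict_set) pair with distinct sets
    # is one run, so nsets * (nsets - 1) runs in total.
    return [(i, j)
            for i in range(1, nsets + 1)
            for j in range(1, nsets + 1)
            if i != j]

assert len(grid_pairs(5)) == 5**2 - 5 == 20
"""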

Here's the bottom line from this run (the "all runs" histograms at the end):

-> <stat> Ham scores for all runs: 200 items; mean 5.42; sdev 15.42
-> <stat> min 0; median 0; max 51
* = 3 items
  0.0 178 ************************************************************
 25.0  19 *******
 50.0   3 *
 75.0   0

-> <stat> Spam scores for all runs: 200 items; mean 93.13; sdev 17.03
-> <stat> min 49; median 100; max 100
* = 3 items
  0.0   0
 25.0   1 *
 50.0  27 *********
 75.0 172 **********************************************************

The 0.00 score ends up in the  0.0 bucket.
The 0.49 score ends up in the 25.0 bucket.
The 0.51 score ends up in the 50.0 bucket.
The 1.00 score ends up in the 75.0 bucket.
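
The bucket placement is plain histogram arithmetic: scores are shown on a
0-100 scale and split into 4 equal-width buckets, with the top score clamped
into the last bucket.  A minimal sketch of that mapping (assumed details --
the test driver's real histogram code may differ):

"""
def bucket_edge(score, nbuckets=4):
    # Scale the 0.0-1.0 score to 0-100, find its equal-width bucket,
    # clamp 100 into the last bucket, and return the bucket's left edge
    # as printed in the histogram.
    pct = score * 100.0
    i = min(int(pct * nbuckets / 100.0), nbuckets - 1)
    return i * (100.0 / nbuckets)

assert [bucket_edge(s) for s in (0.00, 0.49, 0.51, 1.00)] == [0.0, 25.0, 50.0, 75.0]
"""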

Even with so little data, this was never wrong when it was certain.  For
ham, it was wrong 3 of the 19+3=22 times it was unsure.
For spam, it was wrong 1 of the 27+1=28 times it was unsure.

What surprised me most there, given how little training was done, is just
how often it *was* "certain".

This continues to suggest that these schemes have enormous potential, but we
still don't know how to exploit it (although with my pragmatic hat on, I'd
say we're already doing a not-too-shabby job of exploiting it <wink>).