[Spambayes] RE: Central Limit Theorem??!! :)

Tim Peters tim.one@comcast.net
Sat, 28 Sep 2002 16:01:23 -0400


This is a log-central-limit experiment with a 10x10 grid.  500 ham and 500
spam were selected at random from each of my sets and lumped into 10
(ham, spam) pairs.  Then

train on pair 1, predict on pairs 2 thru 10
train on pair 2, predict on pairs 1, and 3 thru 10
...
train on pair 10, predict on pairs 1 thru 9

In all, that's 90 prediction runs on 1000 msgs per run, for 90,000 total
predictions.  Each of the 10*1000 = 10,000 unique msgs is predicted 9 times.
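In skeleton form, the grid driver looks like this -- just a sketch, with
stubs standing in for the real test-driver machinery (pairs, train() and
predict() are made up here):

    # Sketch of the 10x10 grid.  pairs[i] is the i'th (500 ham, 500 spam)
    # pair; train() and predict() are stubs, not the real machinery.
    npairs = 10
    pairs = [('Ham%d' % i, 'Spam%d' % i) for i in range(npairs)]

    def train(pair):
        return pair       # would build a classifier from one pair

    def predict(classifier, pair):
        pass              # would score all 1000 msgs in the pair

    for i in range(npairs):
        c = train(pairs[i])
        for j in range(npairs):
            if j != i:    # predict on the other nine pairs
                predict(c, pairs[j])

    # 10 classifiers * 9 runs each * 1000 msgs = 90,000 predictions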

This is a hard test, as 500 ham + 500 spam isn't a lot of training data, and
each of the 10 classifiers has to predict against 9x more msgs than it's
been taught about.

There are only 4 distinct scores in this experiment:

0.00    I'm sure it's ham.
0.49    I guess it's ham, but I may be baffled.
0.51    I guess it's spam, but I may be baffled.
1.00    I'm sure it's spam.

This code replaced central_limit_spamprob2()'s score computation (the first
3 lines are already there; I include them for context):

        zham = (hmean - self.hammean) / sqrt(self.hamvar / n)
        zspam = (smean - self.spammean) / sqrt(self.spamvar / n)
        stat = abs(zham) - abs(zspam)  # > 0 for spam, < 0 for ham

        if min(abs(zham), abs(zspam)) < 10.0 and abs(stat) > 5.0:
            # Certain:  at least one z-score is sane, and the z-scores
            # differ by a lot.  Score 1.0 for spam, 0.0 for ham.
            stat = stat > 0.0 and 1.0 or 0.0
        else:
            # Unsure:  log the evidence, and score just barely to one
            # side of 0.5.
            print '*', n, zham, zspam, stat,
            if stat > 0.0:
                stat = 0.51
            else:
                stat = 0.49
            print stat

IOW, it's certain iff at least one z-score is "not insanely large" (under
10), and the two z-scores differ by more than 5.  This is purely a hack
just to see what would happen.  No direct accounting is made of the number
of words in a message, although because n is in the denominators of the
sample mean variances, a small n leads to larger variances than a large n;
that shrinks both z-scores, and the gap between them, and so makes it much
harder for the "and abs(stat) > 5.0" clause to succeed.
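To make the n effect concrete, here's a toy computation.  The population
numbers are made up (though in the ballpark of the stats listed later),
and hmean/smean describe a fake hammy-looking msg:

    from math import sqrt

    # Toy illustration:  the same sample means yield z-scores -- and a
    # z-score gap -- sqrt(n) times smaller as n shrinks, so a msg with
    # few words rarely passes "abs(stat) > 5.0".
    hammean, hamvar   = -0.27, 0.24   # made-up population stats
    spammean, spamvar = -0.15, 0.08
    hmean = smean = -0.50             # a hammy-looking msg

    for n in (50, 5):
        zham = (hmean - hammean) / sqrt(hamvar / n)
        zspam = (smean - spammean) / sqrt(spamvar / n)
        stat = abs(zham) - abs(zspam)
        print n, zham, zspam, stat

    # n=50:  zham -3.32, zspam -8.75, stat -5.43 -> certain ham
    # n=5:   zham -1.05, zspam -2.77, stat -1.72 -> unsure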

Relevant non-default options:

"""
[Tokenizer]
mine_message_ids: True

[Classifier]
count_duplicates_only_once_in_training: True
max_discriminators: 50
use_central_limit2: True

[TestDriver]
spam_cutoff: 0.50
nbuckets: 4
"""

Note that I boosted max_discriminators from its default of 30 to 50.  I'm
not sure why I did.
Why not <wink>?

The bottom lines are in the tiny histograms:

-> <stat> Ham scores for all runs: 45000 items; mean 1.18; sdev 7.51
* = 720 items
  0.0 43917 *************************************************************
 25.0  1069 **    unsure and guessed right
 50.0    14 *     unsure and guessed wrong
 75.0     0       sure and guessed wrong

-> <stat> Spam scores for all runs: 45000 items; mean 99.04; sdev 6.91
* = 724 items
  0.0    12 *     sure and guessed wrong
 25.0    56 *     unsure and guessed wrong
 50.0   802 **    unsure and guessed right
 75.0 44130 *************************************************************

So,

When predicting ham:
    97.59% of the time it was sure of itself.
        In those cases, it was never wrong.
     2.41% of the time it had strong reason to doubt itself.
        In those cases, it guessed wrong 1.29% of the time.

When predicting spam:
    98.09% of the time it was sure of itself.
        In those cases, it was wrong 0.027% of the time.
     1.91% of the time it had strong reason to doubt itself.
        In those cases, it guessed wrong 6.53% of the time.
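For the record, those percentages fall straight out of the bucket counts:

    # Deriving the percentages from the histogram buckets above.
    for kind, sure_right, unsure_right, unsure_wrong, sure_wrong in [
            ('ham',  43917, 1069, 14,  0),
            ('spam', 44130,  802, 56, 12)]:
        total = sure_right + unsure_right + unsure_wrong + sure_wrong
        sure = sure_right + sure_wrong
        unsure = unsure_right + unsure_wrong
        print '%-4s sure %.2f%% (wrong %.3f%%),' % (
            kind, sure * 100.0 / total, sure_wrong * 100.0 / sure),
        print 'unsure %.2f%% (wrong %.2f%%)' % (
            unsure * 100.0 / total, unsure_wrong * 100.0 / unsure)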

Recall that each message is predicted against by 9 different classifiers
here.  The 12 cases in which spam was misidentified are due to fewer than 12
distinct messages.  The first has few words (so the central-limit gimmick
has an excuse <wink>):

"""
Data/Spam/Set6/11926.txt
prob = 0.0
prob('steve') = 0.0847751
prob('mike,') = 0.0918367
prob('subject:Thanks') = 0.155172
prob('regards,') = 0.171593
prob('thanks') = 0.285091
prob('content-type:text/plain') = 0.308556
prob('again') = 0.640598
prob('header:Reply-To:1') = 0.701388
prob('best') = 0.736387
prob('header:Message-Id:1') = 0.823243
prob('header:Return-Path:1') = 0.906194
prob('subject:For') = 0.908163
prob('great!') = 0.908163

Return-Path: <joelackey@ns1.ehost2102.com>
Delivered-To: em-ca-bait@em.ca
Received: (qmail 15585 invoked from network); 28 Feb 2002 18:30:34 -0000
Received: from unknown (HELO smtp4.westnet24.com) (64.215.52.101)
  by agamemnon.bfsmedia.com with SMTP; 28 Feb 2002 18:30:34 -0000
Date: Thu, 28 Feb 2002 13:37:57 -0500
Message-Id: <200202281837.g1SIbvi09018@smtp4.westnet24.com>
X-Mailer: Mozilla 4.6 [en] (WinNT; U)
Reply-To: <joelackey@ns1.ehost2102.com>
From: <joelackey@ns1.ehost2102.com>
To: <bait_7@earthlink.net>
Subject: Thanks For Lunch Mike                    beksng
Content-Length: 71
Lines: 10

Mike,


  Thanks again for lunch it was great!


Best regards,

Steve
"""

The second also had few words, and was a mass of JavaScript without any
whitespace except for arbitrary line breaks (each line is about the same
length).  The "words" show that the tokenizer really had no idea what to do
with this (see the note after the list for where "skip" tokens come from):

prob('skip:" 30') = 0.0412844
prob('function') = 0.072284
prob('skip:( 50') = 0.0918367
prob('skip:" 70') = 0.0918367
prob('skip:" 60') = 0.155172
prob('skip:f 20') = 0.292388
prob('subject:new') = 0.298658
prob('skip:" 10') = 0.341427
prob('header:Message-ID:1') = 0.368361
prob('else') = 0.373061
prob('header:MIME-Version:1') = 0.634608
prob('charset:iso-8859-1') = 0.749673
prob('subject:information') = 0.844828
prob('header:Return-Path:1') = 0.906194
prob('skip:f 50') = 0.908163
prob('skip:= 10') = 0.958716
prob('content-type:text/html') = 0.999284
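The "skip" tokens are the tokenizer's way of recording over-long runs it
gives up on:  roughly the first character plus the length, apparently
rounded down to a multiple of 10 (judging from the buckets above).  A
sketch of the idea -- not the tokenizer's exact rules:

    # Hypothetical sketch of where "skip" tokens come from:  an over-long
    # "word" is replaced by a token recording its first character and its
    # rough length.  max_len here is made up.
    def skip_token(word, max_len=12):
        if len(word) <= max_len:
            return word
        return 'skip:%c %d' % (word[0], len(word) // 10 * 10)

    print skip_token('document.write("<script>...")')  # -> skip:d 20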

I got too bored to continue then.  Unfortunately, I didn't have enough
output to determine what the z-scores for these guys were.

I'm not a statistician, and I'm sure there are better ways to extract "a
decision" and "a confidence" out of the z-scores; this was just a quick hack
to see whether it had promise.

One thing I noted here is that, with this little training data, the
population ham variances didn't hold steady across classifiers (these are
the population mean and variance of the ln(1-spamprob) "extreme word" ham
statistic, one line per classifier):

hammean -0.275301462139 hamvar 0.241400997965
hammean -0.270075033304 hamvar 0.237971084164
hammean -0.27729534552  hamvar 0.242090771903
hammean -0.28018840187  hamvar 0.242071754063
hammean -0.262743024167 hamvar 0.226818888396
hammean -0.293745611772 hamvar 0.272036821231
hammean -0.271746041979 hamvar 0.232452901988
hammean -0.24251831616  hamvar 0.186177919836
hammean -0.256875930876 hamvar 0.208263894923
hammean -0.264274555704 hamvar 0.221385382894

Ditto for population spam variance:

spammean -0.147931424503 spamvar 0.0733613298447
spammean -0.157632947165 spamvar 0.0966048083774
spammean -0.153901431979 spamvar 0.0750210065953
spammean -0.143638282908 spamvar 0.0653981265422
spammean -0.152840521441 spamvar 0.0825987634543
spammean -0.146861931682 spamvar 0.0760974846927
spammean -0.150647341728 spamvar 0.0733247594002
spammean -0.167612616966 spamvar 0.10231334993
spammean -0.163965666611 spamvar 0.0985899485577
spammean -0.152115124735 spamvar 0.0776889891763
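Quantifying that a bit:  the max/min spread of the variances above is
about 1.46x for ham and 1.56x for spam.

    # Spread of the 10 population variances listed above.
    hamvars = [0.241400997965, 0.237971084164, 0.242090771903,
               0.242071754063, 0.226818888396, 0.272036821231,
               0.232452901988, 0.186177919836, 0.208263894923,
               0.221385382894]
    spamvars = [0.0733613298447, 0.0966048083774, 0.0750210065953,
                0.0653981265422, 0.0825987634543, 0.0760974846927,
                0.0733247594002, 0.10231334993, 0.0985899485577,
                0.0776889891763]
    for name, vs in [('hamvar', hamvars), ('spamvar', spamvars)]:
        print '%s min %.3f max %.3f ratio %.2fx' % (
            name, min(vs), max(vs), max(vs) / min(vs))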


Finally, exactly the same test but with the current default scheme, fiddled
via

"""
[Tokenizer]
mine_message_ids: True

[Classifier]
count_duplicates_only_once_in_training: True

[TestDriver]
spam_cutoff: 0.56
"""

(spam_cutoff is reduced from the default because
count_duplicates_only_once_in_training reduces both score means.)

total unique false pos 7
total unique false neg 27
average fp % 0.0555555555556
average fn % 0.142222222222

That's not too shabby itself, although note that "total unique" counts each
distinct msg misclassified only once:  e.g., if 9 classifiers all call a
particular ham a spam, it counts just once against the "total unique false
pos" count.  (The per-run averages work out to 25 fp and 64 fn predictions
in all:  0.0556% and 0.142% of 500 msgs, over 90 runs.)  That makes it hard
to compare these runs directly.

I noticed that the two specific squashed-JavaScript and "Thanks again for
lunch" spams mentioned above were popular false negatives in this run too.