[Spambayes] Some more experiences with the Outlook plugin

Tim Peters tim.one@comcast.net
Tue Nov 12 00:21:20 2002


[Moore, Paul]

You might want to play along with "the other" training strategy we're
trying:  last week I wiped my database and started over from scratch,
training it *only* on mistakes and unsures.  It's been through a few thousand
msgs since then, but so far I've trained it on only 51 ham and 55 spam.  The
Unsures are weird, but the Unsure rate is falling, and it makes very few
outright mistakes now (BTW, I have ham_cutoff at 20 and spam_cutoff at 80 in
the Outlook client).
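
For concreteness, here's a minimal sketch of the three-way decision those
cutoffs imply, with the cutoffs expressed as probabilities (the function name
and the exact boundary handling are my assumptions, not necessarily what the
plugin does):

```python
# Tri-state scoring with ham_cutoff=20 and spam_cutoff=80 from above,
# expressed as probabilities in [0, 1].  Boundary handling is an assumption.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.80

def classify(score):
    """Map a message's spam probability to 'ham', 'unsure', or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"
```

Anything between the cutoffs lands in the Unsure folder, which is exactly the
stuff worth training on.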

> ...
> 1. To start with, configure the plugin to define one "Spam" folder and
>    one "Unsure" folder, and define all other folders as "Ham". [1]

> [1] I got this wrong at the start - the key point to stress here is
>     that *everything* that isn't spam is ham - by definition. Trying
>     to "help" the classifier by telling it to ignore messages which
>     you "know" are ham is actually detrimental - if you know, let the
>     classifier find out!

We don't have a way to train on a random sample now, and that's going to be
a killer for some people (e.g., Sean True has 2 gigabytes of ham).

> 2. Train the classifier on whatever you have available. This will
>    usually be massively overbalanced in favour of ham (few people
>    collect their spam) but it *will* make a start. [2]

> [2] I'm getting pretty good results now (but see below), with 5661
>     ham and 303 spam, but even with under 100 spam (admittedly with
>     less ham, as I made the "exclude some ham" mistake) I was getting
>     visible benefits.

My guess is that you'd do better by striving for no more than a 3:1
imbalance in either direction.  There are reasons to despise the "purely
mistake-based training" described at the top, but it does seem to keep the
training sets in rough balance quite naturally.

> 3. Run with this for a while, incrementally training on mistakes and
>    unsures.

Training on those is vital no matter what else you do.
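
That regime -- train only when the classifier is wrong or unsure -- can be
sketched as a driver loop.  The score()/learn() interface here is hypothetical,
standing in for whatever the real classifier exposes:

```python
def train_on_mistakes(classifier, messages, ham_cutoff=0.20, spam_cutoff=0.80):
    """Train only on messages the classifier gets wrong or calls unsure.

    `messages` is an iterable of (text, is_spam) pairs.  `classifier` needs
    score(text) -> probability in [0, 1] and learn(text, is_spam) methods
    (a hypothetical interface, not the actual plugin API).
    Returns the number of messages trained on.
    """
    trained = 0
    for text, is_spam in messages:
        score = classifier.score(text)
        # Confident and correct: a spam scoring above spam_cutoff, or a
        # ham scoring below ham_cutoff.  Everything else gets trained.
        confident_right = (score >= spam_cutoff) if is_spam else (score < ham_cutoff)
        if not confident_right:
            classifier.learn(text, is_spam)
            trained += 1
    return trained
```

Note that nothing here looks at how many ham or spam have been trained so
far; the rough balance falls out on its own, because a lopsided database
starts making mistakes on the underrepresented side.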

> Keep all of the spam!

I'm afraid that one won't fly over time, except for researchers.  And people
boldly using unstable pre-alpha code <wink>.

> 4. Periodically, retrain the full database on all the collected ham
>    and spam.

That shouldn't be necessary when the code is complete and stable.


> Other points:
>
> * The collection I end up with is still biased - there are a lot of
>   ham messages which I just read and delete, and they are probably
>   somehow "similar". While I could retain these, this would require
>   a much more significant change to my way of working.

Keep working the way you like!  The client should eventually be able to
deduce what's ham by watching you throw away things without first calling
them spam.

> * Results still seem to be pretty much hapax based (if I understand the
>   term and its usage). Looking at the clues for a message often shows
>   some pretty bizarre tokens showing up as *either* sort of clue. (One
>   message showed 'yet' as a ham clue with a probability of 0.000877364!)

"Hapax" means a word that appeared only once in your entire training corpus.
In the list you gave below, there are very few hapaxes (I recognize them
from the probabilities; I should probably add code to the client to display
the raw counts too):

> 'sweet'                        0.155172  these 4 appeared in one ham
> 'ads,'                         0.155172
> 'insult'                       0.155172
> 'subject:COMPUTER'             0.155172

> 'membership.'                  0.844828  these 3 appeared in one spam,
> 'home-based'                   0.844828  presumably itself since you
> 'cash.'                        0.844828  said you trained on it

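Those telltale values fall straight out of the Robinson-style adjustment the
classifier applies to each word's raw probability.  Recalling it from memory
(with the default strength s = 0.45 and background probability x = 0.5): a
word seen once in ham and never in spam gets (s*x + 0)/(s + 1) = 0.155172,
and its spam-side mirror image gets (s*x + 1)/(s + 1) = 0.844828, regardless
of corpus sizes:

```python
S = 0.45   # prior strength (SpamBayes default, from memory)
X = 0.5    # background probability assumed for an unseen word

def spamprob(hamcount, spamcount, nham, nspam):
    """Robinson-adjusted spam probability for a single word.

    hamcount/spamcount: messages containing the word in each corpus;
    nham/nspam: total trained messages in each corpus.
    """
    hamratio = hamcount / nham if nham else 0.0
    spamratio = spamcount / nspam if nspam else 0.0
    if hamratio + spamratio:
        basic = spamratio / (hamratio + spamratio)
    else:
        basic = X
    n = hamcount + spamcount
    return (S * X + n * basic) / (S + n)

# A hapax lands on the telltale values quoted above:
print(round(spamprob(1, 0, 5661, 303), 6))   # 0.155172 (one ham, no spam)
print(round(spamprob(0, 1, 5661, 303), 6))   # 0.844828 (one spam, no ham)
```

The corpus sizes cancel out for a hapax, which is why every one-ham word
shows the identical 0.155172 and every one-spam word the identical 0.844828.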


> * Following on from this, I also see Tim's behaviour of surprising
>   unsure cases (or worse, false negatives!).

I expect for a very different reason, though:  your 18:1 ham:spam imbalance.
This implies words can get spamprobs much closer to 0 than they can get to
1:  there just isn't enough spam to *justify* spamprobs as close to 1 as your
far larger ham collection justifies spamprobs close to 0.  Let's look at the 3
most extreme words on both ends of your listing:

> '(and'                         0.00044603
> 'looking'                      0.000489716
> 'added'                        0.000613999

> 'subject:your'                 0.973253
> 'click'                        0.974006
> '"remove"'                     0.985437

'(and' is nearly "33 times closer" to 0 than '"remove"' is to 1, and that
makes the accidental appearance of a ham word in spam much more powerful
than the systematic appearance of a spam word in spam.  If you only had 300
ham in your training set, it would be much harder for a word to get a very
low spamprob; conversely, if you had 5500 spam in your training, it would be
much easier for a word to get a very high spamprob.  As is, your strong ham
words are much more powerful than your strong spam words, and almost *must*
be.
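
Under the same adjustment (again sketched from memory), the closest a
ham-only word can get to 0 is bounded by how much ham you've trained, and
symmetrically for spam.  With 5661 ham against 303 spam, the strongest
possible ham word sits roughly 18-19 times closer to 0 than the strongest
possible spam word sits to 1 -- essentially your 18:1 imbalance:

```python
S, X = 0.45, 0.5   # SpamBayes defaults, from memory

def best_ham_distance(nham):
    """Distance from 0 for a word seen in every ham and no spam."""
    # prob = (S*X + 0) / (S + nham), since the basic probability is 0.
    return (S * X) / (S + nham)

def best_spam_distance(nspam):
    """Distance from 1 for a word seen in every spam and no ham."""
    # prob = (S*X + nspam) / (S + nspam); 1 - prob simplifies to this.
    return (S * X) / (S + nspam)

ratio = best_spam_distance(303) / best_ham_distance(5661)
print(round(ratio, 1))   # 18.7: ham words can get ~18.7x closer to 0
```

Both distances shrink like 1/n, so the achievable extremes track the corpus
sizes directly: more ham means stronger ham words, and nothing about spam
words can compensate.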

Anthony Baxter here routinely runs with a ridiculous <wink> ham:spam ratio
too, but you're way beyond even him (his is about 6:1).  This brings out
effects I've never seen before.

>   Worst case recently was a message which scored as solid ham.  I
>   trained on it as "Spam", and rescored it. It still scored 5 - solid
>   ham.

That's because you're *not* hapax-driven.  If you were, the score would have
shot up to 100 (maybe 99).  All ham contains spam words, and my guess is
you've got so much more ham than spam that it's drowning out the spam.
That's picturesque but inaccurate <wink>.  A more accurate speculation
was given above.

>   My immediate reaction was "But I just *told* you it's spam!". I know
>   that isn't how the classifier works, but even so it was unsettling.
>   FWIW, I attach the spam clues for this one (I don't know if they make
>   any sense in isolation, but it can't hurt...)

No more than what I copied above.  If you like, send me the original (as an
attachment), and I'll score it under my well-trained classifier (the one I
parked last week when starting the mistake-only training experiment).  That
one was trained on about 2 thousand recent spam.

If that works better for me than for you, then I'd like to try another
experiment, shipping you just the stronger-than-hapax spam words from that
classifier, along with a bit of code you can run to *merge* that into your
own classifier.  That would be an experiment in "seeding" a classifier,
something we haven't gotten a good start on here yet.
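
The merge itself could be as simple as summing per-word counts, taking only
the stronger-than-hapax entries from the seed.  The dict-of-count-pairs
layout here is a stand-in for illustration, not the classifier's actual
storage format:

```python
def merge_wordinfo(own, seed, min_count=2):
    """Merge another classifier's word counts into your own.

    Both arguments map word -> (hamcount, spamcount); only seed entries
    seen at least `min_count` times (i.e. stronger than hapaxes) are
    merged in.  (Hypothetical data layout, for illustration only.)
    """
    for word, (hamcount, spamcount) in seed.items():
        if hamcount + spamcount >= min_count:
            own_ham, own_spam = own.get(word, (0, 0))
            own[word] = (own_ham + hamcount, own_spam + spamcount)
    return own
```

Note the recipient would also need to bump its total trained-message counts
to keep the merged ratios meaningful -- that's the part that makes seeding a
real experiment rather than a dict update.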

> * I don't know how long it will be before I start grudging the use of
>   disk space to store spam. At that point, the nasty question of
>   whether I keep it, or risk being unable to recreate my database,
>   becomes important.

At 300 measly spam saved, I should remind you that a gigabyte of disk space
costs less than the value of your time worrying about it <wink>.

> I need to look at how to get some more information out of the
> classifier, to try to understand how much of the good results I see
> are down to luck (hapaxes, I guess - which makes me think of "happy
> accidents" rather than its real meaning...)

Cool!  When hapaxes work, they *are* happy accidents!  I like it.

> and hence is fragile, and how much is actually solid.  Can anyone point
> me at the right part of the code to read to find this?

classifier.py contains all the code for probability estimation and scoring.



