Tim Peters tim.one at comcast.net
Sun Nov 16 00:39:52 EST 2003

[Seth Goodman]

> Well, what bothers me, so far, is that despite training on 620 ham
> and 1403 spam, SpamBayes still manages to miss (score as ham) 5-10
> messages per day out of around 150 scored messages.

I get about 700 emails per day, about 200 of them spam, and see one or two
spam left in my inbox per week.  My training data is a little better
balanced than yours, with about the same total number of messages, and I'm
certain I haven't trained any messages into the wrong category.  My cutoffs
are at 20 and 80.  The vast bulk of my ham comes from technical mailing
lists, which appear exceptionally easy to identify as ham.
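
For readers unsure what the two cutoffs do, here is a minimal sketch of how
a pair of cutoffs splits scores into ham, unsure and spam.  It assumes
scores on a 0-100 scale; the function name `classify` and the exact
behavior at the boundaries are illustrative assumptions, not SpamBayes's
actual code.

```python
# Illustrative only: names and boundary handling are assumptions,
# not taken from the SpamBayes source.
HAM_CUTOFF = 20.0   # scores below this are treated as ham
SPAM_CUTOFF = 80.0  # scores at or above this are treated as spam


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a 0-100 spam score to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


if __name__ == "__main__":
    for s in (2.0, 50.0, 95.0):
        print(s, classify(s))
    # A spam that scores near zero stays "ham" even with a much
    # lower ham cutoff, so moving the cutoff alone cannot rescue it.
    print(classify(2.0, ham_cutoff=5.0))
```

The last call shows why lowering the ham cutoff does not help with a
message whose score is already very close to zero.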

> Most of these missed spams have an initial score very close to zero, so
> simply lowering the ham threshold would not fix it.

There's no way to diagnose this without staring at the evidence the
classifier used to reach its decision.  It's not magic, it's just throwing a
bunch of numbers at each other <wink>.  I don't recall which SpamBayes
application you're using.  If it's the Outlook addin, just do

    Spambayes ->
        Show spam clues for current message
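
To make "throwing a bunch of numbers at each other" concrete, here is a
rough sketch of chi-squared combining, the general scheme SpamBayes is
known for: each clue is a token with a spam probability, and the clues are
combined into one score.  This is a simplified illustration under my own
assumptions (function names, no clue selection, probabilities strictly
between 0 and 1), not the project's actual classifier code.

```python
import math


def chi2q(x2, v):
    """Upper-tail chi-squared probability for even degrees of freedom v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)


def combine(probs):
    """Combine per-token spam probabilities into a score in [0, 1].

    Each probability must be strictly between 0 and 1.
    """
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)  # spam indicator
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)      # ham indicator
    return (s - h + 1.0) / 2.0


if __name__ == "__main__":
    hammy = [0.05] * 10
    print(combine(hammy))             # many hammy clues
    print(combine(hammy + [0.99]))    # same, plus one strong spam clue
    print(combine([0.95] * 10))       # many spammy clues
```

In this sketch, one strong spam clue barely moves a message that carries
many strong ham clues, which is the kind of thing the clue report makes
visible.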

> After training as spam, their spam score often increases respectably,
> but sometimes, the score stays below 5%.  This indicates that the same
> message would be missed next time, as well.


> I don't know if I just need to get a bigger or more balanced training
> set, if there are some types of tokens (such as embedded URL's in HTML
> spam) that are not currently parsed or if this is just as good as it
> gets.

It sounds unreasonably bad to me, but we're not going to *guess* the cause.
If you generate a report containing the evidence the classifier used, that
will tell us exactly why the message got its score.  Best guess is that
something is screwed up in your training data.  This *sounds* like symptoms
some others have had before they discovered that they trained some messages
into the wrong category.

