[Spambayes] To label or not to label, a practical question

Fri Jul 8 05:53:13 CEST 2005

> -----Original Message-----
> From: spambayes-bounces at python.org 
> [mailto:spambayes-bounces at python.org] On Behalf Of Michael D. Adams
> Sent: Thursday, July 07, 2005 10:26 PM
> To: spambayes at python.org
> Subject: [Spambayes] To label or not to label, a practical question
> 
> My ISP provides a spam filtering service (server side) that 
> labels the things that they think are spam by putting an 
> extra string in the subject like (e.g. "--Spam--" at the 
> front).  Their filters don't catch everything so I want to 
> also use SpamBayes to eliminate the spam that my ISP doesn't 
> label.  My question is whether or not I should train 
> SpamBayes with the spams that get labeled by my ISP.  I could 
> easily see SpamBayes picking up on the "--Spam--" string in 
> the subject line and filtering just based on that.  

Tony (who is much more knowledgeable than I on this product)
has already answered so merely consider the following:

If you rigorously train on the false positives (from your ISP),
then these will show that SOME Ham does have this tag, and thus
the tag will NOT be a sure Spam sign.

If the ISP is "always right" then the tag will be a (relatively)
reliable spam sign, and that is probably what you want.
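
To make the two cases above concrete, here is a rough sketch of
how a per-token spam probability could be estimated from training
counts. It uses a Robinson-style smoothed estimate; the function
name, the constants s=0.45 and x=0.5, and the counts (borrowed
from the hypothetical 50 ham / 40 tagged spam / 10 untagged spam
in the question) are assumptions for illustration, not SpamBayes'
actual code.

```python
def token_spam_prob(spam_with, ham_with, n_spam, n_ham, s=0.45, x=0.5):
    """Smoothed spam probability for a single token (illustrative only).

    spam_with / ham_with: trained spam / ham messages containing the token.
    n_spam / n_ham: total trained spam / ham messages.
    s, x: assumed prior strength and prior probability for rare tokens.
    """
    spam_ratio = spam_with / n_spam if n_spam else 0.0
    ham_ratio = ham_with / n_ham if n_ham else 0.0
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the prior
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_with + ham_with
    # Pull the raw estimate toward x when there is little evidence.
    return (s * x + n * p) / (s + n)


# ISP tag seen in 40 of 50 spam and never in ham: a strong spam clue.
always_right = token_spam_prob(spam_with=40, ham_with=0, n_spam=50, n_ham=50)

# Same tag, but 5 of 50 ham were tagged by mistake and trained as ham.
sometimes_wrong = token_spam_prob(spam_with=40, ham_with=5, n_spam=50, n_ham=50)

print(f"tag never on ham:     {always_right:.3f}")
print(f"tag sometimes on ham: {sometimes_wrong:.3f}")
```

The second value comes out lower than the first, which is the
point: once the mistakes are trained as ham, the tag stops being
treated as near-certain evidence of spam.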

Just keep training on all mistakes -- that is probably the
single most important trick to using Bayesian spam classifiers.

You must NOT get lazy and just delete or ignore mistakes.

> On the 
> other hand maybe that would introduce some selection bias or 
> a bad spam vs ham ratio for training (e.g. maybe I'll get 50 
> ham, 40 spam caught by my ISP, and 10 spam not caught by my 
> ISP (I don't know what the ratio is yet, I only just started 
> using my ISP's filter)).
> 
> Does anyone have any advice on whether these might interfere 
> or how to avoid that interference?  Should I even be using my 
> ISP's filter along with SpamBayes or just SpamBayes by itself?

My bet is that they will not interfere.

Mail that my SpamAssassin tags ****SPAM***** still gets past
SpamBayes WHEN it is obviously not spam (I only let through the
mistakes made by SpamAssassin, so most of the tagged messages
which reach Outlook are NOT spam), and SpamBayes still grabs it
if it is Spam (most of the time).
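
A rough sketch of why one spammy tag need not decide the outcome:
the score is a combination of many token probabilities. The code
below uses Robinson's geometric-mean combination, which is simpler
than the chi-squared combining SpamBayes uses by default, and all
of the token probabilities are made-up values for illustration.

```python
import math


def combine(probs):
    """Robinson's geometric-mean combination of token probabilities.

    Returns a score in [0, 1]; higher means more spam-like.
    """
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    p = 1.0 - math.prod(1.0 - q for q in probs) ** (1.0 / n)  # spam evidence
    h = 1.0 - math.prod(probs) ** (1.0 / n)                   # ham evidence
    if p + h == 0:
        return 0.5
    return (1.0 + (p - h) / (p + h)) / 2.0


TAG = 0.885  # hypothetical probability for the ISP's subject tag

# Tagged message whose other tokens look like ordinary mail.
tagged_ham = combine([TAG, 0.05, 0.10, 0.02, 0.10])

# Tagged message whose other tokens also look spammy.
tagged_spam = combine([TAG, 0.95, 0.90, 0.99, 0.90])

print(f"tagged, otherwise hammy:  {tagged_ham:.2f}")
print(f"tagged, otherwise spammy: {tagged_spam:.2f}")
```

With these made-up numbers the first message scores below 0.5 and
the second well above it, even though both carry the same tag.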

One thing about Spam filters in my experience, as often as
they make mistakes, they catch things correctly that *I*,
a human, would actually misclassify on first naive glance.

(E.g., a mailing from a list that is NOT spam, but where
someone has injected spam into the list, or a message from
a technical "spammer" that I actually wish to see.)