[Spambayes] understanding high false negative rate

Tim Peters tim.one@comcast.net
Fri, 06 Sep 2002 17:05:22 -0400


[Jeremy Hylton]
> I've tried to do some testing with some personal collections of ham
> and spam.  I'm seeing very high false negative rates. 20-30% is
> typical.

That's very high indeed.

> The false positive rate is 0-3%.  (Finally!  I had to scrub
> a bunch of previously unnoticed spam from my inbox.)  Both collections
> have about 1100 messages.

Does this mean you trained on about 1100 of each?

> I'd like to figure out why my false negative rate is so high, but I'm
> not sure what details I should look at to diagnose.  I'm assuming that
> mboxtest.py is basically correct, but it could have bugs.
>
> One possibility is that my ham test set isn't nearly so useful as the
> python-list, since it isn't focused on a single topic.

Heh -- when's the last time you read c.l.py <wink>?  "Python" is a very
strong ham indicator, and that certainly helps.  "wrote:" is an even
stronger ham indicator there, and that helps even more.

> I've got some python email, personal correspondence, questions about my
> Shakespeare web site, and a few email newsletters I get on a regular
> basis.
> I've got receipts from various online order sites, mail from the company
> that manages my student loans, etc.  Maybe the great variety in my
> non-spam email makes it harder to find good discriminators for spam?

Can't guess.  You're in a good position to start adding more headers into
the analysis, though.  For example, an easy start would be to uncomment the
header-counting lines in tokenize() (look for "Anthony").  Likely the most
valuable thing it's missing then is some special parsing and tagging of
Received headers.

> Here's a sample spam distribution from a test run:
>
> Spam distribution for this pair:
> * = 3 items
>   0.00  73 *************************
>   2.50   0
>   5.00   2 *
>   7.50   0
>  10.00   0
>  12.50   1 *
>  15.00   0
>  17.50   1 *
>  20.00   1 *
>  22.50   0
>  25.00   2 *
>  27.50   0
>  30.00   0
>  32.50   0
>  35.00   0
>  37.50   0
>  40.00   0
>  42.50   0
>  45.00   0
>  47.50   0
>  50.00   0
>  52.50   0
>  55.00   0
>  57.50   1 *
>  60.00   0
>  62.50   1 *
>  65.00   0
>  67.50   0
>  70.00   1 *
>  72.50   0
>  75.00   0
>  77.50   0
>  80.00   2 *
>  82.50   2 *
>  85.00   2 *
>  87.50   0
>  90.00   4 **
>  92.50   1 *
>  95.00   5 **
>  97.50 127 *******************************************

So the bulk of your f-n woes come from spam scoring near 0.0.  Good to know.
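
To put rough numbers on that, here is a small sketch that tallies the
histogram quoted above.  The counts are copied from it; the 90.0 cutoff is
only an assumption for illustration, since no cutoff is stated in this
thread.

```python
# Bucket lower bounds (in percent) and counts, copied from the quoted
# histogram.  Buckets with a zero count are omitted.
buckets = {0.0: 73, 5.0: 2, 12.5: 1, 17.5: 1, 20.0: 1, 25.0: 2,
           57.5: 1, 62.5: 1, 70.0: 1, 80.0: 2, 82.5: 2, 85.0: 2,
           90.0: 4, 92.5: 1, 95.0: 5, 97.5: 127}

SPAM_CUTOFF = 90.0  # assumed cutoff, in percent; not taken from the thread

total = sum(buckets.values())
# Spam landing in a bucket below the cutoff counts as a false negative.
missed = sum(n for lo, n in buckets.items() if lo < SPAM_CUTOFF)
near_zero = buckets[0.0]

print("spam in this run:      %d" % total)
print("scored below cutoff:   %d (%.1f%%)" % (missed, 100.0 * missed / total))
print("of those, in 0.00 bin: %d (%.1f%%)" % (near_zero, 100.0 * near_zero / missed))
```

Under that assumed cutoff, 73 of the 89 spam that slip through sit in the
0.00 bucket, so the middle of the range is not where the problem is.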

> And here's a sample false negative.  (I'll quote the report so it
> stands out.)  One thing I don't understand is how the spam probability
> for the message is so low, when there are several high indicators and
> few low indicators.

You're hallucinating.  Let's look:

> > Low prob spam! 1.64654685184e-11
> > /home/jeremy/Mail/spam:242 subject: your web site has been mapped
> > prob('millions') = 0.99
> > prob('skip:= 40') = 0.99
> > prob('"remove"') = 0.99
> > prob('from:email addr:mail') = 0.99
> > prob('email addr:alum') = 0.01
> > prob('status') = 0.01
> > prob('connected') = 0.01
> > prob('returning') = 0.01

Those 8 cancel out completely.  They're the strongest indicators it found in
both directions, and it's exactly as if they didn't exist.  I'll sort the
rest from low to high:

> > prob('notices') = 0.01
> > prob('email addr:mit') = 0.01
> > prob("i'd") = 0.0470418
> > prob('survey') = 0.0850202
> > prob('wide') = 0.0911528
> > prob('added') = 0.133131
> > prob('mark') = 0.136416
> > prob('survey.') = 0.14931
> > prob('current') = 0.152639
> > prob('its') = 0.155044
> > prob('officer') = 0.208406
> > prob('charges') = 0.208406
> > prob('from:email addr:com>') = 0.224056

> > prob('every') = 0.789741
> > prob('free') = 0.818103
> > prob('http1:asp') = 0.88055

So you've got 13 indicators below 0.5, versus 3 above 0.5:  it's
overwhelmingly in favor of ham.
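
The arithmetic can be checked with a short sketch.  It assumes the
Graham-style combining rule, P = prod(p) / (prod(p) + prod(1 - p)); the
function name combine() is made up for illustration, and the clue values
are copied from the report quoted above.

```python
def combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    spam = ham = 1.0
    for p in probs:
        spam *= p          # evidence for spam
        ham *= 1.0 - p     # evidence for ham
    return spam / (spam + ham)

# The eight extreme clues: four at 0.99 and four at 0.01.
extremes = [0.99] * 4 + [0.01] * 4

# The remaining 16 clues, copied from the report.
rest = [0.01, 0.01, 0.0470418, 0.0850202, 0.0911528, 0.133131, 0.136416,
        0.14931, 0.152639, 0.155044, 0.208406, 0.208406, 0.224056,
        0.789741, 0.818103, 0.88055]

print(combine(extremes))         # the eight on their own
print(combine(rest))             # the other 16 on their own
print(combine(extremes + rest))  # all 24 clues together
```

The eight extremes alone come out at essentially 0.5, which is what
"cancel out" means here, and the last two results agree with each other
and land close to the 1.6e-11 shown in the report.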


> >
> > From VM Mon Jul 24 10:05:39 2000
> > Return-Path: <undeliverables@mail.internetseer.com>
> > Message-ID: <0112a1010021870MARS1@mars1.internetseer.com>
> > Status: RO
> > From: "InternetSeer.com" <services@mail.internetseer.com>
> > To: jeremy@alum.mit.edu
> > Subject: Your web site has been mapped
> > Date: 23 Jul 2000 22:10:11 -0400
> >
> > Freewire has added your web site to its map of the World Wide
> > Web.  Freewire will continue to monitor millions of links and web
> > sites every day during its ongoing web survey.
> >
> > If it is important for you to know that your site is connected
> > to the web at all times, Freewire has arranged with
> > InternetSeer.com to notify you when your site does not respond.
> > This means that, AT NO CHARGE; InternetSeer.com will monitor your
> > Web site every hour and send notification to you by email
> > whenever your site is not connected to the Web. There are NO
> > current or future charges associated with this service.
> >
> > To begin your FREE monitoring NOW, activate your account at:
> > http://www.internetseer.com/signup.asp?email=jeremy@alum.mit.edu
> >
> > Mark McLellan
> > Chief Technology Officer
> > Freewire.com
> >
> > Is your web site status important to you? I'd love your
> > comments. If you prefer not to receive any future notices that
> > result from our ongoing survey please let me know by returning
> > this email with the word "remove" in the subject line.
> >
> > =============================================
> > ##Remove: jeremy@alum.mit.edu##

Yuck:  it got two 0.01's from embedding your email address at the bottom
here.
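
A guess at how that happens, sketched from the token names in the report
and not from the real tokenize():  if each piece of an address's domain
becomes its own 'email addr:' token, then your own address hands the
classifier tokens that your ham is full of.  The helper address_tokens()
is hypothetical.

```python
def address_tokens(addr):
    """Hypothetical helper: one 'email addr:' token per domain piece."""
    name, domain = addr.split('@', 1)
    return ['email addr:' + piece for piece in domain.split('.')]

# The recipient address as it appears in the quoted message.
tokens = address_tokens('jeremy@alum.mit.edu')
print(tokens)
```

'email addr:alum' and 'email addr:mit' are the two 0.01 clues in the
report; the sketch also yields an 'edu' token, which is not among the
clues listed there.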