[Spambayes] understanding high false negative rate
Tim Peters
tim.one@comcast.net
Fri, 06 Sep 2002 17:05:22 -0400
[Jeremy Hylton]
> I've tried to do some testing with some personal collections of ham
> and spam. I'm seeing very high false negative rates. 20-30% is
> typical.
That's very high indeed.
> The false positive rate is 0-3%. (Finally! I had to scrub
> a bunch of previously unnoticed spam from my inbox.) Both collections
> have about 1100 messages.
Does this mean you trained on about 1100 of each?
> I'd like to figure out why my false negative rate is so high, but I'm
> not sure what details I should look at to diagnose. I'm assuming that
> mboxtest.py is basically correct, but it could have bugs.
>
> One possibility is that my ham test set isn't nearly so useful as the
> python-list, since it isn't focused on a single topic.
Heh -- when's the last time you read c.l.py <wink>? "Python" is a very
strong ham indicator, and that certainly helps. "wrote:" is an even
stronger ham indicator there, and that helps even more.
> I've got some python email, personal correspondence, questions about my
> Shakespeare web site, and a few email newsletters I get on a regular
> basis.
> I've got receipts from various online order sites, mail from the company
> that manages my student loans, etc. Maybe the great variety in my
> non-spam email makes it harder to find good discriminators for spam?
Can't guess. You're in a good position to start adding more headers into
the analysis, though. For example, an easy start would be to uncomment the
header-counting lines in tokenize() (look for "Anthony"). Likely the most
valuable thing it's missing then is some special parsing and tagging of
Received headers.
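
For concreteness, here is a rough sketch of what header counting could look like. It is not the code that's commented out in tokenize(); the function name header_count_tokens and the "header:name:count" token format are assumptions made up for illustration. It emits one token per distinct header name, tagged with how many times that header appears:

```python
import email
from collections import Counter

def header_count_tokens(msg):
    """Yield one token per distinct header name, tagged with its count.

    Hypothetical helper: the token format is an assumption for
    illustration, not necessarily what the real tokenizer emits.
    """
    # msg.keys() lists every header name, duplicates included.
    counts = Counter(name.lower() for name in msg.keys())
    for name, n in sorted(counts.items()):
        yield "header:%s:%d" % (name, n)

raw = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
msg = email.message_from_string(raw)
tokens = list(header_count_tokens(msg))
print(tokens)
```

The idea is that a message with, say, an unusual number of Received lines gets a token of its own, which the classifier can then learn is typical of ham or of spam.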
> Here's a sample spam distribution from a test run:
>
> Spam distribution for this pair:
> * = 3 items
> 0.00 73 *************************
> 2.50 0
> 5.00 2 *
> 7.50 0
> 10.00 0
> 12.50 1 *
> 15.00 0
> 17.50 1 *
> 20.00 1 *
> 22.50 0
> 25.00 2 *
> 27.50 0
> 30.00 0
> 32.50 0
> 35.00 0
> 37.50 0
> 40.00 0
> 42.50 0
> 45.00 0
> 47.50 0
> 50.00 0
> 52.50 0
> 55.00 0
> 57.50 1 *
> 60.00 0
> 62.50 1 *
> 65.00 0
> 67.50 0
> 70.00 1 *
> 72.50 0
> 75.00 0
> 77.50 0
> 80.00 2 *
> 82.50 2 *
> 85.00 2 *
> 87.50 0
> 90.00 4 **
> 92.50 1 *
> 95.00 5 **
> 97.50 127 *******************************************
So the bulk of your f-n woes come from spam scoring near 0.0. Good to know.
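
A quick tally of the histogram backs that up. This is only a sketch: the bucket counts are copied from the report above, and the 50.0 midpoint is an arbitrary dividing line picked for illustration, not necessarily the cutoff the test driver uses.

```python
# Non-empty buckets from the quoted histogram: lower bound -> spam count.
buckets = {
    0.0: 73, 5.0: 2, 12.5: 1, 17.5: 1, 20.0: 1, 25.0: 2,
    57.5: 1, 62.5: 1, 70.0: 1, 80.0: 2, 82.5: 2, 85.0: 2,
    90.0: 4, 92.5: 1, 95.0: 5, 97.5: 127,
}

total = sum(buckets.values())
lowest = buckets[0.0]
# Spam scoring below an illustrative midpoint of 50.0 (an assumption).
below_mid = sum(n for lower, n in buckets.items() if lower < 50.0)

print("total spam scored:      %d" % total)
print("in the lowest bucket:   %d (%.1f%%)" % (lowest, 100.0 * lowest / total))
print("below 50.0 altogether:  %d" % below_mid)
```

All but a handful of the low-scoring spam sit in the very first bucket, so the scores are bimodal rather than smeared across the middle.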
> And here's a sample false negative. (I'll quote the report so it
> stands out.) One thing I don't understand is how the spam probability
> for the message is so low, when there are several high indicators and
> few low indicators.
You're hallucinating. Let's look:
> > Low prob spam! 1.64654685184e-11
> > /home/jeremy/Mail/spam:242 subject: your web site has been mapped
> > prob('millions') = 0.99
> > prob('skip:= 40') = 0.99
> > prob('"remove"') = 0.99
> > prob('from:email addr:mail') = 0.99
> > prob('email addr:alum') = 0.01
> > prob('status') = 0.01
> > prob('connected') = 0.01
> > prob('returning') = 0.01
Those 8 cancel out completely. They're the strongest indicators it found in
both directions, and it's exactly as if they didn't exist. I'll sort the
rest from low to high:
> > prob('notices') = 0.01
> > prob('email addr:mit') = 0.01
> > prob("i'd") = 0.0470418
> > prob('survey') = 0.0850202
> > prob('wide') = 0.0911528
> > prob('added') = 0.133131
> > prob('mark') = 0.136416
> > prob('survey.') = 0.14931
> > prob('current') = 0.152639
> > prob('its') = 0.155044
> > prob('officer') = 0.208406
> > prob('charges') = 0.208406
> > prob('from:email addr:com>') = 0.224056
> > prob('every') = 0.789741
> > prob('free') = 0.818103
> > prob('http1:asp') = 0.88055
So you've got 13 indicators below 0.5, versus 3 above 0.5: it's
overwhelmingly in favor of ham.
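
To make the arithmetic concrete, here is a minimal sketch. It assumes the Graham-style combining rule, P = prod(p) / (prod(p) + prod(1-p)); that is an assumption about what the scoring code does, and the function name graham_combine is made up for illustration. The probabilities are copied from the report above.

```python
from math import prod  # Python 3.8+

# The eight strongest clues: four at 0.99 and four at 0.01.
canceling = [0.99] * 4 + [0.01] * 4

# The remaining sixteen clues from the report.
rest = [
    0.01, 0.01, 0.0470418, 0.0850202, 0.0911528, 0.133131,
    0.136416, 0.14931, 0.152639, 0.155044, 0.208406, 0.208406,
    0.224056, 0.789741, 0.818103, 0.88055,
]

def graham_combine(probs):
    """Combine per-word spam probabilities (assumed Graham-style rule)."""
    p = prod(probs)                    # evidence for spam
    q = prod(1.0 - x for x in probs)   # evidence for ham
    return p / (p + q)

score_rest = graham_combine(rest)
score_all = graham_combine(canceling + rest)
print(score_rest)
print(score_all)
```

The four 0.99s and four 0.01s multiply both products by the same factor, 0.99**4 * 0.01**4, so it divides out and the two scores agree up to floating-point rounding. Under this assumed rule, the remaining sixteen clues alone give a score on the order of 1.6e-11, in line with the "Low prob spam!" figure in the report.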
> >
> > From VM Mon Jul 24 10:05:39 2000
> > Return-Path: <undeliverables@mail.internetseer.com>
> > Message-ID: <0112a1010021870MARS1@mars1.internetseer.com>
> > Status: RO
> > From: "InternetSeer.com" <services@mail.internetseer.com>
> > To: jeremy@alum.mit.edu
> > Subject: Your web site has been mapped
> > Date: 23 Jul 2000 22:10:11 -0400
> >
> > Freewire has added your web site to its map of the World Wide
> > Web. Freewire will continue to monitor millions of links and web
> > sites every day during its ongoing web survey.
> >
> > If it is important for you to know that your site is connected
> > to the web at all times, Freewire has arranged with
> > InternetSeer.com to notify you when your site does not respond.
> > This means that, AT NO CHARGE; InternetSeer.com will monitor your
> > Web site every hour and send notification to you by email
> > whenever your site is not connected to the Web. There are NO
> > current or future charges associated with this service.
> >
> > To begin your FREE monitoring NOW, activate your account at:
> > http://www.internetseer.com/signup.asp?email=jeremy@alum.mit.edu
> >
> > Mark McLellan
> > Chief Technology Officer
> > Freewire.com
> >
> > Is your web site status important to you? I'd love your
> > comments. If you prefer not to receive any future notices that
> > result from our ongoing survey please let me know by returning
> > this email with the word "remove" in the subject line.
> >
> > =============================================
> > ##Remove: jeremy@alum.mit.edu##
Yuck: it got two 0.01's from embedding your email address at the bottom
here.