[Spambayes] Spamvolution

Tim Peters tim.one@comcast.net
Fri, 20 Sep 2002 12:30:16 -0400


[Charles Cazabon]
> Sorry, Tim, but that's not correct.  Bruce was my cubicle-mate
> while he was collecting most of that spam, and "bfsmedia" is the company
> that owns the machines that mail was collected on (they're about ten
> meters from where I'm currently sitting).  The source of this message
> was actually pool-209-128-140-231.gent.ipa.net ([209.128.140.231]).

Oh, fudge.  I was afraid of that:  even though I'm ignoring almost all the
header lines (and so running a crippled version of the algorithm), this
bogus clue about my mixed-source corpora is still sneaking in.
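
For readers who want something concrete, here is a minimal sketch of what
"ignoring almost all the header lines" could look like.  It is not the
project's actual tokenizer:  the function name tokenize, the word regex, and
the choice to keep only the Subject line (prefixed with "subject:") are
assumptions made for illustration.

```python
import email
import re

def tokenize(raw_message):
    """Yield tokens from the Subject line and the body only.

    Hypothetical sketch: every other header (Received, To, ...) is
    skipped, so host names appearing there never become clues.
    """
    msg = email.message_from_string(raw_message)
    # Subject words keep their case and get a 'subject:' prefix.
    for word in (msg.get("Subject") or "").split():
        yield "subject:" + word
    # Assumption: only a plain, non-multipart body is handled here.
    body = msg.get_payload()
    if isinstance(body, str):
        for word in re.findall(r"[A-Za-z']+", body):
            yield word.lower()
```

With a tokenizer shaped like this, a host name that shows up only in a
skipped header cannot influence the score.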

Oops!  It's actually not!  I am ignoring the header lines in which
bfsmedia appears -- that's a correlation I had noticed by *eyeball*.  The
high-prob spam words in that message were really

prob('competition') = 0.858033
prob('grand') = 0.863717
prob('qualified') = 0.869686
prob('air') = 0.869976
prob('outstanding') = 0.887064
prob('compete') = 0.928292
prob('country') = 0.941092
prob('subject:Great') = 0.945813
prob('outdoor') = 0.99
prob('finalists') = 0.99
prob('subject:Outdoor') = 0.99
prob('eagle') = 0.99
prob('retriever') = 0.99
prob('allison') = 0.99
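
To give a feel for how strongly clues like these pull toward spam, here is a
small sketch.  The combining rule is an assumption:  this message doesn't
say which one was in use, so the sketch borrows the formula from Paul
Graham's "A Plan for Spam" essay, and the name combine is made up for
illustration.  Only the spam-side clues listed above are fed in, so it does
not reproduce the score the message actually got (it was scored as ham, so
other clues must have pulled the other way).

```python
from math import prod  # available in Python 3.8+

def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)

# The spam-side clue probabilities listed above, in the same order.
clues = [
    0.858033, 0.863717, 0.869686, 0.869976, 0.887064,
    0.928292, 0.941092, 0.945813,
    0.99, 0.99, 0.99, 0.99, 0.99, 0.99,
]

score = combine(clues)
print(score)
```

Under this rule a 0.99 clue and a 0.01 clue cancel exactly, which is why a
handful of strong ham words can outweigh a list like the one above.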

> I'm not so sure this false negative isn't actually a true negative, or at
> worst a misaddressed piece of mail.  It doesn't appear to have
> been trying to sell anything.

It sure didn't.  The collection of To: addresses was spammish in its sheer
bulk, but very hammish in the wide variety of target domains, and in having
a "real name" to go along with each email address.  For the life of me I
can't find one of Bruce's bait addresses in it, either.

Tester's dilemma:  do I take this out of the spam set or not?  I removed two
others in the past, like output from a cron job Bruce apparently arranged to
mail to himself from one of his spam collection machines.  This one is much
muddier than those, though.  Screw it:  common sense rules.  It's ham.