[Spambayes] Spamvolution
Tim Peters
tim.one@comcast.net
Fri, 20 Sep 2002 12:30:16 -0400
[Charles Cazabon]
> Sorry, Tim, but that's not correct. Bruce was my cubicle-mate
> while he was collecting most of that spam, and "bfsmedia" is the company
> that owns the machines that mail was collected on (they're about ten
> meters from where I'm currently sitting). The source of this message
> was actually pool-209-128-140-231.gent.ipa.net ([209.128.140.231]).
Oh, fudge. I was afraid of that: even though I'm ignoring almost all the
header lines (and so running a crippled version of the algorithm), this
bogus clue about my mixed-source corpora is still sneaking in.
Oops! It's actually not! I'm actually ignoring the header lines in which
bfsmedia appears -- that's a correlation I had noticed by *eyeball*. The
high-prob spam words in that message were really:
prob('competition') = 0.858033
prob('grand') = 0.863717
prob('qualified') = 0.869686
prob('air') = 0.869976
prob('outstanding') = 0.887064
prob('compete') = 0.928292
prob('country') = 0.941092
prob('subject:Great') = 0.945813
prob('outdoor') = 0.99
prob('finalists') = 0.99
prob('subject:Outdoor') = 0.99
prob('eagle') = 0.99
prob('retriever') = 0.99
prob('allison') = 0.99
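
To get a feel for how much weight clues like these carry, here is a minimal
sketch that combines them Graham-style (product of the probabilities against
product of their complements). That the tester combined clues exactly this way
is an assumption, as are the names combine() and clues; only the numbers come
from the list above.

```python
from math import prod

def combine(probs):
    """Graham-style combination: P = prod(p) / (prod(p) + prod(1 - p)).

    Assumption: this is an illustrative combining rule, not necessarily
    the one the tester was running when the message was scored.
    """
    probs = list(probs)
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# The high-prob spam clues quoted in the message.
clues = {
    'competition': 0.858033,
    'grand': 0.863717,
    'qualified': 0.869686,
    'air': 0.869976,
    'outstanding': 0.887064,
    'compete': 0.928292,
    'country': 0.941092,
    'subject:Great': 0.945813,
    'outdoor': 0.99,
    'finalists': 0.99,
    'subject:Outdoor': 0.99,
    'eagle': 0.99,
    'retriever': 0.99,
    'allison': 0.99,
}

score = combine(clues.values())
print("combined score from these clues alone: %.6f" % score)
```

On these clues alone the combined score lands very close to 1.0. A real
verdict would also depend on any low-probability clues in the message, which
aren't shown here.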
> I'm not so sure this false negative isn't actually a true negative, or at
> worst a misaddressed piece of mail. It doesn't appear to have
> been trying to sell anything.
It sure didn't. The collection of To: addresses was spammish in its sheer
bulk, but very hammish in the wide variety of target domains, and in having
a "real name" to go along with each email address. For the life of me I
can't find one of Bruce's bait addresses in it, either.
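
The three To: properties just mentioned (sheer bulk, variety of domains, real
names) are easy to eyeball in code. A rough sketch, assuming the header is
available as a single string; summarize_recipients() and the example addresses
are made up for illustration and are not part of the tester.

```python
from email.utils import getaddresses

def summarize_recipients(to_header):
    """Count recipients, distinct domains, and addresses carrying a real name."""
    # getaddresses() returns (realname, address) pairs; drop unparsable entries.
    pairs = [(name, addr) for name, addr in getaddresses([to_header]) if addr]
    domains = {addr.rsplit('@', 1)[1].lower()
               for _, addr in pairs if '@' in addr}
    named = sum(1 for name, _ in pairs if name)
    return {'count': len(pairs), 'domains': len(domains), 'named': named}

# Hypothetical header, not taken from the message under discussion.
example = ('Ann Example <ann@example.com>, '
           'Bob Sample <bob@example.org>, '
           'carol@example.net')
summary = summarize_recipients(example)
print(summary)
```

A large count looks spammish, while many distinct domains and a high share of
named addresses look hammish, which is the mix described above.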
Tester's dilemma: do I take this out of the spam set or not? I removed two
others in the past, like output from a cron job Bruce apparently arranged to
mail to himself from one of his spam collection machines. This one is much
muddier than those, though. Screw it: common sense rules. It's ham.