[Spambayes] Messages not moving / Sneaky HTML spam

Gray Norton gray at stanfordalumni.org
Mon Oct 27 19:22:03 EST 2003


> -----Original Message-----
> From: Ryan Malayter [mailto:rmalayter at bai.org]
> Sent: Monday, October 27, 2003 3:38 PM
> To: Gray Norton; spambayes at python.org
> Subject: RE: [Spambayes] Messages not moving / Sneaky HTML spam

Hi, Ryan -

Thanks for the quick response...

> This is a toughie. I've seen a couple of spams that have portions of
> popular AP news articles in white-on-white text.

<snip>

> If your training base includes a lot of "news" ham, it follows that
> spams which contain a news article in white-on-white might contain a
> few of your "innocent" words and sneak past SpamBayes.

Interestingly enough, the white-on-white text in the spam I'm getting
doesn't seem to be from news sources (and my training set includes no
news content). These spammers seem to be taking their text from more
obscure sources -- essays, academic papers, fiction, and the like -- and
a good percentage of the words are not in my db at all.

I'm no expert in Bayesian analysis, or in the specific techniques
employed by Spambayes, but my assumption was that these "new"
(neither-ham-nor-spam) tokens were enough to lower the spam score and
let the message through. Is this plausible, or are entirely new tokens
not considered in calculating the score?
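To make the question concrete, here is a minimal sketch of one common way a Bayesian-style filter can treat tokens it has never seen: Robinson-style smoothing, in which a token with no training history falls back to a neutral prior, combined with a cutoff that ignores near-neutral tokens. The function names, the prior x=0.5, the strength s=0.45, and the 0.1 cutoff are illustrative assumptions, not values confirmed anywhere in this thread, and this is not presented as Spambayes' actual code.

```python
def smoothed_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Robinson-style smoothed spam probability for a single token.

    spam_count / ham_count: trained messages of each kind containing the token.
    nspam / nham: total trained spam / ham messages (assumed > 0).
    s, x: assumed prior strength and prior probability (hypothetical values).
    """
    n = spam_count + ham_count
    if n == 0:
        # Never-seen token: no evidence, so return the neutral prior.
        return x
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward the prior; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


def is_significant(prob, min_strength=0.1, x=0.5):
    """True if the token is far enough from neutral to affect the score."""
    return abs(prob - x) >= min_strength


unknown = smoothed_prob(0, 0, nspam=100, nham=100)
spammy = smoothed_prob(10, 0, nspam=100, nham=100)
```

Under these assumptions an unknown token sits exactly at the neutral prior and is dropped by the cutoff, so it neither raises nor lowers the score by itself. Filler text could still matter indirectly, for example if scoring considers only a limited number of tokens or if some of the filler words happen to be known ham words.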

It seems to me that some sort of pre-tokenization filter might be
effective in catching these messages, since there may be telltale signs
of the technique in the markup that are discarded once the message has
been reduced to tokens. Would the Spambayes team object in principle to
such a step?
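As a rough illustration of the kind of pre-tokenization check meant here, the sketch below scans the HTML for text whose <font color> matches the <body bgcolor>, which is the white-on-white trick described above. It uses only the standard library's html.parser; the class name, the helper functions, the tiny color table, and the assumed white default background are all hypothetical, and a real filter would also need to handle CSS styles, tables, and near-matching colors.

```python
from html.parser import HTMLParser

# Tiny, deliberately incomplete table of named colors (assumption).
NAMED_COLORS = {"white": "#ffffff", "black": "#000000"}


def normalize_color(value):
    """Lower-case a color attribute and map known names to hex."""
    if value is None:
        return None
    value = value.strip().lower()
    return NAMED_COLORS.get(value, value)


class HiddenTextDetector(HTMLParser):
    """Counts characters rendered in the same color as the background."""

    def __init__(self):
        super().__init__()
        self.background = "#ffffff"  # assumed default when no bgcolor is set
        self.color_stack = []        # one entry per open <font> tag
        self.hidden_chars = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = normalize_color(attrs["bgcolor"])
        elif tag == "font":
            color = normalize_color(attrs.get("color"))
            if color is None and self.color_stack:
                color = self.color_stack[-1]  # inherit from enclosing <font>
            self.color_stack.append(color)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        if self.color_stack and self.color_stack[-1] == self.background:
            self.hidden_chars += len(data.strip())


def count_hidden_chars(html):
    """Return the number of characters of text that match the background."""
    detector = HiddenTextDetector()
    detector.feed(html)
    detector.close()
    return detector.hidden_chars


sample = ('<html><body bgcolor="#FFFFFF">'
          '<font color="white">hidden essay text</font>'
          '<font color="#000000">Buy now</font></body></html>')
hidden = count_hidden_chars(sample)
```

The resulting count could be turned into a synthetic token (for instance, one that records that hidden text was present), so the classifier learns from it like any other clue rather than the filter rejecting the message outright.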

Thanks again,

Gray




