spam classification breaker

Tim Peters tim.one at comcast.net
Sun Feb 8 01:37:41 EST 2004


[Robin Becker]
> ....OK I guess I'm trying to get at the following hand waving argument.
> Since most people agree about what is ham or spam there must be a
> general recognizer for each.

I believe the antecedent is false, so the conclusion doesn't follow.  Most
reports I've seen say that people disagree about the ham-vs-spam distinction
on 3-5% of messages.  Since the error rates of a personal classifier like
SpamBayes are typically much lower than that, a 3-5% disagreement rate is
highly significant.

> My question is then, is  whether it's
> possible to define a camouflage mechanism that turns ham into spam or
> vice versa. Most people reading a newspaper article would classify it as
> spam.

Eh?

>       If I insert a short  ad v ert into the middle the quick
> scan process is gone, but     I might be able if everything is
> aet up correctly     to get   a forbidden word
> set into the text in plain si g ht even
> though it's specifically   fo r bidden by your
> all singing and dancing     B a yesian analyser. It is well known
> that word/space runs are very distracting which is why printers
> have long tried to eliminate them.

Most spam obfuscation relies on HTML tricks:  hiding the words in the source
encoding while arranging to get them *rendered* legibly anyway.  Filters can
deobfuscate, though.  Spam that *renders* in obfuscated ways is less
effective as a sales tool for exactly that reason.  If printers find that
people don't want to read text with vertical rivers of whitespace, are people
more likely to read that kind of stuff if it's trying to sell them something
they don't even want <wink>?  The point of spam isn't to be seen, it's to
sell product.

BTW, rule-based filters may go ga-ga over a single "forbidden" word, but
Bayesian filters are a preponderance-of-evidence approach -- no single
feature is strong enough on its own to drive the decision.  If they want to
sell you something, they have to give you an idea of what it is, praise its
virtues, and provide a way for you to get back to them with your money.  All
of those selling necessities generate distinctive features.
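
The preponderance-of-evidence idea can be sketched with a toy combiner.  This
is a plain naive-Bayes sum of log-odds with equal priors, not the combining
scheme SpamBayes actually uses, and every per-token probability below is a
made-up number.  Clamping each probability to [0.01, 0.99] is an assumption
of this sketch; it is what keeps any one token from deciding the outcome.

```python
import math


def combined_spam_probability(token_probs, floor=0.01, ceiling=0.99):
    """Combine per-token spam probabilities by summing their log-odds.

    Toy naive-Bayes combiner with equal priors (illustrative only).
    Each probability is clamped to [floor, ceiling] so that no single
    token can contribute unbounded evidence.
    """
    log_odds = 0.0
    for p in token_probs:
        p = min(max(p, floor), ceiling)
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # One "forbidden" word surrounded by six ordinary, mildly hammy tokens.
    one_bad_word = [0.99] + [0.2] * 6
    # A sales pitch: product name, praise, price, and a way to pay.
    sales_pitch = [0.9, 0.85, 0.8, 0.9, 0.7]
    print(combined_spam_probability(one_bad_word))
    print(combined_spam_probability(sales_pitch))
```

With these hypothetical inputs the lone forbidden word is outvoted by the
hammy tokens around it, while several moderately spammy features together
push the score close to 1.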

> I don't believe a small cost will kill all spam;

Neither do I.  The filter I've worked on doesn't aim at reducing spam, just
at shuffling it out of your inbox (at which it's very effective).

> every day I get large amounts of paper adverts, flyers, business cards
> etc etc. These have real cost, but presumably are sufficiently market
> oriented that they pay for themselves.

I'm betting you get flyers from businesses in your geographic area, and
business cards from people you meet.  That's targeted marketing of a simple
sort, but, even so, far beyond the targeting most spam does now.

> Putting a cost on email will just reduce the volume of spam.

Possibly -- most people argue that way.  I don't know.  It's conceivable to
me that the universal adoption of something like SpamBayes may even increase
the volume of spam.  If messages you don't want to see are reliably filtered
out of your inbox, and messages you do want to see reliably appear in your
inbox, then the things most people call spam that are ham to you will get
seen by you more effectively than if you first had to wade through 99.9% of
the stuff you didn't want to see -- and that may actually increase response
rate.

IOW, it's neither cost alone nor response rate alone that feeds into this.
Net profit is what drives spammers in the end, and that trades off several
kinds of input.  For example, spam volume may even increase if cost goes up
and response rate goes down, if the mix of people who respond can be nudged
toward people willing to pay more for what's being sold.
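
To make the trade-off concrete, here is a back-of-the-envelope sketch.  The
model (profit = volume * (response rate * revenue per response - cost per
message)) and every number in it are assumptions chosen for illustration, not
measurements of any real spam campaign.

```python
def net_profit(volume, cost_per_message, response_rate, revenue_per_response):
    """Toy model: each message costs something to send and occasionally pays."""
    margin_per_message = response_rate * revenue_per_response - cost_per_message
    return volume * margin_per_message


if __name__ == "__main__":
    # Hypothetical baseline: cheap to send, tiny response rate, cheap product.
    before = net_profit(1_000_000, 0.0001, 0.00001, 20.0)
    # Hypothetical change: 5x the sending cost, half the response rate, but
    # the people who do respond spend 10x as much.
    after = net_profit(1_000_000, 0.0005, 0.000005, 200.0)
    print(before, after)
```

In this made-up scenario the margin per message rises even though cost went
up and response rate went down, so the incentive is to send more mail, not
less.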




