[Spambayes] More on 'Spammer Attempts to Circumvent Bayesian Filter'

Wed Jul 21 06:22:58 CEST 2004

[Richard B Barger]
> ...
> I still speculate that, over a large enough number of users, the longer the
> "normal-seeming" narrative, the more hammy the message appears to their
> individual SpamBayes tokenizers.

I've run experiments scoring messages composed entirely of words
chosen at random from an English dictionary.  The score strongly
tended toward 0.5 (perfectly unsure) as message length increased.  At
small sizes, they appeared hammy, but that turned out to be because
the classifier was picking up systematically hammy header clues saying
(in effect) "oh, Tim's sending email to himself again".  So if someone
else wants to try this, it's a good idea to comment out the line in
classifier.py that calls the header tokenizer.
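
For anyone who wants a feel for why long random-word messages drift toward 0.5, here is a toy sketch. It is not SpamBayes code, and every function name in it is made up for illustration. It assumes chi-squared combining of per-token spam probabilities, and it assumes (purely to make the effect visible) that each random word is equally likely to be a strong ham clue (0.01) or a strong spam clue (0.99). Real dictionary words would mostly be unknown or weak clues, so treat this as a cartoon of the mechanism, not a reproduction of the experiment.

```python
import math
import random


def chi2q(x2, dof):
    """Chance that a chi-squared variate with even `dof` exceeds x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def combined_score(probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    0.0 is solid ham, 1.0 is solid spam, 0.5 is perfectly unsure.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    # Sum logs rather than multiplying, to avoid underflow on long inputs.
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    spam_evidence = 1.0 - chi2q(spam_stat, 2 * n)
    ham_evidence = 1.0 - chi2q(ham_stat, 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0


def mean_deviation(length, trials=200, seed=0):
    """Average distance from 0.5 over simulated random-word messages.

    ASSUMPTION: each word is a 0.01 or 0.99 clue with equal chance.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        probs = [rng.choice((0.01, 0.99)) for _ in range(length)]
        total += abs(combined_score(probs) - 0.5)
    return total / trials


if __name__ == "__main__":
    for length in (5, 20, 50, 200):
        print(length, mean_deviation(length))
```

Under those assumptions, strong evidence piles up on both sides as the message grows, the two evidence terms both approach 1, and they cancel.  A short message has fewer clues, so a lopsided draw (or a few consistently hammy header clues) can still dominate.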

That's not the same as testing "normal-seeming narratives", though.
To test that, you'd have to do two things:

1. Specify an algorithm capable of producing a large number of
normal-seeming narratives, or collect a large number of them from
"real life".

2. Convince a large number of users it's worth their time to test it
against their trained SB classifiers <wink>.

For most people, I expect their SB classifier to be especially tuned
to treat the language and devices of advertising as "spammy".  I know
mine is.  Consequently, long, soft-sell, rambling, "just folks" spam
(written as if sent by, say, a non-insane old acquaintance) has a good
crack at scoring Unsure.  I don't think that makes for effective email
bulk *advertising*, though, and I don't see much of that (spam is
cheap to send per email, but it's not free, and most spam campaigns
die out quickly because they don't repay their costs).