[spambayes-dev] message subject filtering

Tue Aug 31 19:38:38 CEST 2004

I'm not a programmer.
I have asked a similar question before, but the mounting volume of spam has
made me revise it significantly, so I hope it is worth reconsidering.
A lot of spam shows:

* Ungrammatical and/or irrelevant wording
* Random words
* Gibberish words
* Deliberately weird or obscure punctuations

Since this is true of the header as well as the text body, this could
potentially reduce the load on the filter.  Random words not seen before
seem to let spam through more easily.  The presence of these features (I
don't know for sure whether they fall under the definition of tokens) is
therefore a high-probability signal.  Is this what's new: error signals
computed from the entire message, or a substantial subset of it?

I also note that spam outnumbers ham by up to 100 to one, so header
filtering seems a good way of throwing up warnings.
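
To make the idea concrete, here is a rough sketch in Python of how a few of
the features listed above (gibberish words, odd punctuation) might be turned
into extra tokens for a subject line.  This is only an illustration under my
own assumptions: the function names, the thresholds and the token strings
such as "subject:gibberish-word" are hypothetical and are not taken from
spambayes, and the checks are deliberately crude.

```python
import re
import string

# Hypothetical helper: treats 'y' as a vowel to avoid flagging words like "rhythm".
VOWELS = set("aeiouy")


def looks_gibberish(word):
    """Crude guess: a longer alphabetic word with no vowels or a long consonant run."""
    w = word.lower()
    if not w.isalpha() or len(w) < 4:
        return False
    if not any(c in VOWELS for c in w):
        return True
    # Five or more consonants in a row is rare in ordinary English words.
    return re.search(r"[^aeiouy]{5,}", w) is not None


def subject_signals(subject):
    """Return a list of made-up token strings describing odd features of a subject."""
    tokens = []
    words = re.findall(r"[A-Za-z]+", subject)
    if any(looks_gibberish(w) for w in words):
        tokens.append("subject:gibberish-word")
    # Punctuation wedged between letters, e.g. "V.i.a.g.r.a" (apostrophes and hyphens allowed).
    if re.search(r"[A-Za-z][^\sA-Za-z0-9'-][A-Za-z]", subject):
        tokens.append("subject:punct-inside-word")
    # Arbitrary threshold: more than 20% of the characters are punctuation.
    punct = sum(1 for c in subject if c in string.punctuation)
    if subject and punct / len(subject) > 0.2:
        tokens.append("subject:heavy-punctuation")
    return tokens


if __name__ == "__main__":
    print(subject_signals("Meeting notes for Tuesday"))
    print(subject_signals("V.i.a.g.r.a xqzvkt!!!"))
```

Tokens like these would not decide anything by themselves; the point is that
a trained filter could learn from the user's own mail how strongly each one
points to spam, so an occasional false alarm on an unusual but legitimate
word need not matter much.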

And invariably the text body contains the web address of the seller, so a
web address of itself is a giveaway.
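
As a small illustration of the web-address point, the following Python sketch
pulls anything that looks like a web address out of a message body and emits
a single marker token when one is found.  The pattern is a simplification of
my own, and the names find_urls, url_tokens and "body:has-url" are
hypothetical, not part of spambayes.

```python
import re

# Simplified pattern: "http://", "https://" or "www." followed by non-space characters.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def find_urls(body):
    """Return the web addresses found in the text, minus trailing punctuation."""
    return [match.rstrip(".,;:!?)") for match in URL_RE.findall(body)]


def url_tokens(body):
    """Return a made-up marker token if the body contains at least one web address."""
    return ["body:has-url"] if find_urls(body) else []


if __name__ == "__main__":
    print(find_urls("Buy now at http://example.com/offer today"))
    print(url_tokens("no links here"))
```

Since ordinary mail can carry web addresses too (in signatures, for example),
a token like this is probably best treated as one clue to be weighed along
with the others rather than as a verdict.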

I am fast at identifying spam by the header alone; using the above
observations, I reckon I spot 90% plus in a blink.  However, it's still a
pain rubbing them out.

It seems to me that applying rules based on the above would be a more
sophisticated way of developing spambayes.
I think the analysis of text would then focus in better on the more subtle
forms of spam, using the tokens to greater effect.  Apologies if any of this
is rubbish or goes against the theories.

Kind regards,
John Moriarty
(+353) (0)87 2833 530
www.helimodels.com