[spambayes-dev] Spambayes is starting not to work due to retaliatory action by spammers

Wed Aug 2 09:45:03 CEST 2006

Dear Spambayes developers,

I've used Spambayes for 2 or 3 years (Outlook add-in) - it has been
excellent.  However, over the last couple of months, it has become
compromised by a particular type of spam that I believe, over time, will
render Spambayes much less effective unless something is done.

I expect you've seen these Spams - at the moment, they are always the
stock-market related ones but I'm sure once others catch on, they will start
to use the same technique.  The start of the email is a picture that looks
like ordinary text but isn't.  All the spam info is in the picture.  The
picture is followed by a whole load of randomly selected words.

There are 2 bad things about this:

1.  These spams are successfully evading Spambayes in some cases.  Firstly
the Spam usually reaches the "possible Spam" folder.  As a result, I am now
spending significant time clearing out the possible spam folder whereas 2 or
3 months ago I wasn't.   Secondly, the odd spam is actually managing to get
through as ham.  This is the first time this has happened ever.

2.  Because I obviously mark these as Spam, all the randomly generated words
in each spam email have their spam likelihood scores increased.  The result
of this is that over time, the spam-scores for loads of perfectly
non-spam-like words are being gradually increased.  The more this goes on,
the more these "ham words" are being compromised.  I suspect that this is
why, to begin with, I only saw a few of these stock market emails, but now I'm
seeing loads and over the last 2 or 3 weeks some have started to come in as
ham.  I fear that the long term effect of this will be to spoil spambayes
bigtime.
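
The drift described in point 2 can be illustrated with a minimal sketch.
It uses a simplified per-word estimate built only from training counts,
which is an assumption and not Spambayes's actual scoring.  The function
name spam_prob and all the counts are hypothetical.

```python
def spam_prob(spam_hits, ham_hits, nspam, nham):
    """Simplified per-word spam probability from training counts.

    Illustration only: this is NOT the formula Spambayes uses.
    """
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    return spam_ratio / (spam_ratio + ham_ratio)


# Hypothetical ordinary word: seen in 20 of 1000 ham and 1 of 1000 spam.
before = spam_prob(1, 20, 1000, 1000)   # roughly 0.05

# 200 more spams are trained, and 10 of them happen to contain the
# word as random filler.  Nothing about the ham has changed.
after = spam_prob(11, 20, 1200, 1000)   # roughly 0.31

print(f"before: {before:.3f}  after: {after:.3f}")
```

Under these made-up numbers, a word that was a strong ham clue drifts
towards neutral purely because it appeared as filler in trained spam.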

I know that Spambayes has a deep-rooted principle of using only the bayesian
algorithm and I wouldn't suggest changing that.  However, I am wondering if
it might be possible to analyse these messages and include some parts of the
hidden text relating to the picture that are not presently included in the
bayesian statistics.  My thesis is this - I rarely get pictures in my email
that are not just attachments - virtually all pictures that are embedded
into the mail seem to be spam.  So if there is some token or tag in the
email that represents the embedded picture that can be included in the
bayesian analysis, this might fix the problem.
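
As a rough sketch of what such a token might look like, the following
uses only Python's standard email module.  It is not Spambayes's
tokenizer; the function name image_tokens and the token strings are
invented for illustration, and treating a Content-ID or an inline
disposition as "embedded" is an assumption about how these mails are
built.

```python
from email import message_from_string


def image_tokens(raw_message):
    """Yield synthetic tokens describing image parts of a message.

    Hypothetical helper: token names are made up for this example.
    """
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        yield "image:subtype:" + part.get_content_subtype()
        # Assumption: an image with a Content-ID (referenced from HTML
        # via cid:) or an inline disposition is displayed in the body,
        # while other images are ordinary attachments.
        disposition = part.get_content_disposition()
        if part.get("Content-ID") or disposition == "inline":
            yield "image:embedded"
        else:
            yield "image:attachment"
```

Tokens like these would be trained and scored alongside the ordinary
word tokens, so the classifier itself stays purely statistical.  If
embedded pictures really are rare in ham, the "image:embedded" token
would pick up a high spam score without any hand-written rule.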

I hope that this suggestion is useful - I certainly fear for the future of
Spambayes if this new spam threat is not dealt with....

thanks for reading,

James Masters.