[Spambayes] RE: Trapping Spam messages that contain images...

Tue Oct 19 01:45:00 CEST 2004

I am using the Spambayes Outlook plug-in, and I have found Spambayes to be a
saviour in the fight against Spam mail. 

However, these days I am receiving a new kind of Spam that sneaks through my
defences and that Spambayes cannot trap. These Spam messages basically look
like normal emails, containing randomly generated sentences, but with no
Spam-style phrases to train on, like "viagra" or "cheap software". Each email
contains a reference to an image on a remote site, with the image itself
containing the advertising text. In my case "Viagra $1.39", or whatever the
going price is that day.

Spambayes rarely traps these messages as "Junk E-Mail" and quite often
doesn't even trap them as "Junk Suspects", because the Spam score on these
messages ranges anywhere between 0% and 100%. To reduce the impact of the
problem I've decreased my "Certain Spam" threshold right down to 20%, but
sometimes these messages still come through with a lower Spam score. Plus,
by setting my "Certain Spam" threshold so low, I am running the risk that
Ham could be classified as Spam.

Anyhow, I had an idea for trapping this type of message that I thought I
might put out for discussion. It would probably involve quite a level of
effort, but I believe it would close a loophole that Spam senders are
currently using. Basically, it would involve some additional functionality
to allow OCR processing of images that are referenced in emails.

When an email is received, if there are any HTML references to images, or
images actually contained in the email, then the image(s) would be retrieved
and processed via an OCR algorithm. Then the resulting text (if any) could
be checked for Spam in the same way Spambayes currently checks message text.
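
A rough sketch of that flow in Python (the language Spambayes is written in)
might look like the following. It uses only the standard library to find
remote image references and attached images. The names ImgSrcCollector,
find_images and ocr_text_for are made up for illustration and are not part of
Spambayes, and the fetch and ocr callables are placeholders the caller would
have to supply, since no particular downloader or OCR engine is assumed here.

```python
from email import message_from_string
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML body."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_images(msg):
    """Return (remote_urls, attached_image_bytes) for a parsed message."""
    remote, attached = [], []
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            collector = ImgSrcCollector()
            collector.feed(payload.decode(charset, errors="replace"))
            # Keep only references to images hosted on a remote site.
            remote.extend(
                src for src in collector.sources
                if src.lower().startswith(("http://", "https://"))
            )
        elif part.get_content_maintype() == "image":
            attached.append(part.get_payload(decode=True))
    return remote, attached


def ocr_text_for(msg, fetch, ocr):
    """Gather OCR text for every image in msg.

    fetch(url) -> bytes and ocr(image_bytes) -> str are hypothetical,
    caller-supplied callables; nothing here performs real downloads or OCR.
    """
    remote, attached = find_images(msg)
    images = attached + [fetch(url) for url in remote]
    return " ".join(ocr(data) for data in images)
```

The string returned by ocr_text_for could then be tokenised and scored along
with the rest of the message text.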

Now I realise this would likely have a performance impact on email
processing, and also increase a user's data usage, as it could result in
Spambayes retrieving images from the internet to scan for character data. To
reduce this impact a number of options could be added to Spambayes...

1. By default OCR processing is switched off. Then people like me who
receive these Spam that contain images could turn it on only if required.

2. An option to specify that OCR processing of attached/referenced images
only be done in the background, in a similar way to the current Spambayes
background filtering option. This would mean that when emails first arrive
there would not be a pause while the images are being scanned.

3. A threshold to specify when and when not to perform OCR processing on
emails, similar to the existing thresholds for Spam and Suspect Spam. For
example....

Spam Threshold:    90%
Suspect Threshold: 15%
OCR Threshold:     5%
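
To make the three options concrete, here is a minimal sketch of how they
could combine into a single decision. The option names and the OcrOptions
class are hypothetical and do not exist in Spambayes. The sketch also assumes
one particular reading of the OCR threshold: OCR runs only on messages whose
text-only score is at or above the OCR threshold and below the Spam
threshold, since a message already classified as certain Spam gains nothing
from the extra work.

```python
from dataclasses import dataclass


@dataclass
class OcrOptions:
    """Hypothetical option names; none of these exist in Spambayes."""
    enabled: bool = False          # option 1: off unless the user turns it on
    background_only: bool = True   # option 2: never OCR while mail is arriving
    spam_threshold: float = 90.0   # example values from the list above
    suspect_threshold: float = 15.0
    ocr_threshold: float = 5.0


def should_run_ocr(score, opts, in_background):
    """Decide whether a message with this text-only score (0-100) gets OCR."""
    if not opts.enabled:
        return False
    if opts.background_only and not in_background:
        return False
    # Assumed reading: skip mail that is already certain Spam, and mail
    # that scores below the OCR threshold.
    return opts.ocr_threshold <= score < opts.spam_threshold
```

With these example values, a message scoring 10% would be OCR-scanned during
background filtering, while one scoring 3% or 95% would be left alone.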

Anyhow, it's just a thought. Maybe it's already been considered. What do you
think?

regards,
Michael