[Spambayes] Image spam

yahoo.de mpas1342 at yahoo.de
Sun Jun 11 09:55:56 CEST 2006



-----Original Message-----
From: spambayes-bounces at python.org
[mailto:spambayes-bounces at python.org] On Behalf Of Amedee Van Gasse
Sent: Saturday, 10 June 2006 21:09
To: spambayes at python.org
Subject: [Spambayes] Image spam (Was: is the database empty)



On Sat, June 10, 2006 15:16, Amedee Van Gasse said:
>
> On Sat, June 10, 2006 14:18, yahoo.de said:
>>
>> How could I train SB to recognize emails that contain advertisement
>> images for some product and so on? Say the email has no text, only an
>> image in the body! (I know there is image scanner software for this
>> purpose, but what could be done in such cases?)
>
> Image spam is indeed a problem. Otoh, in my personal experience it's only
> a problem in theory. In practice there are enough other spammy
> characteristics in such emails.
>
> I don't know about image scanners specifically for spam detection, but I
> think it's possible to feed emails through such image scanners before
> they are fed to spambayes.
>
> I can imagine one could make an ocr program that converts images to text
> (if possible) and attaches the text to the email, which is subsequently
> fed to spambayes. That way, spambayes virtually "reads" the image just
> like a human does.
>
> Actually, you suggest something interesting. I'm going to try a few
> things and if they work, I'll post it on the list.

Hello again,

I found an interesting program that might be exactly what you are looking
for: ocrad. It is GNU software, accepts pbm files or standard input, and
writes text to standard output. So it is a command-line OCR program that
can be used in a script. Don't worry about the pbm files: the ocrad manual
describes how to convert other image formats to pbm (jpeg, png, ps,
pdf,...)

So what you could do in a prefiltering script (like a procmail script) is:
* extract the images from the email
* convert them to pbm
* send the pbm files to ocrad
* attach the resulting text to the original mail
* finally, let spambayes do its magic
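
The prefiltering steps above could be sketched in Python roughly as
follows. This is only an illustration under several assumptions: the
netpbm tools named in CONVERTERS (jpegtopnm, pngtopnm, giftopnm) and ocrad
are installed and read the image from standard input, the message was
parsed with email.policy.default so that it is an EmailMessage, and the
helper names (extract_images, to_pnm, run_ocrad, attach_ocr_text) are
invented here and are not part of SpamBayes or ocrad.

```python
import subprocess

# Assumed netpbm converters keyed by MIME type (an illustrative choice; any
# tool that writes a PNM image to standard output would serve).
CONVERTERS = {
    "image/jpeg": "jpegtopnm",
    "image/png": "pngtopnm",
    "image/gif": "giftopnm",
}


def extract_images(msg):
    """Return (content_type, raw_bytes) for every image/* part of msg."""
    return [
        (part.get_content_type(), part.get_payload(decode=True))
        for part in msg.walk()
        if part.get_content_maintype() == "image"
    ]


def to_pnm(content_type, data):
    """Pipe image bytes through a netpbm tool; None if the type is unknown."""
    tool = CONVERTERS.get(content_type)
    if tool is None:
        return None
    done = subprocess.run([tool], input=data, stdout=subprocess.PIPE, check=True)
    return done.stdout


def run_ocrad(pnm_data):
    """Feed a PNM image to ocrad on standard input and return its text."""
    done = subprocess.run(["ocrad"], input=pnm_data, stdout=subprocess.PIPE, check=True)
    return done.stdout.decode("utf-8", errors="replace")


def attach_ocr_text(msg, convert=to_pnm, ocr=run_ocrad):
    """Attach the OCR text of all images in msg as one text/plain part."""
    texts = []
    for content_type, data in extract_images(msg):
        pnm = convert(content_type, data)
        if pnm is None:  # unsupported image type: skip it
            continue
        text = ocr(pnm).strip()
        if text:
            texts.append(text)
    if texts:
        msg.add_attachment("\n".join(texts), filename="ocr.txt")
    return msg
```

The convert and ocr arguments are plain callables, so a different
converter or OCR engine can be swapped in, and the function can be tried
out without any external tools installed. The resulting message would then
be handed to spambayes as usual.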

However, I am a bit concerned about the performance cost of running OCR on
every single image you receive. Also, I don't agree with the thesis that
image spam (or banner spam) will not be recognised as spam by spambayes. I
think spambayes *will* find enough tokens to give the mail a score that is
not unsure.
For the rare occasions when image spam does result in an unsure score, I
suggest the following strategy:

1. score the email with spambayes (preliminary score)
2. everything with score 1 is 100% sure spam (high spam), so dump it to
/dev/null (my thesis is that image spam will be caught most of the time)
3. for every mail that is "low ham", unsure, or "low spam" AND has an
image, convert the image(s) to pbm, ocr with ocrad, and attach text to
email
4. rescore the email with spambayes (final score)
5. continue with your usual filtering rules

Note:
high ham = 100% sure ham, messages with a score of 0.00
low ham = probably ham, but with a score > 0.00 (you can use other
threshold values)
low spam = probably spam, but with a score < 1.00
high spam = 100% sure spam, messages with a score of 1.00
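
As a starting point, the two-pass strategy and the four score buckets
might look like the sketch below. Everything here is an assumption made
for illustration: score, has_image and add_ocr_text are hypothetical
callables supplied by the caller (score would wrap whatever returns the
spambayes probability for a message), the cutoffs 0.20 and 0.90 are
example threshold values, and scores are rounded to two decimals so that
"0.00" and "1.00" compare the way they are displayed.

```python
def bucket(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score in [0, 1] to the buckets described in the note."""
    s = round(score, 2)
    if s <= 0.0:
        return "high ham"
    if s >= 1.0:
        return "high spam"
    if s < ham_cutoff:
        return "low ham"
    if s >= spam_cutoff:
        return "low spam"
    return "unsure"


def two_pass_score(msg, score, has_image, add_ocr_text):
    """Return the final score, running OCR only when it could matter."""
    prelim = score(msg)  # step 1: preliminary score
    if bucket(prelim) in ("high ham", "high spam") or not has_image(msg):
        return prelim  # already certain, or nothing to OCR
    return score(add_ocr_text(msg))  # steps 3 and 4: OCR, then rescore
```

A caller would then discard the message when bucket(final) is "high spam"
and pass everything else on to the usual filtering rules. Because OCR runs
only for the in-between buckets, most mail is scored exactly once.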


The actual implementation of these ideas is left as an exercise to the
reader :)

--

Amedee Van Gasse
-------------------------------------
>However I am a bit concerned about performance................

I think there is no other way if you want to know what is in the image
mail! You would then need more powerful features that can do the checking
without losing performance. For instance, if SpamBayes analysed several
emails simultaneously, in a parallel process with parallel threads, the
performance might not be as bad as one would think (just a thought!?).

Apart from that, in the past I received a lot of spam mails with only a
single image in the body, where the filter does not have much information
about the content to classify the email as ham or spam! This kind of spam
is used a lot by spammers at the moment. In this case the filter has only
the header information, which on its own is not significant enough to make
a decision.
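
The idea of handling several emails at the same time can be illustrated
with a small thread pool. This is a sketch only: process_one is a
hypothetical function that does the per-message work (for example OCR
followed by scoring), and the worker count of 4 is an arbitrary choice.
Threads are a plausible fit when most of the time is spent waiting on an
external OCR program, but whether this helps in practice depends on the
machine and the mail volume.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(messages, process_one, max_workers=4):
    """Apply process_one to every message using a pool of worker threads.

    Results come back in the same order as the input messages.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, messages))
```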


_______________________________________________
SpamBayes at python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html


	

	
		