[Spambayes] date for new release to handle image spam?

David Abrahams dave at boost-consulting.com
Thu Feb 1 17:55:32 CET 2007


"Seth Goodman" <sethg at goodmanassociates.com> writes:

<snip good stuff about how much more amazing the visual cortex is
than any OCR algorithm can be>
Yeah, sure, I know all that.

> Make OCR as "spam-specific" as you like, but it will require
> tweaking each time spammers change to an unusual font, background
> noise or text distortion.

Not necessarily.  There is voice recognition software that's resilient
against minor variations in accent, noise, and distortions.  In
principle, the same could apply to OCR spam recognition, given the
right models, so it wouldn't be "each time."

> I don't want to seem morose about this, but I don't believe it's a
> battle we can ultimately win.  It can still assist Spambayes
> classifying messages with image spam, but it's not a silver bullet.

Yeah.  The problem I'm having right now, I think, is that in those
messages where the image spam isn't successfully OCR'd, the garbage
words around the image get trained and degrade the overall performance
of my system.  Of course, that's just a guess, but it sure seems like
these days a lot more plain spam messages that ought to be recognized
as such are sneaking through than they used to.

> This is really a problem to be solved at the MTA with stricter
> connection rules.  

What did you have in mind?

> Nonetheless, I suspect that Spambayes could improve
> by creating more synthetic tokens that describe the image better and
> taking advantage of serendipitous differences between tokens for image
> spam and those in each user's ham.  I'm not sure what those attributes
> are, but it probably beats trying to keep up with a quickly evolving
> captcha.  Outlook doesn't help the situation, as it destroys much of the
> MIME armor that might provide useful spam clues.

Fortunately, I'm not an Outlook slave.
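
To make the "synthetic tokens" idea quoted above a bit more concrete,
here is a rough sketch of what such tokens might look like.  It is not
the Spambayes tokenizer; the function name image_tokens and the token
strings are made up for illustration, and it uses only the standard
Python 3 email package.  It emits one token for the image subtype, one
for a coarse power-of-two size bucket, and one when the part has no
filename.

```python
from email.message import Message


def image_tokens(msg: Message):
    """Yield hypothetical synthetic tokens for each image part of msg.

    The token names are illustrative assumptions, not Spambayes output.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # Decode base64/quoted-printable so the size reflects the real image.
        payload = part.get_payload(decode=True) or b""
        yield "image:subtype:" + part.get_content_subtype()
        # Coarse size bucket: the smallest n with len(payload) < 2**n.
        yield "image:size:2**%d" % len(payload).bit_length()
        if part.get_filename() is None:
            yield "image:no-filename"
```

Whether tokens like these separate image spam from a given user's ham
is an empirical question; the classifier would have to be trained on
them to find out.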

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


