[Spambayes] Spam Filter Based on Bayesian Techniques

Sun Jan 25 16:25:56 EST 2004

> I was wondering if anyone could guide/point me in an interesting
> direction in Spam Filtering which may be new and I could implement
> as part of my course. For example: Any kind of performance
> evaluation etc.

In my opinion, the most interesting way to improve SpamBayes would be
to make it work well with web pages. There's some spam around these
days that tries to evade filters like SpamBayes by containing only
bland words and pointing to a web page that, presumably, contains the
real advertisement. Something like:

Hey, dude, check this out:

http://www.example.com/

Various people have found that it's effective to retrieve and score
web pages that are linked from emails that score in an "unsure"
range, typically something like 0.2 to 0.8. That often works, but
it's sometimes wrong, and sometimes right for the wrong reason. I've
only glanced at the data, but at least part of the cause seems to be
that web pages don't look all that much like emails, so scoring a web
page against tokens learned from emails doesn't produce results that
are as good as the algorithm is capable of.
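
A rough sketch of the retrieve-and-score idea, in plain Python rather
than actual SpamBayes code. Everything named here is an assumption for
illustration: `score` stands in for whatever classifier you have (any
callable that maps a list of tokens to a probability), the tokenizing
is a crude word split and not the SpamBayes tokenizer, and `fetch` and
`rescore_with_links` are made-up helper names. Only the 0.2 to 0.8
range comes from the text above.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

UNSURE_LOW, UNSURE_HIGH = 0.2, 0.8  # the "unsure" range mentioned above

URL_RE = re.compile(r"https?://[^\s<>\"']+")


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def page_tokens(html):
    """Crude word tokens from a page's visible text (an assumption,
    not the SpamBayes tokenizer)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9$'!-]+", " ".join(parser.chunks).lower())


def fetch(url, timeout=10):
    """Download a page (hypothetical helper; errors go to the caller)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def rescore_with_links(body, score, fetch_page=fetch):
    """Score the body; if the result is unsure, add tokens from the
    linked pages and score again."""
    tokens = body.lower().split()
    first = score(tokens)
    if not (UNSURE_LOW <= first <= UNSURE_HIGH):
        return first  # confident already, so no network access
    for url in URL_RE.findall(body):
        try:
            tokens.extend(page_tokens(fetch_page(url)))
        except OSError:
            continue  # page gone or unreachable: keep what we have
    return score(tokens)
```

Note that this simply mixes page tokens in with the email tokens, which
is exactly the approach that can be right for the wrong reason. Scoring
the page tokens separately, against a database trained on web pages,
would be the variation worth experimenting with.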

There are some interesting challenges:

Where to get a supply of "hammy" web pages for the individual user

How to store the tokens from the web pages so that a message can be
re-classified later, since the pages are apt to change or disappear

And, no doubt, any number of others
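
For the storage question, one possible starting point is a small cache
that records the tokens seen at each URL along with the time they were
fetched, so a message can be re-scored later without going back to the
live page. This is only a sketch under my own assumptions: the class
name `PageTokenCache`, the SQLite table layout, and keying by URL alone
(rather than by message and URL) are all invented for illustration.

```python
import json
import sqlite3
import time


class PageTokenCache:
    """Remember the tokens seen at each URL (hypothetical design)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, fetched REAL, tokens TEXT)"
        )

    def put(self, url, tokens, now=None):
        """Store (or overwrite) the tokens for a URL with a timestamp."""
        fetched = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
            (url, fetched, json.dumps(list(tokens))),
        )
        self.db.commit()

    def get(self, url):
        """Return (fetched_time, tokens), or None if the URL is unknown."""
        row = self.db.execute(
            "SELECT fetched, tokens FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            return None
        return row[0], json.loads(row[1])
```

Keying by URL alone means a later fetch overwrites the earlier tokens.
If the goal is to reproduce the original classification exactly, the
key would need to include the message as well.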

Regards,
Matt