OT: spam filtering idea

Paul Rubin phr-n2002b at NOSPAMnightsong.com
Mon Jan 13 10:26:00 EST 2003


Yeah there are better newsgroups for it, but this is where I hang out
and the subject has come up here.  I just thought of it after seeing
the "join our community" thing here on c.l.py.

Anyway, I wonder if the following is a worthwhile hack to improve
Bayesian filtering.  Maybe it's already being done--I haven't seen it
done quite this way, but I'm not a spam filtering guru.

The idea is to run the probability coefficients through a digital
filter, so the probabilities decay over time.  That is, you give
special emphasis to words found in RECENTLY RECEIVED spam.  If you get
message M with the words "banana", "elephant", and "doorknob", that
doesn't make M especially likely to be spam.  But if you got a
piece of spam YESTERDAY with that combination of words, then M is
almost certainly also spam.  That lets you crank up the probabilities
for newly arrived spam words to considerably higher levels than you'd
trust in a quasi-static corpus (keep "elephant" high on your list for
too long and it may create false positives later).
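
To make the decay idea concrete, here's a rough sketch of one way it
could work: each word gets a weight that halves every so many hours (a
first-order exponential filter), and the per-word spam probability is
computed from the decayed weights.  The names DecayingCounts and
spam_probability, the 24-hour half-life, and the prior/strength
smoothing are all made up for illustration; none of this comes from an
existing filter.

```python
import math


class DecayingCounts:
    """Per-token weights that decay exponentially with age.

    Hypothetical helper: each sighting adds 1.0 to a token's weight,
    and the weight halves every `half_life_hours`.
    """

    def __init__(self, half_life_hours=24.0):
        # Decay rate per second, derived from the chosen half-life.
        self.rate = math.log(2) / (half_life_hours * 3600.0)
        self.weights = {}  # token -> weight as of its last update
        self.stamps = {}   # token -> time of last update (seconds)

    def weight(self, token, now):
        """Return the token's weight decayed to time `now`."""
        if token not in self.weights:
            return 0.0
        age = max(0.0, now - self.stamps[token])
        return self.weights[token] * math.exp(-self.rate * age)

    def observe(self, tokens, now):
        """Record one message containing `tokens`, received at `now`."""
        for token in set(tokens):
            self.weights[token] = self.weight(token, now) + 1.0
            self.stamps[token] = now


def spam_probability(token, spam, ham, now, prior=0.5, strength=1.0):
    """Smoothed estimate of P(spam | token) from decayed weights.

    `prior` and `strength` are assumed smoothing parameters that pull
    rarely seen tokens toward a neutral score.
    """
    s = spam.weight(token, now)
    h = ham.weight(token, now)
    return (s + prior * strength) / (s + h + strength)
```

With these assumed defaults, a word seen once in spam and never in
good mail scores 0.75 right away and drifts back to about 0.67 a day
later, which is the "don't keep elephant high for too long" behavior.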

The next step is to collect the frequency statistics at various
honeypots around the net, automatically combining them and transferring
them to public databases.  Your filter can then retrieve new
statistics over the net every few hours.  Any spam you receive will
probably also hit a honeypot at about the same time that you get it.
So since the statistics you've retrieved are weighted for the latest
and freshest spam, you should be able to kill it very effectively.
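
Combining the honeypot statistics might look something like the sketch
below.  It assumes each honeypot publishes a dict mapping a word to a
(weight, timestamp) pair; that report format, the merge_reports name,
and the 24-hour half-life are inventions for the example, not an
existing protocol.  Each weight is decayed to a common "now" before
summing, so older reports count for less.

```python
import math


def merge_reports(reports, now, half_life_hours=24.0):
    """Merge per-honeypot word weights into one decayed table.

    `reports` is an iterable of dicts: token -> (weight, timestamp),
    with timestamps in seconds.  This format is hypothetical.
    """
    rate = math.log(2) / (half_life_hours * 3600.0)
    merged = {}
    for report in reports:
        for token, (weight, stamp) in report.items():
            # Decay each honeypot's weight to the common time `now`.
            age = max(0.0, now - stamp)
            decayed = weight * math.exp(-rate * age)
            merged[token] = merged.get(token, 0.0) + decayed
    return merged
```

A client would fetch the reports every few hours and feed the merged
weights into its own filter.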

In case you get the spam faster than the honeypots do, you may not
want to immediately Bayes-filter all incoming mail into spam- and
non-spam folders.  Instead, you'd only immediately deliver mail from
addresses on your whitelist.  Anything else, you'd hold for, say, 6
hours, then run it through the Bayes filter for categorization.  Since
spam tends to be sent out in batches a few hours long, that delay
should be enough for the honeypots to receive it and update the
databases.
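
The hold-then-filter step could be sketched roughly like this.  The
HoldQueue class and its method names are hypothetical, and the 6-hour
figure is just the guess from above.  Whitelisted senders go straight
through; everything else sits in a queue until its hold time is up, at
which point you'd hand it to the Bayes filter (not shown).

```python
import heapq

HOLD_SECONDS = 6 * 3600  # assumed hold period of 6 hours


class HoldQueue:
    """Deliver whitelisted mail at once; hold the rest for a while."""

    def __init__(self, whitelist, hold_seconds=HOLD_SECONDS):
        self.whitelist = {addr.lower() for addr in whitelist}
        self.hold_seconds = hold_seconds
        self._heap = []  # entries: (release_time, seq, sender, message)
        self._seq = 0    # tie-breaker so messages are never compared

    def receive(self, sender, message, now):
        """Return the message if deliverable now, else hold it and return None."""
        if sender.lower() in self.whitelist:
            return message
        release_time = now + self.hold_seconds
        heapq.heappush(self._heap, (release_time, self._seq, sender, message))
        self._seq += 1
        return None

    def release(self, now):
        """Return (sender, message) pairs whose hold period has expired."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, _, sender, message = heapq.heappop(self._heap)
            ready.append((sender, message))
        return ready
```

Whatever release() returns is what you'd classify, using statistics
fetched after the messages arrived.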

Thoughts?

Feel free to crosspost replies to an anti-spam newsgroup; I don't know
which one to use.



