[spambayes-dev] Network checks

Martinez, Michael MMARTINEZ at CSREES.USDA.GOV
Sat Sep 20 15:09:43 EDT 2003


That's cool and all ... I mean, you've done a good job writing this book, but my users and I are happy with SpamBayes. I just want some help developing a lightweight version.

Mike 

--------------------------
Sent from my BlackBerry Wireless Handheld


-----Original Message-----
From: Sean R. Lynch <seanl at chaosring.org>
To: spambayes-dev at python.org <spambayes-dev at python.org>
Sent: Sat Sep 20 14:24:52 2003
Subject: [spambayes-dev] Network checks

In attempting to develop an integrated mail system that the average person
can use, I've come to the conclusion that Bayesian filtering alone just
isn't enough.

The main problem is the training time. Bayesian filters work best when
they are trained on the user's mail *and* the training set is accurate.
When experimenting on my dad, I have found the training set that he
developed to be far from what you and I would consider accurate; he
considers stuff he's interested in to be non-spam regardless of how spammy
it is, and non-spam that he's not interested in ends up in his spam folder.
However, if something ends up in his spam quarantine, he will leave it
there unless it's really something he's interested in, because of the
extra effort to release it from quarantine.

What this seems to indicate is that the best way to develop a good training set
for my dad is to have a good filter to begin with. SpamAssassin seems like
it would be reasonable, but if I'm gonna use SpamAssassin, why not use its
built-in Bayesian filter? The main reason I won't is that I really want to
use SpamAssassin's network checks, and IMHO it's bad netizenship to run
them more than once on the same message, and enough messages go to
multiple users on my server that I'd really like to run SA as a content
filter.

I think that Bayesian filters really need to include their training time
in performance analyses, rather than just comparing their ultimate
performance after being trained. The "best" of the Bayesian filters seem
to require the longest training times, and I don't really consider this to
be a good thing, because "training time" really translates to both false
positives and false negatives (an unsure is a false negative as far as I'm
concerned).

If IP addresses, email addresses (in the body), domains, and URLs could be
shared among users of Bayesian filters, I think this would reduce training
time significantly: there are so many of each out there that any one user's
filter takes a long time to see enough of them, yet they have the potential
to be the biggest spam clues.

For relay IP addresses, I've been thinking of just keeping counts of spam
and ham for each of them and using DNS TXT records to distribute this
information. The counts would be submitted via a CGI or XMLRPC or
something, and the DNS zone would be regenerated every hour. This would
not be a blacklist and it wouldn't say anything technical or moral about
the host listed, just that people marked this many non-spams and this many
spams from this host.
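
Just to make the lookup side concrete, here is a rough, untested sketch of
what a client query might look like. The zone name (relay-stats.example.org)
and the "spam=N ham=M" TXT format are made up for illustration, and it uses
the dnspython package:

import dns.resolver

def relay_counts(ip):
    """Return (spam, ham) counts reported for a relay IP, or None."""
    # Reverse the octets, DNSBL-style: 203.0.113.7 -> 7.113.0.203
    name = ".".join(reversed(ip.split("."))) + ".relay-stats.example.org"
    try:
        answer = dns.resolver.resolve(name, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return None  # nobody has reported this relay yet
    counts = {}
    for record in answer:
        # Each TXT record is expected to look like b"spam=123 ham=45"
        for part in record.strings[0].decode().split():
            key, _, value = part.partition("=")
            counts[key] = int(value)
    return counts.get("spam", 0), counts.get("ham", 0)

print(relay_counts("203.0.113.7"))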

Email addresses, domains, and URLs are harder, because IMHO they can
really only be used as spam clues if they're going to be shared. These checks
could be done by comparing email addresses and URLs in the message to
blacklists, and using the result as a feature for the Bayesian filter.
This way, the spammer could include as many non-spam URLs and emails as
they wanted without being able to tip the balance toward non-spam.
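
Something like the following (untested) sketch is what I have in mind. The
blacklist here is a hard-coded stand-in for whatever shared source we end up
with, and the token name is arbitrary; only hits produce tokens, so padding
a message with innocent URLs buys the spammer nothing:

import re

# Stand-in for a shared blacklist of spamvertised domains.
BLACKLISTED_DOMAINS = {"cheap-pills.test", "spamvertised.example"}

URL_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@([A-Za-z0-9.-]+)")

def blacklist_tokens(body):
    """Yield synthetic tokens for the Bayesian filter from blacklist hits."""
    domains = set(URL_RE.findall(body)) | set(EMAIL_RE.findall(body))
    for domain in domains:
        if domain.lower() in BLACKLISTED_DOMAINS:
            yield "blacklisted-domain:" + domain.lower()

print(list(blacklist_tokens("Order at http://cheap-pills.test/offer now")))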

The other things I was thinking of including are phone numbers and snail
mail addresses, because these would cover a large number of the spams that
don't have URLs or email addresses in the body. Almost all spams have
*some* sort of contact information, unless they're chain letters, which
can be filtered out by other means. 
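
For phone numbers, a crude extractor along these lines (US-centric and
untested, just to show the shape of it) would probably catch most of them;
spammers who obfuscate the number are a separate fight:

import re

# Rough US-style phone number pattern; a real version would need to
# handle international formats and spaced-out digits.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def phone_tokens(body):
    """Yield one token per distinct phone number found in the body."""
    for number in set(PHONE_RE.findall(body)):
        digits = re.sub(r"\D", "", number)
        yield "phone:" + digits

print(list(phone_tokens("Call now! (800) 555-0123 or 800.555.0199")))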

All of these checks could be integrated into SpamAssassin (does SA already
check URLs and stuff in the body against blacklists?), but I think it
might be better to use them to generate more features for the Bayesian
filters to use for classification... some sort of script that just adds a
bunch of keywords to the headers based on the result of network checks.
This combined with a pre-trained global database that only handles
features that are missing from the user's own database (a la SpamProbe)
would be great for a commercial spam filtering engine that requires no
training time to be decent, and becomes very good with only a little
training.
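
As a toy sketch of the kind of wrapper script I mean (the header name and
the stubbed-out checks are made up; the real ones would be the relay
lookup, the URL blacklist check, and so on):

import email
import sys

def network_check_keywords(msg):
    """Return keywords summarizing network-check results for a message."""
    keywords = []
    # e.g. keywords.append("relay-mostly-spam")
    #      keywords.append("body-url-blacklisted")
    return keywords

def annotate(raw_message):
    """Add a header that the Bayesian tokenizer can pick up as features."""
    msg = email.message_from_string(raw_message)
    keywords = network_check_keywords(msg)
    if keywords:
        msg["X-Network-Checks"] = " ".join(keywords)
    return msg.as_string()

if __name__ == "__main__":
    # Filter-style usage: read a message on stdin, write it back annotated.
    sys.stdout.write(annotate(sys.stdin.read()))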

I'll post some code eventually, but it would be great to get some feedback
on the idea before I start coding. I am thinking about doing the relay
statistics service first since that would be fairly widely useful.




_______________________________________________
spambayes-dev mailing list
spambayes-dev at python.org
http://mail.python.org/mailman/listinfo/spambayes-dev



