[Spambayes] Server-side setup for corporate usage

Tue Dec 30 12:05:07 EST 2003

Hello,

I've set up a server-side SpamBayes filter system.  This is probably 
breakable, and could use some improvement.  It's also not done yet. 
When things are complete, I'll stick the outline up somewhere on the web.

Here's the platform & stats:
Pentium 700 MHz, 128 MB RAM, 1 IDE HD
SuSE Linux, Postfix, Cyrus
Roughly 10 to 20 thousand emails / day, mostly spam :(
Load average: 1.5 to 4.0 (never below 1.2)
Postfix queue limited to three days

main.cf:
mailbox_transport = cyrus

master.cf:
smtp inet n - n - 12 smtpd -o content_filter=spambayes:
smtp unix - - n - 12 smtp
cyrus unix - n n - 12 pipe
   user=cyrus argv=/usr/lib/cyrus/bin/deliver -e -r ${sender} -m ${extension} ${user}
spambayes unix - n n - 12 pipe
   user=nobody argv=/usr/bin/hammiefilter.sh $sender $recipient

Note the process limits of 12.  Default is 100, which brings the system 
to a crawl (load average: 80+ without spambayes).  YMMV, esp. with SMP.
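
For anyone unsure which column the "12" sits in: going by the field order documented in master(5), it is the seventh field, maxproc. A "-" there means the default limit applies. Here is a minimal Python sketch that names the fields of a single logical master.cf line; the function name parse_master_entry is made up for illustration and is not part of Postfix or SpamBayes.

```python
# Field order of a master.cf entry, as documented in the Postfix master(5) manual page.
FIELDS = ("service", "type", "private", "unpriv", "chroot", "wakeup", "maxproc")


def parse_master_entry(line):
    """Split one logical master.cf line into its named fields plus the command.

    Hypothetical helper for illustration only; continuation lines must already
    be joined onto the entry's first line.
    """
    parts = line.split(None, len(FIELDS))
    if len(parts) <= len(FIELDS):
        raise ValueError("incomplete master.cf entry: %r" % line)
    entry = dict(zip(FIELDS, parts))
    entry["command"] = parts[len(FIELDS)]
    return entry


entry = parse_master_entry("smtp inet n - n - 12 smtpd -o content_filter=spambayes:")
print(entry["service"], entry["type"], "maxproc =", entry["maxproc"])
```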

To newbies: note the different "smtp inet" and "smtp unix" lines.  That 
one threw me for a couple of days.  The instructions in the FAQ (?) on 
the SpamBayes website show the smtp inet line.  Don't put the 
content_filter option on the smtp unix line: that one is the outgoing 
SMTP client, and the filter won't run.

hammiefilter.sh is attached.  It is an adaptation of the hammiefilter 
found in the server-side setup instructions on the SpamBayes website.

To populate the user-specific databases, I use a Perl script (also 
attached).

The way it works:
1) Postfix receives an email
This next part I'm not quite sure about, but anyway...
2) Postfix uses the 'cyrus' transport,
3) which calls "deliver"
4) which uses the "smtp inet" transport
5) which calls smtpd -o content_filter (which filters mail text through 
an external filter)
(I'm pretty sure about the rest)
6) which uses /usr/bin/hammiefilter.sh to call sb_filter.sh
7) which uses /var/spambayes/hammie-$username.db to add an 
X-SpamBayes-Classification header to the email.
8) Something magic happens, and the mail arrives in my inbox.
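
To make steps 6 and 7 a bit more concrete, here is a rough Python sketch of the per-user tagging idea. It is not the attached hammiefilter.sh and it does not call SpamBayes itself: the score is passed in as a plain number, the cutoffs 0.20 and 0.90 are assumed values, and the function names are invented for illustration. Only the database path pattern and the header name come from the setup described above.

```python
import email
from email import policy

# Assumed cutoffs for this sketch; check your own SpamBayes configuration.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90
# Path pattern from the setup described above.
DB_TEMPLATE = "/var/spambayes/hammie-%s.db"


def db_for_recipient(recipient):
    """Map 'user@domain' (or a bare user name) to that user's database path."""
    user = recipient.split("@", 1)[0]
    return DB_TEMPLATE % user


def classify(score):
    """Turn a spam probability between 0.0 and 1.0 into a classification word."""
    if score > SPAM_CUTOFF:
        return "spam"
    if score < HAM_CUTOFF:
        return "ham"
    return "unsure"


def tag_message(raw_bytes, score):
    """Return the message with a single X-SpamBayes-Classification header set."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    del msg["X-SpamBayes-Classification"]  # drop any header a sender forged
    msg["X-SpamBayes-Classification"] = classify(score)
    return msg.as_bytes()
```

In a real filter the score would come from the classifier trained on the recipient's own database, and the tagged message would then be handed back to Postfix for delivery.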

For training:
1a) User receives spam
1b) User receives ham
2a) User forwards said spam to spam@domain.com (domain is the client's 
mail domain -- i.e., if I set this thing up for python.org, said user 
would forward to spam@python.org)
2b) User forwards said ham to ham@domain.com
3) Perl script runs every 10 minutes, checks the ham and spam accounts, 
and trains each message appropriately against 
/var/spambayes/hammie-$username.db.
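
The routing decision in step 3 can be sketched roughly like this in Python. This is not the attached Perl script. It assumes the forwarding user is identified by the From header of the forwarded message, which is my guess at the simplest approach, and the function name training_job is made up. The actual training call is left out.

```python
from email import message_from_bytes, policy
from email.utils import parseaddr

# Path pattern from the setup described above.
DB_TEMPLATE = "/var/spambayes/hammie-%s.db"


def training_job(raw_bytes, mailbox):
    """Return (db_path, is_spam) for one message found in the ham or spam mailbox.

    Assumption: the user who forwarded the message is whoever appears in From.
    """
    if mailbox not in ("ham", "spam"):
        raise ValueError("unknown training mailbox: %r" % mailbox)
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    _name, addr = parseaddr(str(msg.get("From", "")))
    user = addr.split("@", 1)[0]
    if not user:
        raise ValueError("cannot tell who forwarded this message")
    return DB_TEMPLATE % user, mailbox == "spam"
```

A poller would call this for every message it finds, train the named database as spam or ham accordingly, and then remove the message from the training mailbox.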

Make sense?  Me neither.  I'm sure it will, though.

I've had the standard excellent results (with the notable exception of 
the aforementioned identity theft scams, except they're well-done eBay 
scams rather than PayPal ones).

As a side note: I set up Mozilla Thunderbird to label messages with 
different colors based on the value of the X-SpamBayes-Classification 
header.  Thus, it's trivial to train on unsures (they show up orange), 
false positives, et cetera.

Question: Will extra data resulting from forwarding (such as the 
"---Original Message---" line placed by Thunderbird) poison the 
database?  If I train with equal spam and ham, it *shouldn't* -- am I 
correct?

Chris
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hammiefilter.sh
Type: text/x-sh
Size: 540 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20031230/1458ce36/hammiefilter.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poll_ham-spam_mboxes.pl
Type: text/x-perl
Size: 2368 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20031230/1458ce36/poll_ham-spam_mboxes.bin