[Spambayes] Ideas for an MSc project please...

Christopher Jastram cej at intech.com
Mon Feb 9 01:00:17 EST 2004


dont bother wrote:

>>4) Improving Bayesian spam filtering at the SMTP
>>gateway level. Why is
>>it less effective, what can be done to improve it,
>
>Hey, can you elaborate on that? I am a newbie, so if you
>could explain this to me step by step, it would be
>great.
>Thanks
>dont

Sure.

Providing a point-and-click installer that makes "Delete as Spam" and 
"Recover from Spam" buttons magically appear on the Outlook toolbar is cool.

Asking users to forward spam to "spam at company.com" and an equal amount 
of ham to "ham at company.com" is a PITA for all involved.  (Never mind 
trying to explain what "ham" is...)
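
A rough sketch of what that forwarding scheme implies on the server 
side, in Python. Everything here is hypothetical (the UserCounts class, 
the stores dict, the whitespace tokenizer, the de-obfuscated 
"spam@company.com" / "ham@company.com" addresses); it is not how 
Spambayes does it, just an illustration of keeping one set of counts 
per user:

```python
from collections import Counter, defaultdict


class UserCounts:
    """Hypothetical per-user training data: token counts for spam and ham."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0  # number of spam messages this user has forwarded
        self.nham = 0   # number of ham messages this user has forwarded


# One store per sender, so users never share a database.
stores = defaultdict(UserCounts)


def tokenize(text):
    # Deliberately naive tokenizer (an assumption): unique lowercase words.
    return set(text.lower().split())


def train_from_forward(sender, recipient, body):
    """Update the sender's own counts, based on the training address used."""
    kind = recipient.split("@", 1)[0].lower()
    if kind not in ("spam", "ham"):
        raise ValueError("not a training address: %r" % recipient)
    store = stores[sender.lower()]
    tokens = tokenize(body)
    if kind == "spam":
        store.spam.update(tokens)
        store.nspam += 1
    else:
        store.ham.update(tokens)
        store.nham += 1
    return store
```

Even in this toy form, the user has to remember two addresses and 
forward both kinds of mail by hand, which is the part nobody enjoys.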

Also, server-side filtering is a total f**k to set up (pardon the 
profanity), especially in a user-specific manner (since Bayesian 
filtering really doesn't work when multiple users share the same 
database).  It also takes up a snotload of resources, which is Not A 
Good Thing(tm) on a busy mail server.  For example, before the MyDoom 
virus, we were processing 10 to 11 thousand emails every day.  When 
MyDoom hit, we started processing 350 thousand emails.  That filled up 
the SYN_RECV queue and brought the machine (and our network) to its 
knees.  The first thing I did was strip the Bayesian filtering out, and 
I promptly watched the mail throughput quadruple.  Server-side Bayesian 
filtering (or any content filtering, for that matter) is *expensive*.  
We are currently purchasing two 64-bit AMD 3GHz machines with mirrored 
hard drives to handle this kind of load, because we CANNOT let valuable 
mail bounce.  (We were running a 667 MHz Celeron w/ 128 MB of RAM.)
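
To put numbers on it: going from about 10,500 emails a day to 350 
thousand is roughly a 33-fold jump (assuming the 350 thousand is also a 
daily figure). What I did by hand, pulling the filter out under load, 
could in principle be automated. A minimal Python sketch, where 
handle_message, the queue_depth argument, the 500-message limit and the 
0.9 cutoff are all made-up names and numbers for illustration, not 
anything from Spambayes or a real MTA:

```python
def handle_message(msg, queue_depth, classify, max_depth=500, cutoff=0.9):
    """Decide what to do with one message at the gateway.

    queue_depth: how many messages are currently waiting (supplied by
                 the caller; how you measure it depends on your MTA).
    classify:    any callable returning a spam probability in [0, 1].
    """
    # Under heavy load, skip the expensive content filter entirely:
    # delivering some spam is better than bouncing valuable mail.
    if queue_depth > max_depth:
        return "deliver-unfiltered"
    # Normal load: pay for the content scan.
    score = classify(msg)
    return "quarantine" if score >= cutoff else "deliver"
```

The trade-off is that a flood lets more junk through for a while, but 
the server stays up and real mail keeps moving.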

Hope this hard-edged voice of experience helps a little.  :)

Christopher Jastram


