[Spambayes] Matt Sergeant: Introduction

Matt Sergeant msergeant@startechgroup.co.uk
Mon, 30 Sep 2002 13:49:46 +0100


Just wanted to stick my hand up and say "Hi". I've been following this 
list on gmane.org for a while now (it's a mail-to-NNTP gateway for those 
interested in following multiple technical mailing lists in a read-only 
fashion), but decided it was time to bite the bullet and subscribe, 
mostly because I keep itching to reply to certain posts :-)

First of all, I'm a Perl guy (boo, hiss). Some of you may or may not have 
heard of me from the Perl community (I won the ActiveState awards for 
Perl and XSLT this year, though I didn't really deserve the XSLT one 
<grin>). I'm also actively involved in the SpamAssassin project. By day 
I work as an anti-spam technologist (one of the lucky few to actually 
work day-in day-out to fight this nuisance) for MessageLabs. I was doing 
Bayesian probability techniques for spam detection before Paul Graham 
published his article and set the world on fire. Like you all, I 
discovered very quickly that it's the tokenisation techniques that are 
the biggest "win" when it comes down to it.

To answer the curious, yes we're going to add some sort of Bayesian 
technique to SpamAssassin - one of our developers has written code 
independently from my stuff at work (because I'm contract bound not to 
give that away) that uses SpamAssassin in a "training" mode before 
switching on to full scoring mode. It'll basically work much like the 
other SA rules, where the probability gives a score. If we go with 
Graham it'll likely be boolean, if we go with Robinson we can give a 
gradual score range. But that's mostly out of my hands (due to day-job 
conflicts).
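
To illustrate the boolean-versus-gradual distinction, here is a minimal
sketch in Python. It is not SpamAssassin code and not the code described
above; it only uses the published Graham and Robinson combining formulas.
The function names, the 0.9 threshold, and the 5-point maximum are
hypothetical, and the per-token probabilities are assumed to be clamped
away from exactly 0 and 1 already.

```python
import math

def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    prod_p = math.prod(probs)
    prod_q = math.prod(1.0 - p for p in probs)
    return prod_p / (prod_p + prod_q)

def robinson_combine(probs):
    """Robinson-style combining using geometric means (no CLT)."""
    n = len(probs)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0

def boolean_score(prob, threshold=0.9, points=5.0):
    """Hypothetical all-or-nothing rule: full points above the threshold."""
    return points if prob > threshold else 0.0

def graded_score(prob, max_points=5.0):
    """Hypothetical graded rule: scale the range 0.5..1.0 onto 0..max_points."""
    return max_points * max(0.0, (prob - 0.5) * 2.0)

# Four mildly spammy tokens (made-up values for illustration).
token_probs = [0.9, 0.8, 0.7, 0.6]
g = graham_combine(token_probs)
r = robinson_combine(token_probs)
print(boolean_score(g), graded_score(r))
```

With these made-up inputs the Graham result lands very close to 1 while
the Robinson result sits in the middle of the upper range, which is why
the first suits a yes/no rule and the second a sliding score.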

Anyway, just wanted to say "Hi", and to let you know that I have 
converted the PG (final) code to Perl, and it worked well, and I've done 
the Robinson stuff without the central limit theorem and it didn't work 
quite as well, so I'm hopefully going to get CLT done this week and see 
how it fares. Unfortunately I find Python incredibly difficult to read, 
so it takes me a while!

Oh, and I'll also be talking at Paul Graham's spam conference about 
doing spam detection at the internet level (at MessageLabs you point 
your MX to us, and we "clean" your email then forward it on), and the 
issues that gives rise to, such as how the probability stuff works so 
much better on individuals' corpora (or on a particular mailing list's 
corpus) than it does for hundreds of thousands of users.

Have fun,
Matt.