[Spambayes] Matt Sergeant: Introduction
Matt Sergeant
msergeant@startechgroup.co.uk
Mon, 30 Sep 2002 13:49:46 +0100
Just wanted to stick my hand up and say "Hi". I've been following this
list on gmane.org for a while now (it's a mail-to-NNTP gateway for those
interested in following multiple technical mailing lists in a read-only
fashion), but decided it was time to bite the bullet and subscribe,
mostly because I keep itching to reply to certain posts :-)
First of all, I'm a Perl guy (boo, hiss). Some of you may or may not have
heard of me from the Perl community (I won the ActiveState awards for
Perl and XSLT this year, though I didn't really deserve the XSLT one
<grin>). I'm also actively involved in the SpamAssassin project. By day
I work as an anti-spam technologist (one of the lucky few to actually
work day-in day-out to fight this nuisance) for MessageLabs. I was doing
Bayesian probability techniques for spam detection before Paul Graham
published his article and set the world on fire. Like you all, I
discovered very quickly that it's the tokenisation techniques that are
the biggest "win" when it comes down to it.
To answer the curious, yes we're going to add some sort of Bayesian
technique to SpamAssassin - one of our developers has written code
independently of my stuff at work (because I'm contract bound not to
give that away) that uses SpamAssassin in a "training" mode before
switching on to full scoring mode. It'll basically work much like the
other SA rules, where the probability gives a score. If we go with
Graham's method the score will likely be boolean; if we go with Robinson's
we can give a gradual score range. But that's mostly out of my hands (due to day-job
conflicts).
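
To illustrate the boolean-versus-gradual distinction, here is a minimal
Python sketch of the two combining rules as they are commonly described.
It is not code from SpamAssassin or spambayes: the function names and the
sample clue list are hypothetical, per-token probabilities are assumed to
lie strictly between 0 and 1, and the Robinson variant shown is the
geometric-mean form without the central limit theorem, rescaled to [0, 1].

```python
import math


def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p)).

    Assumes every p is strictly between 0 and 1 (hypothetical helper).
    """
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def robinson_combine(probs):
    """Robinson-style combining using geometric means (no CLT).

    P measures "spamminess", Q measures "hamminess"; S = (P - Q) / (P + Q)
    lies in [-1, 1] and is rescaled to [0, 1] here.
    """
    n = len(probs)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0


# Made-up token probabilities: three spammy clues and one hammy clue.
clues = [0.9, 0.9, 0.9, 0.2]
print("Graham:  ", graham_combine(clues))
print("Robinson:", robinson_combine(clues))
```

With these made-up inputs the Graham result is pushed very close to 1,
while the Robinson result stays in the middle of the range, which is why
the former suits a yes/no rule and the latter a graded score.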
Anyway, just wanted to say "Hi", and to let you know that I have
converted the PG (final) code to Perl, and it worked well, and I've done
the Robinson stuff without the central limit theorem and it didn't work
quite as well, so I'm hopefully going to get CLT done this week and see
how it fares. Unfortunately I find Python incredibly difficult to read,
so it takes me a while!
Oh, and I'll also be talking at Paul Graham's spam conference about
doing spam detection at the Internet level (at MessageLabs you point
your MX to us, and we "clean" your email then forward it on), and the
issues that gives rise to, such as how the probability stuff works so
much better on individuals' corpora (or on a particular mailing list's
corpus) than it does for hundreds of thousands of users.
Have fun,
Matt.