[Spambayes] Tokenising clues

Anthony Baxter anthony@interlink.com.au
Wed, 02 Oct 2002 00:22:16 +1000


>>> Matt Sergeant wrote
> This seems like a vast waste of your time to me. There's a couple of 
> projects out there that have already spent vast amounts of time and 
> programming effort into figuring out these other clues that spambayes 
> misses out on. Rather than repeating that work, why not just rip all the 
> rules out of SpamAssassin or some other spam checking project wholesale, 
> and stuff those into your database?

The problems are that

  - many of the existing tools are of the form "if this header says _this_,
    it indicates spamminess of _this_ much". The work here is more about
    finding clues that work without anyone having to hand-produce magic
    numbers for what a particular header value means.

  - a lot of the problems come from the testing corpuses (yes, I know
    the word is corpora, corpuses looks cooler :) and their mixed
    nature. This rules out a bunch of "obvious" tricks.

  - SpamAssassin, in particular, is written in Perl. I tried looking
    through it to grok clues and started having twitches and convulsions.
    Been through the perl horror, not going back :) 
    I couldn't find a simple doco of "here's what SA looks at" in the docs.
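
To make the first point concrete, here is a minimal Python sketch of the
difference. It is not the actual spambayes code: tokenize(), train() and
HAND_TUNED_SCORES are hypothetical names, the rule scores are made up
rather than taken from SpamAssassin, and the per-token ratio is a
simplified stand-in for a real classifier (no smoothing, no combining of
clues). The point is only that the numbers fall out of the training
messages instead of being typed in by a person.

```python
from collections import Counter

# Rule-style approach: a person picks each score by hand.
# These values are invented for illustration only.
HAND_TUNED_SCORES = {"subject_all_caps": 2.5, "no_real_name": 1.0}


def tokenize(msg):
    """Hypothetical tokeniser: the set of lowercased whitespace-split words."""
    return set(msg.lower().split())


def train(spam_msgs, ham_msgs):
    """Learn a spam ratio per token from labelled, non-empty message lists."""
    spam_counts, ham_counts = Counter(), Counter()
    for m in spam_msgs:
        spam_counts.update(tokenize(m))   # count messages containing the token
    for m in ham_msgs:
        ham_counts.update(tokenize(m))

    nspam, nham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = spam_counts[tok] / nspam      # fraction of spam containing tok
        h = ham_counts[tok] / nham        # fraction of ham containing tok
        probs[tok] = s / (s + h)          # 1.0 = only in spam, 0.0 = only in ham
    return probs


if __name__ == "__main__":
    spam = ["buy cheap pills now", "cheap pills cheap"]
    ham = ["meeting notes for now", "lunch now"]
    for tok, p in sorted(train(spam, ham).items()):
        print(f"{tok:10s} {p:.2f}")
```

A new clue only needs a tokeniser change that emits a new token; whether
it turns out to be useful is then decided by the corpus, not by a
hand-picked weight.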

> Sorry, I don't want to demean any of your work, but we need to work 
> together to fight spam, and I'd rather not see so much time wasted on 
> individual clues when SpamAssassin already extracts about 800 of them!

The problem with SA for at least one of the applications I have is that
it's way, way too aggressive. My monster corpus is the main contact email
for the company I work for. SA kicks out far too many legitimate 
commercial email messages. But that mailbox gets (going by the last week) 
something like 200 spams a day - probably more. Sifting through SA's 
hits looking for the real messages is too much work.

If there is a list of existing tokenisation clues we can work from,
excellent! I know I won't mind re-using someone else's hard-won experience
in this area. :)

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.