[Spambayes] Tokenising clues
Anthony Baxter
anthony@interlink.com.au
Wed, 02 Oct 2002 00:22:16 +1000
>>> Matt Sergeant wrote
> This seems like a vast waste of your time to me. There's a couple of
> projects out there that have already spent vast amounts of time and
> programming effort into figuring out these other clues that spambayes
> misses out on. Rather than repeating that work, why not just rip all the
> rules out of SpamAssassin or some other spam checking project wholesale,
> and stuff those into your database?
The problems are that
- many of the existing tools take the form "if this header says _this_,
it indicates spamminess of -this- much". The stuff here is more about
finding clues that work without having to produce magic numbers for
what a particular header value means.
- a lot of the problems come from the testing corpuses (yes, I know
the word is corpora, corpuses looks cooler :) and their mixed nature.
This rules out a bunch of "obvious" tricks.
- SpamAssassin, in particular, is written in Perl. I tried looking
through it to grok clues and started having twitches and convulsions.
Been through the Perl horror, not going back :)
I couldn't find a simple doco of "here's what SA looks at" in the docs.
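
To make the first point above concrete, here is a minimal sketch of
learning what a header value means from training mail instead of
assigning it a hand-picked score. This is not the actual spambayes
tokenizer: the function names, the "name:word" token format and the
add-one smoothing are all assumptions made up for illustration.

```python
from collections import Counter


def header_tokens(headers):
    """Turn (name, value) header pairs into tokens like 'x-mailer:mutt'."""
    tokens = set()
    for name, value in headers:
        for word in value.lower().split():
            tokens.add(f"{name.lower()}:{word}")
    return tokens


def train(messages):
    """Count, per token, how many spam and ham messages contain it.

    `messages` is an iterable of (headers, is_spam) pairs.
    """
    spam_counts, ham_counts = Counter(), Counter()
    nspam = nham = 0
    for headers, is_spam in messages:
        toks = header_tokens(headers)
        if is_spam:
            nspam += 1
            spam_counts.update(toks)
        else:
            nham += 1
            ham_counts.update(toks)
    return spam_counts, ham_counts, nspam, nham


def spam_prob(token, spam_counts, ham_counts, nspam, nham):
    """Estimate how spammy a token is from the counts alone.

    Add-one smoothing (an illustrative choice) keeps tokens that were
    seen in only one class away from exactly 0.0 or 1.0.
    """
    s = (spam_counts[token] + 1) / (nspam + 2)
    h = (ham_counts[token] + 1) / (nham + 2)
    return s / (s + h)
```

Nobody decides that a given X-Mailer value is worth some fixed number
of points; whatever weight it ends up with comes from the mail that was
trained on, so a mixed corpus produces different weights than a
personal mailbox would.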
> Sorry, I don't want to demean any of your work, but we need to work
> together to fight spam, and I'd rather not see so much time wasted on
> individual clues when SpamAssassin already extracts about 800 of them!
The problem with SA for at least one of the applications I have is that
it's way, way too aggressive. My monster corpus is the main contact email
for the company I work for. SA kicks out far too many legitimate
commercial email messages. But that mailbox has been getting (over the
last week) something like 200 spams a day - probably more. Sifting
through the hits looking for the real posts is too much work.
If there is a list of existing tokenisation clues we can work from,
excellent! I know I won't mind re-using someone else's hard-won experience
in this area. :)
Anthony
--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.