[Spambayes] Tokenising clues

Tue, 01 Oct 2002 15:41:56 +0100

Anthony Baxter wrote:
>>>>Matt Sergeant wrote
>>>
>>This seems like a vast waste of your time to me. There's a couple of 
>>projects out there that have already spent vast amounts of time and 
>>programming effort into figuring out these other clues that spambayes 
>>misses out on. Rather than repeating that work, why not just rip all the 
>>rules out of SpamAssassin or some other spam checking project wholesale, 
>>and stuff those into your database?
> 
> 
> The problems are that
> 
>   - many of the existing tools are of the "if this header says _this_,
>     it indicates spamminess of -this- much". The stuff here is more
>     trying to work out answers that work without having to try and 
>     produce magic numbers for what a particular header value means.

The scoring is independent of the matching. The scoring is merely a 
by-product of running the matches through the genetic algorithm - in 
order to feed that genetic algorithm we have to not care what the score 
is (as that's prior knowledge, thus bad).

>   - a lot of the problems are from the testing corpuses (yes, I know
>     the word is corpora, corpuses looks cooler :) and the mixed nature
>     of them. This rules out a bunch of "obvious" tricks.

I'm suggesting this as an extension of what you do, not a replacement. 
You've already got accurate code, but it seems that SpamAssassin was 
able to get clues from your false negatives that word tokenisation had 
missed. The very nature of what you're doing means that if the SA rules 
aren't as accurate as the tokens you already find in an email, they 
simply won't matter. It's just that little bit more information.
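
To make "extension, not replacement" concrete, here's a rough sketch in 
Python of feeding rule hits into a tokeniser as extra tokens. Everything 
in it is made up for illustration: the two rules, their names, the 
"rule:" prefix and the tokenize_with_rules() function are assumptions, 
not SpamAssassin rules or the spambayes tokeniser API.

```python
import re

# Hypothetical rules: name -> pattern applied to the raw message text.
RULES = {
    "SUBJECT_HAS_FREE": re.compile(r"^Subject:.*\bfree\b", re.I | re.M),
    "MENTIONS_UNSUBSCRIBE": re.compile(r"\bunsubscribe\b", re.I),
}


def tokenize_with_rules(text, rules=RULES):
    """Yield plain word tokens, then one synthetic token per matching rule."""
    for word in text.split():
        yield word.lower()
    for name, pattern in sorted(rules.items()):
        if pattern.search(text):
            # The classifier sees this like any other token and learns
            # its weight from training data, so no hand-set score is needed.
            yield "rule:" + name
```

A rule that turns out to be a poor indicator just ends up as a token 
with a probability near 0.5, and gets ignored.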

>   - spamassassin, in particular, is written in perl. I tried looking
>     through it to grok clues and started having twitches and convulsions.
>     Been through the perl horror, not going back :) 
>     I couldn't find a simple doco of "here's what SA looks at" in the docs.

Check the rules/ directory. You can read regexps, I assume. That's all 
SpamAssassin is - a big regexp engine. There are rules that run code (we 
call them eval tests), but most of them aren't that complex. For example, 
a rule that uses eval:subject_is_all_caps() will run:

sub subject_is_all_caps {
    my ($self) = @_;
    my $subject = $self->get('Subject');

    $subject =~ s/^\s+//;
    $subject =~ s/\s+$//;
    return 0 if $subject !~ /\s/;	# don't match one word subjects
    return 0 if (length $subject < 10);  # don't match short subjects
    $subject =~ s/[^a-zA-Z]//g;		# only look at letters
    return length($subject) && ($subject eq uc($subject));
}

If you change all the arrows to dots, and remove all the dollars, 
semi-colons and curly brackets, you get:

sub subject_is_all_caps
    subject = self.get('Subject')

    subject =~ s/^\s+//
    subject =~ s/\s+$//
    return 0 if subject !~ /\s/	# don't match one word subjects
    return 0 if (length subject < 10)  # don't match short subjects
    subject =~ s/[^a-zA-Z]//g		# only look at letters
    return length(subject) && (subject eq uc(subject))

It's almost like python! ;-)
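
Joking aside, a real Python version isn't much longer. This is a sketch 
of the eval test quoted above, assuming the Subject header has already 
been pulled out as a plain string (the Perl version fetches it from the 
message object itself):

```python
import re


def subject_is_all_caps(subject):
    """Return True if a multi-word subject of 10+ chars is all upper case."""
    subject = subject.strip()
    if not re.search(r"\s", subject):   # don't match one-word subjects
        return False
    if len(subject) < 10:               # don't match short subjects
        return False
    letters = re.sub(r"[^a-zA-Z]", "", subject)  # only look at letters
    return bool(letters) and letters == letters.upper()
```

The result could be used as one more yes/no clue alongside the word 
tokens.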

>>Sorry, I don't want to demean any of your work, but we need to work 
>>together to fight spam, and I'd rather not see so much time wasted on 
>>individual clues when SpamAssassin already extracts about 800 of them!
> 
> The problem with SA for at least one of the applications I have is that
> it's way, way too aggressive.

So up your threshold, or train it yourself. Isn't that what you're doing 
with spambayes?

> My monster corpus is the main contact email
> for the company I work for. SA kicks out far too many legitimate 
> commercial email messages. But that mailbox gets (in the last week) 
> something like 200 spams a day - probably more. Sifting through the 
> hits looking for the real posts is too much work.
> 
> If there is a list of existing tokenisation clues we can work from,
> excellent! I know I won't mind re-using someone else's hard-won experience
> in this area. :)

Yep, check the rules/ directory. Particularly the 20_* files, which are 
the header, body and rawbody rules (don't worry about the distinction 
between body and rawbody for now - it's really rather bogus ;-)
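
If you'd rather pull the clues out mechanically than read the files, 
something like the sketch below would list the header, body and rawbody 
tests. The sample rules are made up in the style of a rules file, and 
parse_rules() is a hypothetical helper. It only splits each line into 
(kind, name, test) and makes no attempt to translate Perl regexps into 
Python ones, which won't always be possible.

```python
# Made-up rules in the style of a SpamAssassin rules file.
SAMPLE = r"""
# comment lines and other directives are skipped
header   EXAMPLE_ALL_CAPS   eval:subject_is_all_caps()
header   EXAMPLE_FROM_NUMS  From =~ /\d\d\@/
body     EXAMPLE_REMOVE     /to be removed/i
describe EXAMPLE_REMOVE     Mentions removal
rawbody  EXAMPLE_COMMENT    /<!--/
"""


def parse_rules(text):
    """Return a list of (kind, name, test) for header/body/rawbody lines."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 2)  # kind, rule name, rest of the line
        if len(parts) == 3 and parts[0] in ("header", "body", "rawbody"):
            rules.append(tuple(parts))
    return rules


for kind, name, test in parse_rules(SAMPLE):
    print(kind, name, test)
```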

Matt.