[Spambayes] Tony Meyer - Training question

Sun Sep 18 12:31:21 CEST 2005

> Enabling all the options is also not something I'd recommend.  As  
> part of the 2005 TREC spam track, one of the SpamBayes runs submitted  
> included enabling all boolean options (except the slurping ones).   
> Results from TREC aren't complete yet, but initial testing indicates  
> that this run performs worse than running with defaults.

Outstanding information!  When will the TREC test data be available for
review concerning SpamBayes and the different runs with different options?

> x-reduce_habeas_headers: True
> x-search_for_habeas_headers: True

> It's pretty clear that Habeas's headers are a failed experiment.   
> These options probably aren't worth including, and are likely to be  
> removed in a future release.

Thanks for the info.

> basic_header_tokenize: True
> basic_header_skip: date x-.* domainkey-signature

> Testing hasn't shown that basic_header_tokenize is a good idea.  Is  
> there a reason you turned it on?

Yes.  I enabled this option because I am filtering essentially two
different POP3 accounts (one is forwarded to my main account).  One is
virtually all spam (my client), the other all ham (my personal account), and
I've found this beneficial to my personal mail stream.  Before I enabled
this option, when a ham source forwarded a spam, it was classified as spam.
With header tokenizing on, this is no longer the case.

I've also found that this helps tremendously with nailing phishing scams as
well.

> address_headers: from sender reply-to errors-to

> I don't have any testing to hand about this, but I doubt that  
> removing "to" and "cc" from the headers that are tokenized is a good  
> idea.  For me, at least, the data in the "to" and "cc" headers is  
> definitely a good indicator of whether the message is ham/spam; I  
> would expect this would be the case for many people.  Adding errors- 
> to might help; I don't know if any testing has been done on that.
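
To illustrate why the choice of address headers matters, here is a minimal
Python sketch of turning address headers into tokens.  It is not the
SpamBayes tokenizer: the function name address_header_tokens and the token
format ("field:addr:...", "field:name:...", "field:none") are assumptions
made up for this example, and the headers tuple is just a sample list.

```python
from email import message_from_string
from email.utils import getaddresses


def address_header_tokens(msg, headers=("from", "to", "cc")):
    """Yield one token per address part found in the given headers.

    Hypothetical token format, for illustration only.
    """
    for field in headers:
        values = msg.get_all(field, [])
        if not values:
            # A missing header can itself be a clue, so emit a marker for it.
            yield "%s:none" % field
            continue
        for name, addr in getaddresses(values):
            if addr:
                yield "%s:addr:%s" % (field, addr.lower())
            if name:
                yield "%s:name:%s" % (field, name.lower())


raw = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.org\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
tokens = list(address_header_tokens(message_from_string(raw)))
```

Dropping "to" and "cc" from the list simply means those tokens are never
generated, so the classifier cannot learn from them.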

> generate_long_skips: True

Ya, I found out later that this option did nothing.  As soon as you enable
basic_header_tokenize: True it will add "to" and "cc" etc., which I found is
a really good indicator, like you said.  Does this mean I can take this out?

> This is the default; it will have no effect.

> skip_max_word_size: 50

> I believe that (in the early days) there was a lot of testing to  
> determine what the best minimum and maximum token sizes were.  50 is  
> a *lot* bigger than the default 12 - do you really have many strong  
> tokens longer than 12?

LOL!  I set it to 50 because I read some advice on the mailing list that
the longest English word is close to 50 characters, and that some good clues
might be had if the limit were set higher.  I've noticed that some of the
words in the "spammer salad" meant to throw off filters are longer than 12
characters.  Thoughts on this?
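
For readers following along, a rough Python sketch of how a maximum word
size and a long-skip option can interact is below.  It is an approximation
for illustration, not the actual SpamBayes code: the function name
tokenize_word_sketch, the lower bound of 3 characters, and the exact
"skip:" token format are assumptions, and the real tokenizer does more
(for example, special handling of some long strings).

```python
def tokenize_word_sketch(word, max_word_size=12, generate_long_skips=True):
    """Yield the token(s) a single word would produce in this sketch."""
    n = len(word)
    if n < 3:
        # Assumed lower bound: very short words are dropped.
        return
    if n <= max_word_size:
        # Words within the limit are kept verbatim as tokens.
        yield word
    elif generate_long_skips:
        # Longer words collapse into a summary token: first character
        # plus the length rounded down to a multiple of 10.
        yield "skip:%c %d" % (word[0], n // 10 * 10)


long_word = "x" * 23
with_default = list(tokenize_word_sketch(long_word))
with_fifty = list(tokenize_word_sketch(long_word, max_word_size=50))
```

With the limit at 12, the 23-character word becomes a single summary
token; with the limit at 50 it is kept verbatim, so every distinct long
"salad" word becomes its own token in the database.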

> [URLRetriever]
>
> x-cache_directory: url-cache
> x-cache_expiry_days: 31
> x-only_slurp_base: True
> x-slurp_urls: True
> x-web_prefix: web:

> I would not recommend enabling these without understanding what they  
> do.  The main issue is that as a result of enabling them, SpamBayes  
> will be downloading a lot of extra material - for those where  
> connection speed or bandwidth are issues, this might not be a good  
> step.  It's also not at all clear that they are beneficial - without  
> the only_slurp_base option, testing generally indicates good results,  
> but that means that any 'bugs' will be triggered.  With the  
> only_slurp_base option, results are mixed, leaning towards negative.

Again, thank you for the info!

Erik Brown