[spambayes-dev] some tokenising ideas for someone who wants to experiment

Anthony Baxter anthony at interlink.com.au
Thu Jun 16 09:05:05 CEST 2005


Here are a couple of ideas for tokenising/scoring messages that someone might 
like to experiment with. I have no time in the next few months, but if I send 
them here, they won't just disappear into the vagaries of my long-term 
memory. <wink>

multipart/alternative:

   When confronted by a multipart/alternative, score each alternative 
separately and keep only the highest score. Discard the scores from the 
lower-scoring part(s). I'm seeing a _lot_ of spam with pure wordsalad in the 
text/plain part, and the actual spam text in the text/html part only. 
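
   A rough sketch of the idea, for anyone who wants a starting point. It 
uses only the stdlib email package. The names part_text, score_message and 
toy_score are made up for illustration and are not part of spambayes; 
score_text stands in for whatever callable maps text to a spam probability. 
It only handles a top-level multipart/alternative; anything else is scored 
as one lump.

```python
from email import message_from_string


def part_text(part):
    """Concatenate the decoded text/* leaves found under one MIME part."""
    chunks = []
    for leaf in part.walk():
        if leaf.get_content_maintype() != "text":
            continue
        payload = leaf.get_payload(decode=True)
        if payload is None:
            continue
        charset = leaf.get_content_charset() or "latin-1"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name in the message
            chunks.append(payload.decode("latin-1", errors="replace"))
    return "\n".join(chunks)


def score_message(msg, score_text):
    """Score each alternative separately and keep only the highest score."""
    if msg.get_content_type() == "multipart/alternative" and msg.is_multipart():
        scores = [score_text(part_text(alt)) for alt in msg.get_payload()]
        if scores:
            return max(scores)
    return score_text(part_text(msg))


# Hypothetical stand-in for a real classifier, just to make the demo run.
def toy_score(text):
    return 0.99 if "pills" in text.lower() else 0.01


RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XX"

--XX
Content-Type: text/plain

lovely weather gardening recipe
--XX
Content-Type: text/html

<html><body>buy cheap pills now</body></html>
--XX--
"""

print(score_message(message_from_string(RAW), toy_score))
```

   With the toy scorer, the wordsalad text/plain part scores low, the 
text/html part scores high, and only the high score is kept.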

stylesheet interpretation:

   There are probably some moderate wins to be had in parsing (to a small 
degree) inline CSS in text/html, at least to remove the content which has 
been styled 'hidden'.
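
   A minimal sketch of what "to a small degree" could look like, using the 
stdlib html.parser. It only looks at inline style="..." attributes and only 
recognises display:none and visibility:hidden; <style> blocks, font-size:0, 
text coloured to match the background and so on are not handled. It also 
assumes hidden elements are properly closed, which real spam often isn't. 
VisibleTextExtractor and visible_text are illustrative names, not existing 
spambayes code.

```python
import re
from html.parser import HTMLParser

# Inline-style declarations treated as "hidden" (an assumption of this sketch).
HIDDEN_RE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

# Elements with no end tag; they must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # Once inside a hidden element, every nested element is hidden too.
        if self.hidden_depth or HIDDEN_RE.search(style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the text of html with inline-hidden elements dropped."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with spaces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


sample = ('<p>Buy now</p>'
          '<span style="display: none">lovely weather gardening</span>'
          '<div style="VISIBILITY:hidden">more <b>salad</b></div>'
          '<p>cheap pills</p>')
print(visible_text(sample))  # prints: Buy now cheap pills
```

   The visible text could then be fed to the tokeniser in place of the raw 
HTML body. Whether the mere presence of hidden content should also produce 
a token of its own is another thing to experiment with.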

Got your own ideas for tokenising tricks that are worth trying? Post them, and 
we can collect them somewhere for people who want to experiment... 

-- 
Anthony Baxter     <anthony at interlink.com.au>
It's never too late to have a happy childhood.

