[Spambayes] Spam Clues: test

Thu Jan 27 06:17:10 CET 2005

> mail from this sender always goes to my junk suspects,
> though I ALWAYS mark it to "recover from spam"-  any
> suggestions as to how to make it not be spam?
> 
> Combined Score: 38% (0.384155)
> Internal ham score (*H*): 0.891505
> Internal spam score (*S*): 0.659815
> 
> # ham trained on: 23653
> # spam trained on: 1734
> 
> 6 Significant Tokens
> token                               spamprob         #ham  #spam
> 'to:addr:dave'                      0.00140231        160      0
> 'message'                           0.235257         7317    165
> 'from:none'                         0.616347          280     33
> 'message-id:invalid'                0.727912          280     55
> 'subject:test'                      0.799655           13      4
> 'to:no real name:2**0'              0.923187         1314   1159
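
For anyone following the figures quoted above: the combined score appears to be derived from the two internal scores as (S - H + 1) / 2. The sketch below is only a check of that arithmetic against the quoted numbers; the function name is made up for illustration and is not a SpamBayes API.

```python
def combined_score(ham_score, spam_score):
    """Fold the internal ham (*H*) and spam (*S*) scores into one value.

    Assumption: the combined score is (S - H + 1) / 2, which is what
    the quoted figures are consistent with.
    """
    return (spam_score - ham_score + 1.0) / 2.0


# Values copied from the quoted clue report.
score = combined_score(ham_score=0.891505, spam_score=0.659815)
print(f"Combined score: {score:.6f}")
```

With both internal scores fairly high, the result lands in the middle of the range, which is why the message ends up as unsure rather than as clear ham or clear spam.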

Very short messages are very difficult, because there is not much for
SpamBayes to work with.  Messages that never travel through the Internet
(Exchange only ones like this) are especially difficult, because there are
no headers to generate tokens from.  It doesn't help right now, but the 1.1
SpamBayes release tries harder to generate tokens from the Exchange data
available, which should help a bit.

Right now, however, the thing that would help the most is to retrain from
scratch.  We recommend that people try to keep their database roughly
balanced between ham and spam (yours is about 14:1).  If the database is
significantly imbalanced you get oddities like the 'to:no real name:2**0'
token, which has been seen in more ham than spam, but is a significant spam
clue (because comparatively little spam has been seen).  With a balanced
database all the significant tokens in that message would have been ham
clues (it probably would have been a solid 0%).  (25000-odd messages is also
a fairly large database - we find that people generally get good results
with smaller databases, say under 1000 messages.)
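
To make the imbalance effect concrete, here is a rough sketch of how a per-token spam probability can be computed from the counts in the clue report. The ratio-then-adjust form and the constants S = 0.45 and X = 0.5 are assumptions on my part (they are not stated anywhere in this message), though they do reproduce the quoted figures. The "balanced" case simply pretends as many spam as ham had been trained while keeping the token counts fixed, which is artificial.

```python
# Assumed constants: strength of the prior and the probability
# assigned to a token that has never been seen.
S = 0.45
X = 0.5


def spamprob(ham_count, spam_count, nham, nspam):
    """Estimate a token's spam probability from training counts."""
    # Compare the fraction of ham containing the token with the
    # fraction of spam containing it, not the raw counts.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    prob = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the estimate towards X when there is little evidence.
    n = ham_count + spam_count
    return (S * X + n * prob) / (S + n)


NHAM, NSPAM = 23653, 1734  # training totals from the report

# 'to:no real name:2**0' was seen in 1314 ham and 1159 spam.
actual = spamprob(1314, 1159, NHAM, NSPAM)
# Hypothetical: same token counts, but as much spam trained as ham.
balanced = spamprob(1314, 1159, NHAM, NHAM)

print(f"as trained (about 14:1): {actual:.6f}")
print(f"hypothetical 1:1:        {balanced:.6f}")
```

Because 1159 is a large share of only 1734 spam while 1314 is a small share of 23653 ham, the token scores as a strong spam clue; with equal totals the same counts come out just under 0.5.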

There's a lot of information about training techniques at
<http://entrian.com/sbwiki/TrainingIdeas>, but the simplest one to use with
Outlook is:

 * Remove your existing database (if you like, simply rename it and then you
can always revert to it if you want).

 * Only train on messages that end up in your unsure folder, good messages
that end up in the spam folder, and spam messages that stay in a (watched)
good folder.  If, after a while, only spam messages end up in your unsure
folder, consider lowering the spam threshold (say to 80%, or maybe 70%).
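
The training rule in the second step can be written down as a small decision function. This is only an illustration of the rule as described: the folder names and the function are hypothetical and have nothing to do with how the Outlook plug-in is actually implemented.

```python
def training_action(folder, is_spam):
    """Decide whether to train on a message, and as what.

    folder:  where the filter put the message; "good", "unsure" or
             "spam" (hypothetical names).
    is_spam: what the message really is, as judged by the user.
    Returns "ham", "spam", or None when no training is needed.
    """
    if folder == "unsure":
        # Everything the filter was unsure about gets trained.
        return "spam" if is_spam else "ham"
    if folder == "spam" and not is_spam:
        # Good message wrongly filed as spam.
        return "ham"
    if folder == "good" and is_spam:
        # Spam that slipped through into a watched good folder.
        return "spam"
    # Correctly classified messages are left alone.
    return None


print(training_action("unsure", is_spam=False))
print(training_action("spam", is_spam=True))
```

Leaving correctly classified messages untrained is what keeps the database small and stops it drifting towards a heavy ham/spam imbalance.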

I hope this is of use!

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.