[Spambayes] Re: Training empty messages problem

Tue Dec 7 03:33:42 CET 2004

> This looks like the best solution for me for now. I tried
> looking up some information about the safe_headers option in 
> the [Tokenizer] section but couldn't find none.

It's not particularly easy for Outlook users to find, since there isn't
generally any need for them to know.

> What exactly
> is the syntax for adding this option to the tokenizer section 
> in my Outlook.ini file in the data folder?

Firstly, you need to modify the "default_bayes_customize.ini" file in the
data folder, not the "Outlook.ini" one.  If the file doesn't exist (we
stopped adding it by default a few versions back), just create it with
Notepad or a similar text editor.  At the end of the file, add the lines:

"""
[Tokenizer]
safe_headers:abuse-reports-to,date,errors-to,from,importance,in-reply-to,message-id,mime-version,organization,received,reply-to,return-path,subject,to,user-agent,x-abuse-info,x-complaints-to,x-face,x-exchange-message
"""

Don't include the """'s, and make sure the safe_headers setting stays on
one (very long) line.  These are all the default headers, plus the
additional x-exchange-message one.
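
If you'd rather not paste a very long line by hand, the section can be
generated instead.  This is only a rough Python sketch: the function names
(tokenizer_section, append_section) are made up for illustration and are
not part of SpamBayes, and you have to supply the path to your own data
folder.  The header list is copied from the lines above.

```python
# Headers from the message above; the last one is the addition.
SAFE_HEADERS = [
    "abuse-reports-to", "date", "errors-to", "from", "importance",
    "in-reply-to", "message-id", "mime-version", "organization",
    "received", "reply-to", "return-path", "subject", "to",
    "user-agent", "x-abuse-info", "x-complaints-to", "x-face",
    "x-exchange-message",
]


def tokenizer_section(headers):
    """Return the [Tokenizer] section with safe_headers on a single line."""
    return "[Tokenizer]\nsafe_headers:" + ",".join(headers) + "\n"


def append_section(ini_path, headers=SAFE_HEADERS):
    """Append the section to ini_path, creating the file if it is missing."""
    # Hypothetical helper: the leading newline keeps the section header on
    # its own line even if the existing file lacks a trailing newline.
    with open(ini_path, "a") as f:
        f.write("\n" + tokenizer_section(headers))
```

Call append_section() with the full path to "default_bayes_customize.ini"
while Outlook is closed, and run it only once, since it appends without
checking whether a [Tokenizer] section is already there.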

> After I've added
> this option I have to retrain on all of my ham and spam message?

It's up to you.

If you do, then the token ('header:X-Exchange-Message:1') will reflect the
proportion of all trained messages that are Exchange only and spam (I would
imagine it'd have a very low score).

If you don't, then the token will be like any other new token, and score
exactly 0.5 (unused in classification).  As you train new Exchange only
messages, the token's score will be adjusted appropriately, until it's
strong enough to be useful in classification (not long if there are barely
any other tokens for a particular message).  Eventually (assuming no change
in email pattern) the score will approach the score that it would have had
if you had done a complete retrain.
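
To give a feel for how quickly a new token moves away from 0.5, here is a
rough Python sketch of a Robinson-style adjusted token probability, which
is my understanding of the kind of calculation the classifier uses.  The
two constants are assumed defaults (check your own configuration), and
token_score is a name invented for this example, not a SpamBayes function.

```python
# Assumed values; your configuration may differ.
UNKNOWN_WORD_PROB = 0.5       # score given to a token never seen in training
UNKNOWN_WORD_STRENGTH = 0.45  # how strongly that prior resists new evidence


def token_score(spamcount, hamcount, nspam, nham):
    """Estimate a token's spam probability from training counts.

    spamcount/hamcount: trained spam/ham messages containing the token.
    nspam/nham: total trained spam/ham messages.
    """
    n = spamcount + hamcount
    if n == 0:
        # Never trained on: exactly the "unknown" score.
        return UNKNOWN_WORD_PROB
    spamratio = spamcount / float(nspam or 1)
    hamratio = hamcount / float(nham or 1)
    raw = spamratio / (spamratio + hamratio)
    # Blend the raw ratio with the prior, weighted by the evidence seen.
    s, x = UNKNOWN_WORD_STRENGTH, UNKNOWN_WORD_PROB
    return (s * x + n * raw) / (s + n)


if __name__ == "__main__":
    # Token seen only in ham, with 100 ham and 100 spam trained overall.
    for hams in range(6):
        print(hams, round(token_score(0, hams, 100, 100), 3))
```

Under these assumptions, a handful of trained ham messages carrying the
token is enough to pull its score well below 0.5.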

Note that I'm still not sure what's causing the "message-id:invalid" token
to be so strongly ham, which is also affecting the classification of these
'empty' messages.

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.