[Spambayes] Spam Clues: Re: index.cgi redirection
Kenny Pitt
kennypitt at hotmail.com
Wed Nov 12 11:13:25 EST 2003
Grützmacher, Lukas wrote:
> Even I have trained SpamBayes with many (over 100) mails from this
> list as good the most of them are identified as "possible spam". I'm
> currently not able to understand why.
>
> 1) Can you explain me what parts of the Spam Clues are calculated to
> reach the 0.386739 (below) for the mail ? (I could not found any
> description in the documentation !?) 2) Is it a problem of SpamBayes
> or of the list or of my configuration ?
>
> Spam Score: 39% (0.386739)
>
>
> word spamprob #ham #spam
> 'proto:http' 0.614138 989 127
> 'can' 0.61691 415 54
> 'are' 0.631922 418 58
> 'you' 0.63923 601 86
> 'header:Date:1' 0.647151 912 135
> 'header:From:1' 0.647151 912 135
> 'header:Return-Path:1' 0.653197 888 135
> 'header:Message-ID:1' 0.656727 738 114
> 'to:no real name:2**0' 0.679622 677 116
> 'header:Received:3' 0.812182 236 83
How many total hams and spams have you trained on? The clues I left
above particularly stood out to me because they have relatively high
spam probabilities even though their ham counts are much higher than
their spam counts. This usually indicates unbalanced training data,
where you have trained on many more messages of one type than the
other. In this case, I would guess several thousand hams vs. only a
couple hundred spams.

Unbalanced training data can cause accuracy problems, and in particular
can make it difficult for additional training to overcome the effects of
words that appear in both ham and spam. All probabilities are based on
ratios, not absolute numbers. For a given word, the raw ham ratio is
the number of times the word has been seen in a ham message divided by
the total number of ham messages that have been trained. The raw spam
ratio is computed the same way, and then the two ratios are combined to
form the spamprob for that word. If you have trained on 2000 ham
messages, then a word that has appeared 100 times would have a raw ham
ratio of 0.05 (100 / 2000). If you have only trained on 200 spam
messages, then it only takes 10 occurrences of the word in spam to get
the same 0.05 ratio (10 / 200).
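
To make the arithmetic concrete, here is a minimal sketch of the ratio
calculation described above. The function names are made up for
illustration, and the combining step (spam ratio divided by the sum of
both ratios) is an assumption about how the two ratios are merged. As far
as I know, the real classifier also applies an adjustment for rarely seen
words, which this sketch leaves out.

```python
def raw_ratio(count, total_trained):
    """Fraction of trained messages of one type in which the word was seen."""
    return count / total_trained if total_trained else 0.0


def spamprob(ham_count, spam_count, nham, nspam):
    """Combine the raw ham and spam ratios into one probability.

    Assumed combining rule: spam_ratio / (ham_ratio + spam_ratio).
    A word never seen in either type is treated as neutral (0.5).
    """
    ham_ratio = raw_ratio(ham_count, nham)
    spam_ratio = raw_ratio(spam_count, nspam)
    if ham_ratio + spam_ratio == 0:
        return 0.5
    return spam_ratio / (ham_ratio + spam_ratio)


def implied_ham_to_spam(prob, ham_count, spam_count):
    """Invert the assumed rule to estimate nham / nspam from a single clue.

    prob / (1 - prob) = (spam_count / nspam) / (ham_count / nham)
    """
    return prob / (1.0 - prob) * ham_count / spam_count


# The example from the text: 100 of 2000 hams vs. 10 of 200 spams.
balanced_looking = spamprob(100, 10, nham=2000, nspam=200)

# The 'header:Date:1' clue from the quoted message: 912 ham, 135 spam.
imbalance = implied_ham_to_spam(0.647151, ham_count=912, spam_count=135)

# Same counts, but pretending equal numbers of ham and spam were trained
# (1000 each is an arbitrary figure chosen for the comparison).
if_balanced = spamprob(912, 135, nham=1000, nspam=1000)

print(balanced_looking, imbalance, if_balanced)
```

Under these assumptions the word from the example scores exactly 0.5,
and the 'header:Date:1' clue implies roughly 12 trained hams for every
trained spam. With the same counts and equal training totals, that clue
would score well below 0.5 instead of 0.65.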
--
Kenny Pitt