[Spambayes] Spam Clues: Re: index.cgi redirection
"Grützmacher, Lukas"
gruetzmacher at ais-dresden.de
Wed Nov 12 11:27:34 EST 2003
The SpamBayes Manager reports the training status as about 1600 ham and 135 spam mails, although I think I had more spam mails than that.
Do I understand you correctly: because I have more ham than spam mails, my training becomes unbalanced?
Lukas
> -----Original Message-----
> From: Kenny Pitt [mailto:kennypitt at hotmail.com]
> Sent: Wednesday, November 12, 2003 5:13 PM
> To: Grützmacher, Lukas; spambayes at python.org
> Subject: RE: [Spambayes] Spam Clues: Re: index.cgi redirection
>
>
> Grützmacher, Lukas wrote:
> > Even though I have trained SpamBayes with many (over 100) mails from
> > this list as good, most of them are identified as "possible spam". I
> > currently can't understand why.
> >
> > 1) Can you explain to me which parts of the Spam Clues are used to
> >    calculate the 0.386739 (below) for the mail? (I could not find any
> >    description in the documentation.)
> > 2) Is it a problem with SpamBayes, with the list, or with my
> >    configuration?
> >
> > Spam Score: 39% (0.386739)
> >
> >
> > word                     spamprob   #ham   #spam
> > 'proto:http'             0.614138    989     127
> > 'can'                    0.61691     415      54
> > 'are'                    0.631922    418      58
> > 'you'                    0.63923     601      86
> > 'header:Date:1'          0.647151    912     135
> > 'header:From:1'          0.647151    912     135
> > 'header:Return-Path:1'   0.653197    888     135
> > 'header:Message-ID:1'    0.656727    738     114
> > 'to:no real name:2**0'   0.679622    677     116
> > 'header:Received:3'      0.812182    236      83
>
> How many total hams and spams have you trained on? The clues I left
> above particularly stood out to me because you are getting relatively
> high spam probabilities even though the ham counts are much higher than
> the spam counts. This usually indicates that you have unbalanced
> training data, where you have a lot more messages of one type than the
> other. In this case, I would guess several thousand hams vs. only a
> couple hundred spams.
>
> Unbalanced training data can cause accuracy problems, and in particular
> can make it difficult for additional training to overcome the effects of
> words that appear in both ham and spam. All probabilities are based on
> ratios, not absolute numbers. For a given word, the raw ham ratio is
> the number of times the word has been seen in a ham message divided by
> the total number of ham messages that have been trained. The raw spam
> ratio is computed the same way, and then the two ratios are combined to
> form the spamprob for that word. If you have trained on 2000 ham
> messages, then a word that has appeared 100 times would have a raw ham
> score of 0.05. If you have only trained on 200 spam messages, then it
> only takes 10 occurrences of the word in spam to get the same 0.05 score.
>
> --
> Kenny Pitt
>
>
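
The ratio arithmetic described in the quoted reply can be sketched in a few lines of Python. This is only an illustration, not SpamBayes code: the function names raw_ratio and spamprob are made up here, and the way the two ratios are combined (spam ratio divided by the sum of both ratios) is an assumption, since the reply only says they are "combined". The real classifier may apply further adjustments that this sketch leaves out.

```python
def raw_ratio(word_count, total_messages):
    """Fraction of trained messages of one type that contain the word."""
    if total_messages <= 0:
        return 0.0
    return word_count / total_messages


def spamprob(ham_count, spam_count, total_ham, total_spam):
    """Combine the raw ham and spam ratios into one probability.

    Assumption: spam_ratio / (ham_ratio + spam_ratio). The email does not
    spell out the exact combination, so treat this as a plausible stand-in.
    """
    ham_ratio = raw_ratio(ham_count, total_ham)
    spam_ratio = raw_ratio(spam_count, total_spam)
    if ham_ratio + spam_ratio == 0:
        return 0.5  # word never seen: no evidence either way
    return spam_ratio / (ham_ratio + spam_ratio)


# The example from the reply: 100 of 2000 hams vs. 10 of 200 spams.
print(raw_ratio(100, 2000))           # 0.05
print(raw_ratio(10, 200))             # 0.05
print(spamprob(100, 10, 2000, 200))   # 0.5

# 'header:Date:1' from the clue list, using the approximate totals
# reported by the SpamBayes Manager (about 1600 ham, 135 spam).
print(round(spamprob(912, 135, 1600, 135), 4))
```

With those approximate totals, the last line lands near the 0.647151 shown in the clue list but not exactly on it, which is expected given that the totals are rounded and the sketch is simplified. It does show the effect in question: a word seen in 912 hams and only 135 spams still leans towards spam, because 135 out of 135 spams is a much larger ratio than 912 out of 1600 hams.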