[Spambayes] Ping: subject header ignored? [was: Not mining mySubject headers?]
David Abrahams
dave at boost-consulting.com
Tue Feb 6 21:09:57 CET 2007
"Seth Goodman" <sethg at goodmanassociates.com> writes:
> David Abrahams wrote on Tuesday, February 06, 2007 11:05 AM -0600:
>
>> David Abrahams <dave at boost-consulting.com> writes:
>>
>> > How is it that for a message with
>> >
>> > Subject: Huge online pharmacy
>> >
>> > Spambayes isn't using "pharmacy" as a classification token? I can't
>> > find a setting that will make it do that, either.
>>
>> Am I just misinterpreting what I'm seeing, or does SB really ignore
>> the Subject header?
>
> The subject header produces tokens that start with the string
> "subject:". When looking at the list of clues Spambayes finds, you
> first see the list of "significant tokens", which means up to 150 (?)
> tokens that score below 0.4 or above 0.6. The complete list is shown
> as "all message tokens". If a token appears in the "all message tokens"
> list but not in the "significant token" list, it's probably because the
> token scored between 0.4 and 0.6, which means the statistics do not
> indicate ham or spam.
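The selection rule described in the quoted text can be sketched roughly as follows. This is a minimal illustration, not SpamBayes' own code: the function name `significant_clues` and its parameters are hypothetical, and the 0.4 / 0.6 cutoffs and the limit of 150 are taken from the description above (the limit was given there with a question mark).

```python
def significant_clues(token_probs, low=0.4, high=0.6, max_clues=150):
    """Return the tokens whose probability falls outside [low, high].

    token_probs maps a token string to its spam probability.
    The strongest clues (furthest from 0.5) come first, and at most
    max_clues of them are kept. All names here are illustrative.
    """
    strong = [(tok, p) for tok, p in token_probs.items()
              if p < low or p > high]
    # Rank by distance from the neutral value 0.5, strongest first.
    strong.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return strong[:max_clues]


if __name__ == "__main__":
    probs = {"spam.": 0.506679, "subject:spam": 0.983271, "spam?": 0.091837}
    for tok, p in significant_clues(probs):
        print(tok, p)
```

With this rule a token such as "spam." at 0.506679 never shows up as a significant clue, while "subject:spam" at 0.983271 does.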
I understand from the above that subject words are considered, but it
still seems to me that something must be wrong. Subject lines
containing [spam] are invariably spam. I have 12 messages that
have [spam] in the subject in my spam training folder and zero in my
ham training folder. Yet messages with [spam] in the subject line are
commonly classified as ham or unsure.

When I ask sb_imapfilter about "[spam]" *or* "subject:[spam]" I get
nothing. In fact, if I do a regex query for .*spam.*, I see:

  Word             # Spam   # Ham   Probability
  (spambayes            0       1      0.155172
  spam?                 0       2      0.091837
  spam,                 0       1      0.155172
  spam.                14      12      0.506679
  spam!                 1       0      0.844828
  *spam*                0       1      0.155172
  subject:spam         13       0      0.983271
  spamming?             1       0      0.844828
  url:spamguard        13       0      0.983271
  spamguard,           13       0      0.983271

which tells me that the tokenizer may be throwing out the brackets.
OK, I see that it's doing so on both ends (when training and when
classifying) so it's okay.
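To illustrate how a tokenizer could turn "[spam]" in a subject line into "subject:spam", here is a deliberately simplified sketch. The regular expression and the function name `subject_tokens` are assumptions made for illustration only; the real SpamBayes tokenizer does more than this and may differ in its details.

```python
import re

# Hypothetical stand-in for a subject tokenizer: keep only runs of
# word characters, so surrounding punctuation such as "[" and "]"
# never becomes part of a word token.
WORD_RE = re.compile(r"\w+")


def subject_tokens(subject):
    """Return 'subject:'-prefixed word tokens for a subject line."""
    return ["subject:" + word for word in WORD_RE.findall(subject)]


if __name__ == "__main__":
    print(subject_tokens("[spam] Huge online pharmacy"))
```

Because the same function would run at training time and at classification time, the bracketed and unbracketed forms end up as the same token on both ends.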
Well, I'm not sure why [spam] hasn't gained more significance, but I
guess I'll just keep training it. Thanks, and sorry for the noise.
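
The probabilities in the listing above are consistent with a Robinson-style adjusted estimate, (s*x + n*p) / (s + n), with s = 0.45 and x = 0.5. The sketch below assumes that formula and those constants; the function name and the training totals `nspam` and `nham` are hypothetical, since the actual message counts are not given here. For tokens seen in only one class the totals cancel out, so those rows can be checked directly.

```python
def token_probability(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Adjusted spam probability for one token.

    spam_count / ham_count: trained spam / ham messages containing the token.
    nspam / nham: total trained spam / ham messages (placeholders here).
    s, x: strength and prior for unseen words (assumed values).
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the prior
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Pull the raw estimate p toward x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # The totals of 100 are arbitrary; they do not affect one-sided tokens.
    print(round(token_probability(13, 0, 100, 100), 6))  # subject:spam row
    print(round(token_probability(0, 2, 100, 100), 6))   # spam? row
```

Under these assumptions a token seen in 13 spam and 0 ham comes out at about 0.983, which is already a strong spam clue, so a message carrying it can still land in ham or unsure when other tokens outweigh it.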
--
Dave Abrahams
Boost Consulting
www.boost-consulting.com