[Spambayes] Ping: subject header ignored? [was: Not mining Subject headers?]

Tue Feb 6 21:09:57 CET 2007

"Seth Goodman" <sethg at goodmanassociates.com> writes:

> David Abrahams wrote on Tuesday, February 06, 2007 11:05 AM -0600:
>
>> David Abrahams <dave at boost-consulting.com> writes:
>>
>> > How is it that for a message with
>> >
>> >   Subject: Huge online pharmacy
>> >
>> > Spambayes isn't using "pharmacy" as a classification token?  I can't
>> > find a setting that will make it do that, either.
>>
>> Am I just misinterpreting what I'm seeing, or does SB really ignore
>> the Subject header?
>
> The subject header produces tokens that start with the string
> "subject:".  When looking at the list of clues Spambayes finds, you
> first see the list of "significant tokens", which means up to 150 (?)
> tokens that score below 0.4 or above 0.6.  The complete list is shown
> as "all message tokens".  If a token appears in the "all message tokens"
> list but not in the "significant token" list, it's probably because the
> token scored between 0.4 and 0.6, which means the statistics do not
> indicate ham or spam.
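
To make that selection rule concrete, here is a minimal sketch of it as
I read Seth's description.  The function name, its parameters, and the
ordering by distance from 0.5 are my own assumptions for illustration,
not Spambayes' actual code; the sample probabilities are taken from the
query output further down.

```python
def significant_clues(token_probs, low=0.4, high=0.6, max_clues=150):
    """Pick the tokens that score below `low` or above `high`.

    Hypothetical helper: the 0.4/0.6 cutoffs and the cap of 150 come
    from the description above; the sort order is an assumption.
    """
    strong = [(tok, p) for tok, p in token_probs.items() if p < low or p > high]
    # Strongest evidence (farthest from the neutral 0.5) first.
    strong.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return strong[:max_clues]


if __name__ == "__main__":
    probs = {"subject:spam": 0.983271, "spam.": 0.506679, "spam?": 0.091837}
    for token, prob in significant_clues(probs):
        print(token, prob)
```

Under this rule "spam." at 0.506679 would show up only under "all
message tokens", while "subject:spam" would be listed as significant.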

I understand from the above that subject words are considered, but it
still seems to me that something must be wrong.  Subject lines
containing [spam] are invariably spam.  I have 12 messages that
have [spam] in the subject in my spam training folder and zero in my
ham training folder.  Yet messages with [spam] in the subject line are
commonly classified as ham or unsure.

When I ask sb_imapfilter about "[spam]" *or* "subject:[spam]" I get
nothing.  In fact, if I do a regex query for .*spam.*, I see:

Word           # Spam         # Ham          Probability
(spambayes     0              1              0.155172
spam?          0              2              0.091837
spam,          0              1              0.155172
spam.          14             12             0.506679
spam!          1              0              0.844828
*spam*         0              1              0.155172
subject:spam   13             0              0.983271
spamming?      1              0              0.844828
url:spamguard  13             0              0.983271
spamguard,     13             0              0.983271

which tells me that the tokenizer may be throwing out the brackets.
OK, I see that it's doing so on both ends (when training and when
classifying) so it's okay.
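
As a rough illustration of why a query for "subject:[spam]" finds
nothing: if the subject is split into runs of word characters, the
brackets never become part of a token.  This is only a sketch; the
regular expression and the function name are assumptions of mine, not
the tokenizer's real pattern, and I make no attempt at case folding.

```python
import re

# Assumed pattern: runs of word characters plus a few symbols.
# Square brackets are not in the set, so they act as separators.
WORD_RE = re.compile(r"[\w$.%]+")


def subject_tokens(subject):
    """Hypothetical subject tokenizer: prefix each word with 'subject:'."""
    return ["subject:" + word for word in WORD_RE.findall(subject)]


if __name__ == "__main__":
    print(subject_tokens("[spam] Huge online pharmacy"))
```

As long as the same splitting is applied when training and when
classifying, "[spam]" in a subject is simply counted as "subject:spam".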

Well, I'm not sure why [spam] hasn't gained more significance, but I
guess I'll just keep training it.  Thanks, and sorry for the noise.
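
For anyone checking the Probability column above, here is a small
sketch of the usual Robinson-style adjustment.  I am assuming a prior
strength s = 0.45 and a prior probability x = 0.5; those values are my
assumption (they happen to reproduce the rows above where one count is
zero), and the function name and the nspam/nham totals are made up for
the example.

```python
def token_probability(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate a token's spam probability, pulled toward the prior x.

    nspam and nham are the total numbers of trained spam and ham
    messages (hypothetical values in the example below).
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never seen: fall back to the prior
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Weighted average of the prior (weight s) and the evidence (weight n).
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # Totals of 100 are placeholders; they cancel when one count is zero.
    print(round(token_probability(13, 0, 100, 100), 6))
    print(round(token_probability(0, 2, 100, 100), 6))
```

With 13 spam and 0 ham occurrences this gives 0.983271, and with 0 spam
and 2 ham it gives 0.091837, matching the "subject:spam" and "spam?"
rows.  Rows with both counts nonzero, such as "spam.", also depend on
the training totals, which I have not shown here.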

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com