[Spambayes] Does SB tokenize the subject?

Skip Montanaro skip at pobox.com
Mon Dec 27 16:58:25 CET 2004


    Amir> I've just received a spam which scored 0%. The subject was "No
    Amir> prescription?  no problem!". I examined the message (with
    Amir> Pocketknife Peek) and true, the message itself was pure text with
    Amir> absolutely no spam token. The spam was in an HTML attachment that
    Amir> fell off somewhere on the way. So 0% is OK for the contents, but
    Amir> examining the clues, I did not see the word "prescription" there.
 
    Amir> Bug or feature?

Feature.  Yes, Spambayes tokenizes the subject.  In this case it would have
emitted these tokens:

    'subject:prescription'
    'subject:problem'
    'subject: '
    'subject:? '
    'subject: '
    'subject:!'

(quoted so you can see there is whitespace in some of them).
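
For anyone who wants to see how a subject line turns into tokens like those, here is a minimal sketch. It is not the Spambayes tokenizer itself: the two regular expressions, the function name tokenize_subject, and the 3-to-12 character word-length limits are my assumptions, chosen only so that the short words "No" and "no" drop out the way they do in the list above.

```python
import re

# Assumed patterns for illustration only; not copied from the Spambayes source.
WORD_RE = re.compile(r"\w+")        # runs of word characters
PUNCT_RUN_RE = re.compile(r"\W+")   # runs of punctuation and whitespace


def tokenize_subject(subject, min_len=3, max_len=12):
    """Yield 'subject:'-prefixed tokens for one subject line.

    min_len and max_len are hypothetical limits; words outside that
    range are simply skipped in this sketch.
    """
    # Word tokens first, skipping very short words such as "No".
    for word in WORD_RE.findall(subject):
        if min_len <= len(word) <= max_len:
            yield "subject:" + word
    # Then every run of punctuation/whitespace, kept exactly as found.
    for run in PUNCT_RUN_RE.findall(subject):
        yield "subject:" + run


if __name__ == "__main__":
    for token in tokenize_subject("No prescription? no problem!"):
        print(repr(token))
```

Note that the sketch keeps each punctuation run verbatim, so a subject with two spaces after the question mark would give a token with two trailing spaces rather than one.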

If "subject:prescription" had never appeared in your earlier training, it
would have been ignored for scoring purposes, though, which is why you did
not see it among the clues.  In case you're wondering about the punctuation
and whitespace, testing revealed that leaving them in when tokenizing
subjects helped.
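
To illustrate why a never-trained token does not show up among the clues, here is a rough sketch. The probabilities in the trained table are made up, and the names UNKNOWN_PROB, MIN_STRENGTH and significant_clues are hypothetical; the idea, as I am assuming it here, is that an unseen token gets a neutral probability and that clues too close to neutral are dropped before scoring.

```python
# Hypothetical per-token spam probabilities learned from training (made-up values).
trained = {"subject:problem": 0.35, "subject:!": 0.80}

UNKNOWN_PROB = 0.5   # assumed neutral probability for a never-seen token
MIN_STRENGTH = 0.1   # assumed cutoff: clues closer to 0.5 than this are dropped


def significant_clues(tokens, trained):
    """Return sorted (token, probability) pairs strong enough to affect the score."""
    clues = []
    for tok in set(tokens):
        prob = trained.get(tok, UNKNOWN_PROB)
        # A never-seen token sits exactly at 0.5 and is filtered out here.
        if abs(prob - 0.5) >= MIN_STRENGTH:
            clues.append((tok, prob))
    return sorted(clues)


if __name__ == "__main__":
    tokens = ["subject:prescription", "subject:problem", "subject:!"]
    for tok, prob in significant_clues(tokens, trained):
        print(repr(tok), prob)
```

With this toy table, "subject:prescription" is absent from the output because it was never trained on, while the two trained tokens survive.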

Skip

