[Spambayes] Does SB tokenize the subject?

Skip Montanaro skip at pobox.com
Mon Dec 27 16:58:25 CET 2004

    Amir> I've just received a spam which scored 0%. The subject was "No
    Amir> prescription?  no problem!". I examined the message (with
    Amir> Pocketknife Peek) and true, the message itself was pure text with
    Amir> absolutely no spam token. The spam was in an HTML attachment that
    Amir> fell off somewhere on the way. So 0% is OK for the contents, but
    Amir> examining the clues, I did not see the word "prescription" there.
    Amir> Bug or feature?

Feature.  Yes, Spambayes tokenizes the subject.  In this case it would have
emitted these tokens:

    'subject: '
    'subject:? '
    'subject: '

(quoted so you can see there is whitespace in some of them).

If "subject:prescription" had never turned up in your earlier training,
though, it would have been ignored for scoring purposes, which is why you
didn't see it among the clues.  In case you're wondering about the
punctuation and whitespace, testing revealed that leaving them in when
tokenizing subjects helped.
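
To make that concrete, here is a rough sketch of the idea.  It is not the
actual Spambayes tokenizer code: the regular expressions and the names
WORD_RE, PUNCT_RUN_RE, tokenize_subject and scorable are made up for
illustration, and the real tokenizer does more than this.  It only shows
words and punctuation/whitespace runs both becoming "subject:" tokens, and
untrained tokens dropping out before scoring.

```python
import re

# Hypothetical patterns for illustration only, not taken from Spambayes.
WORD_RE = re.compile(r"[\w$.%]+")    # runs of word-like characters
PUNCT_RUN_RE = re.compile(r"\W+")    # runs of punctuation and whitespace


def tokenize_subject(subject):
    """Yield 'subject:' tokens for words, then for punctuation runs.

    Case is left alone, and punctuation/whitespace runs are kept as
    tokens in their own right rather than thrown away.
    """
    for word in WORD_RE.findall(subject):
        yield "subject:" + word
    for run in PUNCT_RUN_RE.findall(subject):
        yield "subject:" + run


def scorable(tokens, trained):
    """Keep only tokens seen in training; the rest cannot affect the score."""
    return [tok for tok in tokens if tok in trained]
```

Calling list(tokenize_subject("No prescription?  no problem!")) gives the
word tokens followed by the punctuation tokens, and passing that list to
scorable() with a set of trained tokens that lacks "subject:prescription"
leaves that token out of the result.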


