[Spambayes] Does SB tokenize the subject?
Skip Montanaro
skip at pobox.com
Mon Dec 27 16:58:25 CET 2004
Amir> I've just received a spam which scored 0%. The subject was "No
Amir> prescription? no problem!". I examined the message (with
Amir> Pocketknife Peek) and true, the message itself was pure text with
Amir> absolutely no spam token. The spam was in an HTML attachment that
Amir> fell off somewhere on the way. So 0% is OK for the contents, but
Amir> examining the clues, I did not see the word "prescription" there.
Amir> Bug or feature?
Feature. Yes, Spambayes tokenizes the subject. In this case it would have
emitted these tokens:
'subject:prescription'
'subject:problem'
'subject: '
'subject:? '
'subject: '
'subject:!'
(quoted so you can see there is whitespace in some of them).
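
For anyone who wants to reproduce that list, here is a minimal sketch of the idea. It is not the Spambayes source: the function name, the two regular expressions, and the 3-to-12 character length bounds are assumptions chosen so that the output matches the tokens listed above (note that "No" and "no" produce no tokens there).

```python
import re

# Assumed patterns for illustration only; not copied from Spambayes.
WORD_RE = re.compile(r"[\w$.%]+")    # runs of word-like characters
PUNCT_RUN_RE = re.compile(r"\W+")    # runs of punctuation and whitespace


def tokenize_subject(subject, min_len=3, max_len=12):
    """Yield 'subject:'-prefixed tokens for one Subject header value.

    min_len and max_len are hypothetical bounds: words outside that
    range are skipped, which is why "No" and "no" yield nothing.
    """
    # Word tokens first, with case left untouched.
    for word in WORD_RE.findall(subject):
        if min_len <= len(word) <= max_len:
            yield "subject:" + word
    # Then every run of punctuation and whitespace, kept verbatim.
    for run in PUNCT_RUN_RE.findall(subject):
        yield "subject:" + run


if __name__ == "__main__":
    for token in tokenize_subject("No prescription? no problem!"):
        print(repr(token))
```

Printing with repr() makes the whitespace inside the punctuation tokens visible, the same way the quotes do in the list above.
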
If you had never seen "subject:prescription" in earlier training, though, it
would have been ignored for scoring purposes, which would explain why it did
not appear among the clues. In case you're wondering about the punctuation
and whitespace, testing revealed that leaving them in when tokenizing
subjects helped.
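
The "ignored if never seen in training" behavior can be pictured with a small sketch. Everything here is hypothetical: the function name, and the dictionary mapping each trained token to a (spam count, ham count) pair, are stand-ins for illustration and do not reflect how Spambayes actually stores or scores its data.

```python
def known_clues(tokens, trained_counts):
    """Return only the tokens that have training data behind them.

    trained_counts is a hypothetical mapping of
    token -> (spam_count, ham_count). Tokens missing from it carry
    no evidence either way, so they are left out of scoring.
    """
    return [token for token in tokens if token in trained_counts]


if __name__ == "__main__":
    # Hypothetical training data: only one subject token has been seen.
    trained = {"subject:problem": (3, 0)}
    message_tokens = ["subject:prescription", "subject:problem"]
    print(known_clues(message_tokens, trained))
```

With that made-up training data, "subject:prescription" is dropped and only "subject:problem" is left to contribute to the score.
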
Skip