[spambayes-dev] Trigraphs as indicators of invalid subject words

Skip Montanaro skip at pobox.com
Fri Jun 6 15:45:58 EDT 2003


    [ ... on using trigraphs as clues to identify bogus words in message
    subjects ... ]

    >> Now you could turn things around and say the subject contained an
    >> invalid word.  That might be a useful clue for Spambayes.

    Scott> That was my idea.  Find a way to use the non-wordness to
    Scott> penalize, rather than favor a message.

I tried it and found it had essentially no effect.  That doesn't mean it
isn't a good idea.  It's just that Spambayes is already so good that there
isn't much room for improvement.  I just ran a 10-fold cross validation test
using 500 spams and 500 hams in each of the ten sets.  Each run trained on
nine sets each of hams and spams (4500 messages of each) and tested against
the remaining set of each, rotating which set was held out.  Over all ten
runs (5000 hams and 5000 spams scored in total) it scored 16 hams
incorrectly (false positives - 0.32%), scored 40 spams incorrectly (false
negatives - 0.80%) and was unsure about 573 messages (5.73%).  When I added
Scott's idea, implemented as a synthetic "subject:invalid word" token, the
false positives and false negatives didn't change.  The unsures crept up to
574.
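
For anyone who wants to picture what such a synthetic token might look like,
here is a rough sketch.  It is not the code from the Spambayes tokenizer and
not necessarily what was run for the test above.  The word list, the
trigraphs() helper and the subject_tokens() generator are all made up for
illustration, and "invalid" is assumed to mean "contains a three-letter
sequence never seen in a known word".

```python
import re

# Hypothetical toy vocabulary (an assumption for this sketch).  A real run
# would build the trigraph set from a large word list or the ham corpus.
KNOWN_WORDS = ["meeting", "report", "python", "schedule", "invoice", "lunch"]


def trigraphs(word):
    """Return the set of three-letter sequences in a lowercased word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


# Every trigraph that appears in at least one known word.
VALID_TRIGRAPHS = set()
for known in KNOWN_WORDS:
    VALID_TRIGRAPHS |= trigraphs(known)


def subject_tokens(subject):
    """Yield one token per subject word, plus a single synthetic clue
    if any word contains a trigraph outside VALID_TRIGRAPHS."""
    saw_invalid = False
    for word in re.findall(r"[A-Za-z]+", subject):
        yield "subject:" + word.lower()
        # Words shorter than three letters have no trigraphs and pass.
        if not trigraphs(word) <= VALID_TRIGRAPHS:
            saw_invalid = True
    if saw_invalid:
        yield "subject:invalid word"


if __name__ == "__main__":
    print(list(subject_tokens("Python report xqzvk")))
```

The sketch emits the clue at most once per subject, so a subject with five
bogus words carries the same synthetic token as a subject with one.  Counting
them instead would be a different design and is not something tested here.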

This was run on a new training database (12700+ hams and 8600+ spams) which
I haven't exhaustively combed for errors, so it's possible there are still
some mistakes of mine in there (a ham message placed in the spam training
set, for example).  It is essentially the same data I use to train Spambayes
and classify messages on a daily basis, though, so I think it's fairly
clean.

Skip


