[spambayes-dev] testing tweaks

Tue Aug 12 01:45:23 EDT 2003

There was a thread partly about the mixed unigram/bigram scheme last
November, starting here:

    http://mail.python.org/pipermail/spambayes/2002-November/001912.html

It wasted time starting with a unigram+bigram+trigram scheme, and wasted
more time trying to use hash codes to reduce the database burden (we've
regretted that every time we've tried it).

The spambayes results on my main test data were already so good then that
testing couldn't verify any claimed improvement (it could only demonstrate
that a suggested idea did worse).  The "I only had time to run a few tests
on that, and it looked very promising" refers to later small tests I never
wrote up.  They were closest to what a message late in this thread called
"bix" (exact, non-hashing bigrams).
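
For readers unfamiliar with the distinction, here is a minimal sketch of exact
versus hashed bigrams.  It is not the code from that thread: the function
names, the CRC32 hash, and the bucket count are all assumptions made for
illustration.

```python
import zlib

def exact_bigram(a, b):
    # Exact ("bix"-style) bigram: the database key is the token pair itself.
    return "%s %s" % (a, b)

def hashed_bigram(a, b, n_buckets=2**20):
    # Hashed bigram: the pair is folded into one of n_buckets integer keys.
    # CRC32 and the default bucket count are illustrative assumptions.
    key = ("%s %s" % (a, b)).encode("utf-8")
    return zlib.crc32(key) % n_buckets
```

Hashing bounds the number of distinct keys, but unrelated pairs can land in
the same bucket, and a bucket number can't be mapped back to the words that
produced it.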

Like Tony did, I was really using token bigrams (and trigrams, at the
start).  There were many mysteries related to bigrams created from header
tokens, as pointed out in several of that thread's messages.  Another
mystery covered there is that split-on-whitespace still beat "extract words"
for the fundamental tokenization gimmick.  It's a mystery because the only
"reason" I ever found for split-on-whitespace winning with unigrams was the
weak context info it offers (e.g., "free!!" is more likely to be spammy than
"free").  Moving to bigrams (or higher) really should give much stronger
context info than we get from keeping punctuation.
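
A rough sketch of the two tokenization styles and of a mixed unigram/bigram
token stream follows.  This is not the spambayes tokenizer: the regular
expression used for "extract words", the function names, and the "bi:" prefix
are assumptions chosen only to make the contrast concrete.

```python
import re

def split_on_whitespace(text):
    # Keeps punctuation attached to words, so "free!!" and "free" differ.
    return text.split()

def extract_words(text):
    # Assumed word pattern: runs of letters, digits, and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)

def unigrams_and_bigrams(tokens):
    # Yield every token, then every pair of adjacent tokens.
    for tok in tokens:
        yield tok
    for a, b in zip(tokens, tokens[1:]):
        yield "bi:%s %s" % (a, b)
```

On the text "Get it free!! now", split_on_whitespace keeps "free!!" as a
token while extract_words reduces it to "free".  Feeding either token list to
unigrams_and_bigrams adds pairs of adjacent tokens, which carry context that
the trailing punctuation only hints at.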

So many mysteries, so little time ...