[Spambayes] Another optimization

T. Alexander Popiel popiel@wolfskeep.com
Wed, 18 Sep 2002 10:00:02 -0700


Hello, folks.

Some weeks ago I independently implemented Graham's filter
algorithm, based on his essay on the topic.  Since my
implementation is in perl (please, no flames), I doubt
my code will be all that helpful... and anyway, I just
sort of hacked it together.  If you still want to see it
after that disclaimer, it's available at
http://wolfskeep.com/mailfilter/ .

About a week and a half ago, I got a reference to your
collective work here; I scanned through your code gleaning
ideas.  I figured I should return the favor. ;-)

There are a few things I was wondering about, though:

* Graham was very specific in describing his tokenizer...
  and you folks seem to have ignored that description.
  Instead, you're using split-on-whitespace augmented by
  a few handcrafted hacks for URLs, addresses, and the like.
  This puzzles me, since I seem to get better results using
  the tokenization that Graham suggested.  (I've sketched my
  reading of that tokenization below, after these points.)

* Why do you throw out super-long words?  Yes, they
  potentially increase the database size, but I find that
  lines of dashes of various lengths are particularly
  good spam and non-spam indicators.  Also, some of the
  worms seem to be caught by recognizing their uuencoded
  forms.  Since I'm not storing _all_ words encountered
  in the database (I only keep the computed probabilities
  after ignoring words with too few occurrences),
  keeping the long words doesn't seem to use much space.
  And disk space is cheap, right?

* One key feature of Graham's algorithm is that it only
  considers the N most significant words in a message;
  however, in your cancellation of .01 and .99 words, you
  seem to throw away the most significant words in
  favor of less significant words.  (Yes, I understand
  that the math for cancellation works out assuming the
  same set of less significant words is considered...
  but by throwing out the matching words, I think your
  code then considers _more_ of the less significant words.)

  In my code, instead of cancelling out the .01 and .99
  words, I inflate the considered-word count to include
  all of these 'maximally significant' words that appear
  in the message, even if that number is >N.  However, I
  do _not_ consider additional less significant words if
  the number of maximally significant words is >N.  I have
  found this to be _much_ more effective; it lowered the
  FN rate while leaving the FP rate constant, whereas
  cancelling and considering more of the less significant
  words increased both FN and FP for me.  (A simplified
  sketch of this selection scheme is below.)

  If I managed to misread your code, I apologize for the
  mischaracterization.
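
For the first point, here is roughly how I read Graham's
tokenization rule, sketched in Python for readability rather
than lifted from my perl (the regex and the all-digit check
are my reading of the essay, not anyone's actual code):

    import re

    # My reading of the tokenizer Graham describes: alphanumerics,
    # dashes, apostrophes, and dollar signs are token constituents;
    # everything else is a separator, and all-digit tokens are dropped.
    TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

    def tokenize(text):
        for match in TOKEN_RE.finditer(text):
            token = match.group()
            if not token.isdigit():   # ignore purely numeric tokens
                yield token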
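
And for the third point, a simplified sketch of the selection
scheme I described, again in Python rather than my perl (the
function names, the N=15 default, and the 0.01/0.99 bounds are
just illustrative):

    def select_significant(probs, n=15, lo=0.01, hi=0.99):
        # probs maps word -> spam probability for this message.
        # Keep *every* word pinned at the extremes, even if that
        # is more than n; only when there are fewer than n extremes
        # do the next most significant words fill the remaining slots.
        ranked = sorted(probs.items(),
                        key=lambda kv: abs(kv[1] - 0.5),
                        reverse=True)
        maximal = [(w, p) for w, p in ranked if p <= lo or p >= hi]
        if len(maximal) >= n:
            return maximal                # never trim the extremes
        rest = [(w, p) for w, p in ranked if lo < p < hi]
        return maximal + rest[:n - len(maximal)]

    def combined_prob(selected):
        # Graham's combining rule: prod(p) / (prod(p) + prod(1 - p))
        p = q = 1.0
        for _, prob in selected:
            p *= prob
            q *= 1.0 - prob
        return p / (p + q)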

My testing is not as rigorous as yours is; in particular,
I don't retrain using several different sets of training
data in the course of evaluating my changes.  As a result,
my experience may be biased. :-/

- Alex