[Spambayes] Another optimization
T. Alexander Popiel
popiel@wolfskeep.com
Wed, 18 Sep 2002 10:00:02 -0700
Hello, folks.
Some weeks ago I independently implemented Graham's filter
algorithm, based on his essay on the topic. Since my
implementation is in perl (please, no flames), I doubt
my code will be all that helpful... and anyway, I just
sort of hacked it together. If you still want to see it
after that disclaimer, it's available at
http://wolfskeep.com/mailfilter/ .
About a week and a half ago, I got a reference to your
collective work here; I scanned through your code gleaning
ideas. I figured I should return the favor. ;-)
There are a few things I was wondering about, though:
* Graham was very specific in describing his tokenizer...
and you folks seem to have ignored that description.
Instead, you're using split-on-whitespace augmented by
a few handcrafted hacks for URLs, addresses, and the like.
This puzzles me, since I seem to get better results using
the tokenization that Graham suggested (see the first
sketch after this list).
* Why do you throw out super-long words? Yes, they
potentially increase the database size, but I find lines
of dashes of various lengths to be particularly
good spam and non-spam indicators. Also, some of the
worms seem to be caught by recognizing their uuencoded
forms. Since I'm not storing _all_ words encountered
in the database (I keep only the computed probabilities,
after ignoring words that occur only a few times),
keeping the long words doesn't seem to use much space.
And disk space is cheap, right?
* One key feature of Graham's algorithm is that it only
considers the N most significant words in a message;
however, in your cancellation of .01 and .99 words, you
seem to throw away the most significant words in favor
of less significant ones. (Yes, I understand that the
math for cancellation works out if the same set of less
significant words is considered either way... but by
throwing out the matching extreme words, I think your
code then goes on to consider _more_ of the less
significant words.)
In my code, instead of cancelling out the .01 and .99
words, I inflate the considered words count to include
all of these 'maximally significant' words that appear
in the message, even if that number is >N. However, I
do _not_ consider additional less significant words if
the number of maximally significant words is >N (see the
second sketch after this list). I have
found this to be _much_ more effective; it lowered the
FN rate while leaving the FP rate constant, whereas
cancelling and considering more of the less significant
words increased both FN and FP for me.
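To make the first and last points concrete, here are two
rough sketches in Python (not my perl, and not your code;
everything below is just how I read Graham's essay plus the
rule I described above, so take the names and defaults as
mine, not as anything from spambayes).

First, the tokenization. The regex is my reading of Graham's
description (alphanumerics, dashes, apostrophes, and dollar
signs form tokens; all-digit tokens are ignored); the
whitespace version stands in for the split-on-whitespace
scheme, minus the URL/address hacks, with a 12-character
cutoff as a placeholder for whatever limit you actually use:

import re

GRAHAM_TOKEN = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize_graham(text):
    # Runs of token constituents; all-digit tokens dropped.
    for tok in GRAHAM_TOKEN.findall(text):
        if not tok.isdigit():
            yield tok

def tokenize_whitespace(text, max_len=12):
    # Split on whitespace, discarding super-long words.
    for tok in text.split():
        if len(tok) <= max_len:
            yield tok

sample = "Click http://example.com/buy-now ------------------"
print(list(tokenize_graham(sample)))
# ['Click', 'http', 'example', 'com', 'buy-now',
#  '------------------']
print(list(tokenize_whitespace(sample)))
# ['Click']

Note that the line of dashes survives Graham's tokenization
but falls to the length cutoff, which is exactly the sort of
indicator I was talking about in the second point.

Second, the selection rule. This assumes word probabilities
have already been clamped to the .01/.99 extremes, and uses
N=15 as in Graham's essay; the function names are mine:

def select_significant(probs, n=15, lo=0.01, hi=0.99):
    # Every word clamped to an extreme is always considered,
    # even if there are more than n of them; only when there
    # are fewer than n do we fill out the list with the next
    # most significant words.
    extreme = [p for p in probs.values() if p <= lo or p >= hi]
    rest = sorted((p for p in probs.values() if lo < p < hi),
                  key=lambda p: abs(p - 0.5), reverse=True)
    if len(extreme) >= n:
        return extreme
    return extreme + rest[:n - len(extreme)]

def combine(probs):
    # Graham's combining rule:
    #   P = prod(p) / (prod(p) + prod(1 - p))
    num = den = 1.0
    for p in probs:
        num *= p
        den *= 1.0 - p
    return num / (num + den)

Under this rule a message with, say, twenty .99 words and
eight .01 words is scored on all twenty-eight of them and
nothing else, rather than having eight pairs cancelled and
the slack taken up by middling words.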
If I managed to misread your code, I apologize for the
mischaracterization.
My testing is not as rigorous as yours; in particular,
I don't retrain using several different sets of training
data in the course of evaluating my changes. As a result,
my experience may be biased. :-/
- Alex