[Spambayes] Promoting Spambayes (was Re: FYI: Java implementation)

Tim Peters tim.one at comcast.net
Tue Jan 21 14:38:12 EST 2003


[Justin Mason]
> BTW it's worth noting we didn't just "nab" the ideas ;)

I would have <wink>.

> Instead I reimplemented based on descriptions, running a cross-validation
> test each time, and threw in a few tokenization ideas of our own.

One thing we found, on rare occasions, is that a change vetted as winner or
loser via a CV run on one set of test data turned out to be neutral on
somebody else's test data, or (very rarely) even gave an opposite result.
Some small amount of that is expected by chance, of course, but testing on
multiple test sets (in addition to slicing & dicing a single test set) is an
important check too.
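
A minimal sketch of that kind of cross-check, assuming each data set has
already been through a cross-validation run and yielded per-fold error rates
for the baseline and for the changed classifier. The function names and the
shape of `results` are hypothetical, not part of the SpamBayes test
framework, and comparing plain means ignores the chance variation mentioned
above.

```python
from statistics import mean


def compare_on_dataset(baseline_errors, variant_errors):
    """Classify the change as 'win', 'loss', or 'tie' on one data set.

    Both arguments are lists of per-fold error rates from a CV run.
    A lower mean error rate for the variant counts as a win.
    """
    diff = mean(variant_errors) - mean(baseline_errors)
    if diff < 0:
        return "win"
    if diff > 0:
        return "loss"
    return "tie"


def verdicts_across_datasets(results):
    """Map each data-set name to its verdict.

    `results` maps a name to a (baseline_errors, variant_errors) pair.
    """
    return {name: compare_on_dataset(b, v) for name, (b, v) in results.items()}


def is_consistent(verdicts):
    """True only if every data set produced the same verdict."""
    return len(set(verdicts.values())) == 1
```

A change that is a "win" on one corpus and a "loss" on another makes
`is_consistent` return False, which is the signal to look harder before
adopting it.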

> In most cases the results indicated that SpamBayes' techniques are the
> most effective -- there were a few extras, like SpamAssassin tokenizing
> some headers that SB doesn't (From etc.),

There are generally options to change all that.  I became inactive as this
project was transitioning from mostly-research to mostly-deployment, and the
defaults still reflect the more severe "purity needs" of research.  For
example, virtually all the ham in my main test set had a common "From" line
(it was generated by a news->email gateway) but none of my spam had that
From line.  So "From" was ignored by default.  In the Outlook 2000 client I
use every day, though, From To Cc Sender and Reply-To are all tokenized.
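
A rough sketch of header tokenization with a configurable header list, using
only Python's standard `email` package. This is not the SpamBayes tokenizer:
the function name, the `headers` parameter, and the `name:word` token format
are assumptions made for illustration.

```python
from email import message_from_string


def tokenize_headers(raw_message,
                     headers=("from", "to", "cc", "sender", "reply-to")):
    """Yield 'header:word' tokens for each selected header.

    Pass a shorter `headers` tuple to ignore headers that would leak
    information about how a test corpus was collected.
    """
    msg = message_from_string(raw_message)
    for name in headers:
        # get_all returns every occurrence of the header, or the
        # supplied default when the header is absent.
        for value in msg.get_all(name, []):
            for word in value.lower().split():
                yield f"{name}:{word}"
```

With `headers=("to", "cc")`, for example, a "From" line shared by all the ham
in a corpus would no longer produce tokens.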

> and different S and X values,

Note that Greg Louis has done a lot of good research on those, in connection
with bogofilter.
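
For context, a sketch of where S and X enter, assuming the usual Gary
Robinson adjustment of a word's raw spam probability: S is the weight given
to the prior, X is the probability assumed for a word never seen before, and
n is the number of training messages containing the word. The default values
below are placeholders, not the settings of any particular filter.

```python
def adjusted_prob(p, n, s=1.0, x=0.5):
    """Blend a word's observed spam probability with a prior.

    p: raw spam probability estimated from training data
    n: number of training messages that contained the word
    s: strength of the prior (the "S" value)
    x: probability assumed for an unseen word (the "X" value)
    """
    return (s * x + n * p) / (s + n)
```

With n = 0 the result is just x, and as n grows the result approaches p, so
S controls how quickly evidence overrides the prior.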

> but for the most part they're effectively the same.
>
> The nice thing is that it means those techniques have been independently
> verified by 2 parties -- in other words, a scientific process ;)

It's appreciated!  That's more important than the specific algorithms used.
Given a proper test framework, the data will eventually tell you what does
and doesn't work; without proper statistical testing it's all guessing.  A
problem is what to do when error rates get too low to measure reliably.  My
previous life in speech recognition didn't prepare me for that one <wink>.
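
One way to see the measurement problem is to ask how large the true error
rate could still be after a test run with no errors at all. The sketch below
assumes the test messages are independent draws, which real mail corpora only
approximate; the function name is hypothetical.

```python
def zero_error_upper_bound(n, confidence=0.95):
    """Upper bound on the true error rate after 0 errors in n messages.

    Solves (1 - rate) ** n == 1 - confidence for rate, assuming the
    n messages are independent trials.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)
```

For n = 1000 the bound is close to 3/1000 (the "rule of three"), so an
error-free run on a thousand messages cannot distinguish a 0.01% error rate
from a 0.2% one.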



