[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Fri, 30 Aug 2002 12:45:44 -0400


[Skip Montanaro]
> ...
> One thing I think would be worthwhile would be to run GBayes first, then
> only run stuff it thought was spam through SpamAssassin.  Only
> messages that both systems categorized as spam would drop into the spam
> folder.  This has a couple benefits over running one or the other in
> isolation:
>
>     * The training set for GBayes probably doesn't need to be as big

Training GBayes is cheap, and the more you feed it the less you need
information-destroying transformations (like folding case or ignoring
punctuation).

>     * The two systems use substantially different approaches to
>       identifying spam,

Which could indeed be a killer-strong benefit.

>       so I suspect your false positive rate would go way down.

I'm already having a real problem with this just looking at content:  the
false positive rate is already so low that I can't draw statistically
significant conclusions about changes that might improve it.  E.g., if I do
something that removes just *one* false positive in a test run on 4000 hams,
the false-positive rate falls by 12.5%, which means there were only about 8
to begin with -- I don't have enough false positives to make fine-grained
judgments.  And, indeed, every time I test a change to the algorithm, the
most *significant* thing I find is that it turns up another class of blatant
spam hiding in the ham corpus:  my training data is still too dirty, and
cleaning it up is labor-intensive.
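
To make the arithmetic concrete, here is a rough sketch, not the actual test
harness.  The count of 8 false positives is inferred from the 12.5% figure
(1/8 == 0.125), and the function names are made up for illustration.  The
binomial standard deviation is only a back-of-the-envelope way to see how
noisy a count this small is.

```python
import math

def relative_drop(fp_before, fp_after):
    """Fractional fall in the false-positive rate between two runs on the
    same ham set (the number of hams cancels out)."""
    return (fp_before - fp_after) / fp_before

def binomial_sd(n_ham, fp_count):
    """Approximate standard deviation of the false-positive count, treating
    each ham as an independent trial with p = fp_count / n_ham."""
    p = fp_count / n_ham
    return math.sqrt(n_ham * p * (1 - p))

n_ham = 4000
fp_before = 8   # assumption: implied by "one fewer == 12.5%"
fp_after = 7

print(relative_drop(fp_before, fp_after))   # relative change in the rate
print(binomial_sd(n_ham, fp_before))        # sampling noise in the count
```

The noise in the count is on the order of a few messages, so a change of a
single false positive can't be told apart from luck.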

>       False negatives would go up, but only testing can suggest by how
>       much.
>
>     * Since SA is dog slow most of the time, SA users get a big speedup,
>       since a substantially smaller fraction of your messages get run
>       through it.
>
> This sort of chaining is pretty trivial to set up with procmail.
> Dunno what the Windows set will do though.

There are different audiences here.  Greg is keen to have a better approach
for python.org as a whole, while Barry is keen on that and also on doing
something more generic for Mailman.  Windows isn't an issue for either of
those.  Everyone else can eat cake <wink>.