[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Fri, 30 Aug 2002 12:30:55 -0400


[Tim]
> What's an acceptable false positive rate?

[Greg Ward]
> Speaking as one of the people who reviews suspected spam for python.org
> and rescues false positives, I would say that the more relevant figure
> is: how much suspected spam do I have to review every morning?  < 10
> messages would be peachy; right now it's around 5-20 messages per day.

I must be missing something.  I would *hope* that you review *all* messages
claimed to be spam, in which case the number of msgs to be reviewed would,
in a perfectly accurate system, be equal to the number of spams received.

OTOH, the false positive rate doesn't have anything to do with the number of
spams received; it has to do with the number of non-spams received.
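
To make the distinction concrete, here is a minimal sketch of the two
measures.  The function names and the daily counts are invented for
illustration; they are not measurements from python.org.

```python
def false_positive_rate(false_positives, ham_received):
    """Fraction of legitimate (non-spam) messages wrongly flagged as spam."""
    return false_positives / ham_received


def review_pool(spam_caught, false_positives):
    """Number of messages a reviewer sees: everything flagged as spam."""
    return spam_caught + false_positives


# Hypothetical day: 400 non-spam messages arrive and 2 of them get flagged.
ham_received = 400
false_positives = 2

# The false positive rate is fixed by the non-spam traffic alone ...
fp_rate = false_positive_rate(false_positives, ham_received)

# ... while the review workload grows with the amount of spam caught.
light_day = review_pool(spam_caught=8, false_positives=false_positives)
heavy_day = review_pool(spam_caught=98, false_positives=false_positives)

print(fp_rate, light_day, heavy_day)
```

The same false positive rate can therefore go with a small or a large
review pool, depending only on how much spam shows up.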

> Currently there are probably 1-3 FPs per day, although on a bad day
> there can be 5-10.  (Eg. on 2002-08-21, six mailman-users posts from the
> same guy were all caught, mainly because his ISP added X-AntiAbuse, and
> his messages were multipart/alternative with unwrapped plain text.  This
> is a perfect example of SpamAssassin screwing up royally.)  1-3 FPs/day
> I can live with, but the real burden is the manual review: I'd much
> rather have 5 FPs in a pool of 10 suspects than 1 FP out of 100
> suspects.

Maybe you don't want this kind of approach at all.  The classifier doesn't
have "gray areas" in practice:  it tends to give probabilities near 1, or
near 0, and there's very little in between -- a msg either has a
preponderance of spam indicators, or a preponderance of non-spam indicators.
You're simply not going to get a batch of "hmm, I'm not really sure about
these" out of it.  You would in a conventional Bayesian classifier, but
Graham's ignores almost all of the words, judging on only the most extreme
words present; when only extremes are fed in, the final result also tends to
be extreme (the only cases where that doesn't obtain are those where the
most extreme words it finds aren't extreme at all; e.g., a msg consisting
entirely of "the", "and" and "it" would get rated as 0.5).
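
A rough sketch of why the scores pile up at the ends.  It assumes the
combining rule from Graham's "A Plan for Spam" as I understand it:  keep
the 15 word probabilities farthest from 0.5, clamp them to [0.01, 0.99],
and combine them as prod(p) / (prod(p) + prod(1-p)).  The name
graham_score and the sample probabilities are made up for illustration;
this is not the GBayes code itself.

```python
def graham_score(word_probs, max_words=15):
    """Combine per-word spam probabilities using only the most extreme ones."""
    # Clamp so that no single word can force a product to exactly zero.
    clamped = [min(0.99, max(0.01, p)) for p in word_probs]
    # Keep the words whose probabilities lie farthest from the neutral 0.5.
    extremes = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)
    extremes = extremes[:max_words]

    spam_product = 1.0
    ham_product = 1.0
    for p in extremes:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


# A message made only of neutral words ("the", "and", "it") stays at 0.5.
neutral = graham_score([0.5, 0.5, 0.5])

# A few strong spam indicators among many neutral words swamp the result.
spammy = graham_score([0.99] * 5 + [0.5] * 20)

# Likewise for a few strong non-spam indicators.
hammy = graham_score([0.01] * 5 + [0.5] * 20)

print(neutral, spammy, hammy)
```

Because the neutral words are discarded first, a handful of extreme words
decides the outcome, and multiplying several values near 0.99 or 0.01
pushes the result very close to 1 or 0.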

>> What do we get from SpamAssassin?

> Recall the stats I posted this morning; the bulk of spam is in Chinese
> or Korean, and I have things set up so SpamAssassin never even sees it.
> I think the only way to meaningfully answer this question is to stash
> *everything* mail.python.org receives for a day or 10, spam and
> otherwise, and run it all through SA.

It would be good to have such a corpus regardless.