Graham's spam filter

Fernando Pereira pereira at cis.upenn.edu
Sun Aug 25 10:20:43 EDT 2002


On 8/22/02 12:38 PM, in article
mailman.1030034444.14290.python-list at python.org, "Oren Tirosh"
<oren-py-l at hishome.net> wrote:
> I was wondering about another issue - could this system use decision
> feedback? If the system detects an email as having a very low probability of
> being spam (e.g. <0.1) it could be fed back into the system to update its
> statistics continuously without human intervention. I assume that spam
> that does pass through will not pass with such low probabilities. More
> likely it will have something over 0.5 but not pass the 0.9 threshold needed
> to label it as spam.
> 
> Decision feedback is powerful but also dangerous - if the system starts
> to make systematic errors they will tend to increase.

This is typically a bad idea. Naïve Bayes document classifiers tend to
assign class probabilities that are close to 0 or 1 because they assume
incorrectly that words in a document are statistically independent given the
class. A score below 0.1 is therefore weak evidence that the classification
is actually correct. To do anything along these lines you need genuinely
independent sources of evidence. One approach that has received a lot of
attention recently is co-training:

http://citeseer.nj.nec.com/blum98combining.html
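
As a rough illustration of the overconfidence point, here is a minimal sketch
in Python. The function name, the word likelihoods (0.6 and 0.4) and the
uniform prior are all made up for the example; they are not taken from
Graham's filter. It scores a message containing the same weak clue repeated
n times, treating every occurrence as independent evidence, which is what
the naive Bayes assumption amounts to when words are strongly correlated.

```python
import math

def naive_bayes_posterior(p_word_given_spam, p_word_given_ham,
                          n_words, prior_spam=0.5):
    """P(spam | n_words clues) under the naive independence assumption.

    Hypothetical helper for illustration only: every clue has the same
    likelihoods, and each one is counted as separate evidence.
    """
    # Work in log space so that long messages do not underflow.
    log_spam = math.log(prior_spam) + n_words * math.log(p_word_given_spam)
    log_ham = math.log(1.0 - prior_spam) + n_words * math.log(p_word_given_ham)
    # Normalize the two class scores into a probability.
    top = max(log_spam, log_ham)
    e_spam = math.exp(log_spam - top)
    e_ham = math.exp(log_ham - top)
    return e_spam / (e_spam + e_ham)

if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, naive_bayes_posterior(0.6, 0.4, n))
```

If the 20 clues are really one piece of evidence seen 20 times, the honest
posterior stays at the single-clue value, but the naive product pushes it
very close to 1. Under these assumptions, a threshold such as 0.1 or 0.9
says little about how reliable the decision is.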

-- F



