[Spambayes] GBayes spam filtering
Tim Peters
tim.one@comcast.net
Fri, 06 Sep 2002 16:15:18 -0400
[Paul Svensson]
> ...
> When a user reads a message and finds that it's spam that got thru
> the filter, they need a way to send the message-id to the corpus, to
> flag it as spam. At this point, it would be a good idea to compare the
> histogram of the new spam to each histogram in the ham corpus, and remove
> any that are similar (any good ideas how to do the comparison?),
Read the "memory-based approach" stuff in

    Learning to Filter Spam E-Mail: A Comparison of a Naive
    Bayesian and a Memory-Based Approach
    http://arxiv.org/ftp/cs/papers/0009/0009009.pdf
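
For a rough feel of what comparing histograms could look like, here is a minimal sketch. It is not the metric from the paper: cosine similarity over raw word counts is swapped in as a simple stand-in, the whitespace tokenizer is deliberately crude, and the names histogram, cosine_similarity and similar_ham, along with the 0.8 threshold, are made up for illustration.

```python
import math
from collections import Counter

def histogram(text):
    """Word-count histogram of a message (crude whitespace tokenizer)."""
    return Counter(text.lower().split())

def cosine_similarity(h1, h2):
    """Cosine of the angle between two count vectors; 0.0 if either is empty."""
    dot = sum(count * h2[word] for word, count in h1.items() if word in h2)
    norm1 = math.sqrt(sum(c * c for c in h1.values()))
    norm2 = math.sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def similar_ham(new_spam, ham_corpus, threshold=0.8):
    """Indices of ham messages whose histogram resembles new_spam's.

    The threshold is an arbitrary assumption; a sensible value would
    have to come from measurements on real data.
    """
    target = histogram(new_spam)
    return [i for i, msg in enumerate(ham_corpus)
            if cosine_similarity(target, histogram(msg)) >= threshold]
```

Calling similar_ham(new_spam_text, ham_messages) gives candidates to drop from the ham corpus, or to flag as spam outright if a much stricter threshold is used.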
> or maybe if they are VERY similar simply flag them as spam. After
> recomputing the filter from the modified corpus, we could also re-filter
> the ham corpus, and remove more newfound spam that way.
>
> Characteristically of this system, the spam corpus will be
> reasonably clean (assuming the users don't abuse it too much), but
> the ham corpus will be quite dirty, containing spam that's not yet read,
> and spam that the recipient didn't bother to mark. I'm curious how
> GBayes would handle this situation; I assume the false negative rate
> would go up, but how much?
You can run an experiment and measure it. That's almost as easy as, and
much more reliable than, guessing <wink>.
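
One way such an experiment could be set up, as a sketch only: mix a known amount of unmarked spam into the ham training data, retrain, and count how much known spam in a held-out test set gets through. Nothing here is GBayes code. The train argument is a hypothetical stand-in for whatever builds the classifier, and contaminate, false_negative_rate, run_experiment and the default fractions are all assumptions made up for illustration.

```python
import random

def false_negative_rate(is_spam, spam_test):
    """Fraction of known-spam test messages the classifier lets through."""
    if not spam_test:
        raise ValueError("need at least one spam test message")
    missed = sum(1 for msg in spam_test if not is_spam(msg))
    return missed / len(spam_test)

def contaminate(ham_train, spam_pool, fraction, rng):
    """Return a copy of ham_train with unmarked spam mixed in.

    fraction is the number of injected spam messages relative to the
    size of the clean ham set, capped at the size of spam_pool.
    """
    n = min(int(round(fraction * len(ham_train))), len(spam_pool))
    return list(ham_train) + rng.sample(list(spam_pool), n)

def run_experiment(train, ham_train, spam_train, spam_pool, spam_test,
                   fractions=(0.0, 0.05, 0.1, 0.2), seed=0):
    """Map each contamination fraction to the measured false negative rate.

    train(ham, spam) is a hypothetical hook: it must return a callable
    is_spam(msg) -> bool built from the given training messages.
    """
    rng = random.Random(seed)
    results = {}
    for fraction in fractions:
        dirty_ham = contaminate(ham_train, spam_pool, fraction, rng)
        is_spam = train(dirty_ham, spam_train)
        results[fraction] = false_negative_rate(is_spam, spam_test)
    return results
```

The spam in spam_test should be kept out of spam_train so the rates mean something. Whether spam_pool overlaps spam_test depends on which question is being asked: overlap models spam sitting unread in the ham corpus while the filter is still expected to catch it.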