[Spambayes] GBayes spam filtering

Tim Peters tim.one@comcast.net
Fri, 06 Sep 2002 16:15:18 -0400


[Paul Svensson]
> ...
> When a user reads a message and finds that it's spam that got thru
> the filter, they need a way to send the message-id to the corpus, to
> flag it as spam.  At this point, it would be a good idea to compare the
> histogram of the new spam to each histogram in the ham corpus, and remove
> any that are similar (any good ideas how to do the comparison?),

Read the "memory-based approach" stuff in

    Learning to Filter Spam E-Mail:  A Comparison of a Naive
    Bayesian and a Memory-Based Approach

    http://arxiv.org/ftp/cs/papers/0009/0009009.pdf
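
For a concrete starting point on the histogram comparison, here is a minimal
sketch. It is not the paper's method: it assumes cosine similarity over raw
word counts as the distance measure, and the tokenizer, the 0.9 threshold, and
the names histogram, cosine and similar_ham are all made up for illustration.

```python
import math
import re
from collections import Counter

def histogram(text):
    """Word-count histogram of a message (crude, assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(h1, h2):
    """Cosine similarity of two histograms: 0.0 = disjoint, 1.0 = same shape."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * h2[word] for word, count in h1.items())
    norm1 = math.sqrt(sum(c * c for c in h1.values()))
    norm2 = math.sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def similar_ham(new_spam, ham_corpus, threshold=0.9):
    """Return (msg_id, similarity) for ham messages that resemble new_spam.

    ham_corpus is assumed to be a dict mapping message-id to message text.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    spam_hist = histogram(new_spam)
    hits = []
    for msg_id, text in ham_corpus.items():
        sim = cosine(spam_hist, histogram(text))
        if sim >= threshold:
            hits.append((msg_id, sim))
    # Most similar first, so the likeliest mislabeled messages come out on top.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Anything similar_ham returns is a candidate for removal from the ham corpus;
where to put the threshold is exactly the kind of thing to settle by testing.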


> or maybe if they are VERY similar simply flag them as spam.  After
> recomputing the filter from the modified corpus, we could also re-filter
> the ham corpus, and remove more newfound spam that way.
>
> Characteristic of this system, the spam corpus will be
> reasonably clean (assuming the users don't abuse it too much), but
> the ham corpus will be quite dirty, containing spam that's not yet read,
> and spam that the recipient didn't bother to mark.  I'm curious how
> GBayes would handle this situation; I assume the false negative rate
> would go up, but how much?

You can run an experiment and measure it.  That's almost as easy as, and
much more reliable than, guessing <wink>.
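
One way to set up such an experiment, as a rough sketch only: move a known
fraction of the training spam into the ham corpus, retrain, and count how much
held-out spam gets through. The classifier below is a toy naive Bayes stand-in
so the example runs on its own; it is not GBayes, and every name in it (train,
is_spam, dirty_ham_experiment) is hypothetical. Swap in the real filter and
real corpora to get numbers that mean something.

```python
import math
import random
from collections import Counter

def train(ham, spam):
    """Count, per class, how many messages each word appears in."""
    ham_counts, spam_counts = Counter(), Counter()
    for msg in ham:
        ham_counts.update(set(msg.lower().split()))
    for msg in spam:
        spam_counts.update(set(msg.lower().split()))
    return ham_counts, spam_counts, len(ham), len(spam)

def is_spam(model, msg):
    """Toy naive Bayes decision: sum of smoothed log-odds, spam if positive."""
    ham_counts, spam_counts, nham, nspam = model
    score = math.log((nspam + 1) / (nham + 1))  # smoothed class prior
    for word in set(msg.lower().split()):
        p_spam = (spam_counts[word] + 1) / (nspam + 2)
        p_ham = (ham_counts[word] + 1) / (nham + 2)
        score += math.log(p_spam / p_ham)
    return score > 0

def false_negative_rate(model, test_spam):
    """Fraction of known spam that the model lets through as ham."""
    missed = sum(1 for msg in test_spam if not is_spam(model, msg))
    return missed / len(test_spam)

def dirty_ham_experiment(ham, train_spam, test_spam, fractions, seed=0):
    """Map each contamination fraction to the resulting false negative rate."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    results = {}
    for frac in fractions:
        shuffled = train_spam[:]
        rng.shuffle(shuffled)
        n_leak = int(frac * len(shuffled))
        # "leaked" spam plays the part of unread or unmarked spam in the ham.
        leaked, kept = shuffled[:n_leak], shuffled[n_leak:]
        model = train(ham + leaked, kept)
        results[frac] = false_negative_rate(model, test_spam)
    return results
```

Calling dirty_ham_experiment(ham, train_spam, test_spam, [0.0, 0.1, 0.25, 0.5])
on real corpora gives one false negative rate per contamination level, which
answers the "how much" question directly.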