[Spambayes] RE: Further Improvement 2

Tim Peters tim.one@comcast.net
Sun, 22 Sep 2002 15:37:00 -0400


[Gary Robinson]
> With regard to the
>
>           a + y
> f(w) = ------------
>        (a / x) + n
>
> calc ...

Thank you, Gary!  This is very helpful.  May I check this explanation into
the project (with attribution, of course, and a link to your web page)?
We're very low on docs explaining the underpinnings of our code, and this
was wonderfully lucid.

> ...
> Note 3: One could invoke the "naive Bayesian assumption" of independence
> even where the variables are known to not be independent. That has been
> proven to be an acceptable assumption in certain contexts usually
> having to do with classification. I don't think this application meets
> the requirements for invoking those proofs but I haven't studied that
> question in detail.

We wrestled a lot with what to do about HTML markup in this project.
Whether a msg contains, e.g., "<html>", and whether it contains "<br>", are
nearly 100% correlated in the real world.  Both are strong spam indicators
simply because so much spam comes in HTML form.  This makes naive Bayes
*grossly* overestimate the probability that a doc containing both is spam
(P(Spam|<html> and <br>) ~= P(Spam|<html>) in real life, but naive Bayes
thinks it's much more likely if it contains both).  Multiply that by a
hundred other indicators nearly unique to HTML markup, and, in the end, we
had to resort to stripping HTML markup entirely, else no HTML message could
escape being called spam.  We still seem to get bad effects from the fact
that P(Spam|content-type=text/html) is not independent of
P(Spam|"&nbsp;") in real life.
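
To make the overestimate concrete, here is a minimal sketch of textbook
naive Bayes combining, not the combining scheme the project actually uses.
The numbers are made up for illustration: a 50% spam prior, and a token
that appears in 80% of spam and 20% of ham.  The function name
naive_bayes_posterior is hypothetical.  If "<br>" appears in exactly the
messages that contain "<html>", the true probability given both is the
same as given one of them, but treating them as independent counts the
same evidence twice.

```python
def naive_bayes_posterior(prior_spam, likelihoods):
    """Combine per-token evidence under the naive independence assumption.

    likelihoods is a list of (P(token|spam), P(token|ham)) pairs.
    All values used below are illustrative assumptions, not measured data.
    """
    spam = prior_spam
    ham = 1.0 - prior_spam
    for p_given_spam, p_given_ham in likelihoods:
        spam *= p_given_spam
        ham *= p_given_ham
    return spam / (spam + ham)


# Assumed likelihoods for "<html>"; "<br>" is taken to be perfectly
# correlated with it, so it has the same likelihoods and adds no new
# information in reality.
html = (0.8, 0.2)
br = (0.8, 0.2)

one = naive_bayes_posterior(0.5, [html])            # the true answer here
both = naive_bayes_posterior(0.5, [html, br])       # evidence counted twice
hundred = naive_bayes_posterior(0.5, [html] * 100)  # 100 redundant clues

print(f"one token:   {one:.4f}")
print(f"two tokens:  {both:.4f}")
print(f"100 tokens:  {hundred:.4f}")
```

With these assumed numbers a single token gives 0.8, which is also the
correct value when both tokens are present, while naive combining of the
two gives 16/17 (about 0.94).  A hundred such redundant clues push the
estimate to essentially 1.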