[Spambayes] Re: Further Improvement 2

Gary Robinson grobinson@transpose.com
Sun, 22 Sep 2002 12:34:53 -0400


With regard to the

          a + y
f(w) = ------------
       (a / x) + n

calc, Tim Peters says:
 
> What I can't tell you is
> *which* prior distribution it's assuming, or how changing "a" affects its
> shape.  But since we know for sure that getting rid of MINCOUNT made a real
> difference before, we also know for sure that the treatment of rare words is
> important, and that's what "a" is all about.  This is the point where Gary
> unexpectedly discovers he has just enough free time to tell us more about
> the specifics of this distribution <wink>.
> 

OK. 

I started to try to explain it in a way that people without Bayesian stats
in their background could understand, but it quickly looked like it would
take me more time than I can afford.  So I had to let that goal go -- but I
think that's OK, because I will point to another source where a reader can
go to get the parts of the explanation I leave out.

Note also that I want to change the form of the equation from the one
currently in the essay to a slightly simpler equivalent one. I will change
it in the essay too. The version we've been working with says:

          a + y
f(w) = ------------
       (a / x) + n

but

       (s * x) + y
f(w) = -----------
         s + n

is equivalent and more intuitive.

Moreover, it's hard to use "a" as a variable name when communicating by
text because it looks like a part of speech! ;)   And "s" has another
appropriate meaning, as we will see below.
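
For concreteness, here's what the calculation might look like in Python (a
rough sketch only; f_w and its argument names are just the symbols above,
not anything from the actual codebase):

    def f_w(y, n, s, x):
        """Expected spam probability for a word seen in n trials, y of
        them spam.  s is the strength given to the prior assumption, and
        x is the assumed probability before any data is seen."""
        return (s * x + y) / (s + n)

With no data at all (y = n = 0) this returns x, and as n grows the data
dominates.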


If you want to understand this in detail and haven't studied Bayesian
stats, please refer to this paper: http://citeseer.nj.nec.com/135897.html,
which is a tutorial by Microsoftee David Heckerman on Bayesian networks.
You only need to read through page 7.

From this point on I take it for granted that the reader understands as much
Bayesian theory as is communicated by Heckerman in those 7 pages. But even
without that understanding the reader will probably be able to get a very
rough high-level view out of this.

Let us start in an imaginary world where there are exactly as many spams as
nonspams in the training corpus. Then I'll bring it back to the real world
at the end. For now, there are the same number of spams and hams.

We are going to do a test that is the equivalent of the "experiment" of
tossing a coin multiple times to see whether it appears to be biased. There
are n trials. If we were tossing a coin, each coin toss would be a "trial"
and we'd be counting the number of heads. But in our case, the trial is
looking at the next occurrence of "Nigeria" in the training corpus and
seeing whether the email containing it is a spam.

Clearly, it's a binomial experiment: there are two values, yes or no. And
they are independent, at least if you only consider the first appearance of
Nigeria in a given email and don't use the other occurrences in that email
as trials. (So we'll assume we're doing that and bring it back to Grahamian
reality at the end.)

Now, it happens that if you have a binomial experiment and assume a beta
distribution for the prior, you can express the expectation that the
(n+1)th trial will come out "yes" as


(s * x) + y
-----------.  (See Heckerman.)
   s + n


A beta distribution is determined by 2 parameters. Heckerman calls them
alpha-h and alpha-t but let's just call them u and v since we can't send an
alpha character in a text email reliably (as far as I know).

The relationship between s, x, u, and v is as follows:

s = u + v
s * x = u

If you want to see the effects of changing s on the beta distribution, just
hold x constant and recalculate u and v, then graph it.
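
As a sketch of that recalculation (assuming scipy is available; the helper
name beta_params is just for illustration):

    from scipy.stats import beta

    def beta_params(s, x):
        """Convert the (s, x) parameterization to Heckerman's (u, v)."""
        u = s * x
        v = s - u                  # since s = u + v and s * x = u
        return u, v

    x = 0.5                        # assumed prior expectation
    for s in (0.1, 1.0, 10.0):
        u, v = beta_params(s, x)
        # The mean of Beta(u, v) is u / (u + v) = x for every s; a larger
        # s just concentrates the density around x.  To graph the shape,
        # plot beta.pdf(grid, u, v) over a grid of points in (0, 1).
        print(s, u, v, beta.mean(u, v))

    # Sanity check of the expression above: the posterior mean after y
    # "yes" outcomes in n trials is (u + y) / (s + n) = (s*x + y) / (s + n).
    s, y, n = 1.0, 3, 10
    u, v = beta_params(s, x)
    assert abs(beta.mean(u + y, v + (n - y)) - (s * x + y) / (s + n)) < 1e-12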

But actually looking at the beta distribution is just esoterica for our
application. The beauty of using s and x rather than u and v in our calcs is
that intuitively, we can see s as the "strength" (hence the use of "s") of
an original assumed expectation for y relative to how strongly we want to
consider our actual collected data. And the x represents that expectation.
And when n is 0, the expression directly gives our assumed expectation. So,
while u and v more directly describe the particular beta distribution, s and
x are much more helpful for our actual use of the calculation. In particular
s can be "tuned" according to whatever strength of the prior assumption
leads to the best performance, and x can be tuned for whatever assumed
Grahamian probability leads to the best performance. If desired, a
reasonable value for x can be calculated from the training data by averaging
Graham's probability, which we shall call p(w), for all words w.
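
For example (a sketch; graham_probs is a hypothetical dict mapping each
word w to its Grahamian probability p(w)):

    def estimate_x(graham_probs):
        """Estimate x as the average Grahamian probability over all words."""
        return sum(graham_probs.values()) / len(graham_probs)

    # e.g. estimate_x({"nigeria": 0.95, "meeting": 0.10, "offer": 0.80})
    # gives an assumed expectation x of about 0.62.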

So, suppose we have a corpus with the same number of spams as hams, and we
want to calculate an expectation for the word "Nigeria." We start looking
through emails, one by one, incrementing n for each email that contains
"Nigeria," and incrementing y if that email is a spam.

Before we look at our first email, our Bayesian calc gives us our assumed
expectation, x:

    (s * x) + 0
x = -----------.
      s + 0
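
Written as a loop, the whole procedure might look like this sketch (the
messages argument is a hypothetical list of (text, is_spam) pairs, and the
tokenization is deliberately crude):

    def spam_expectation(word, messages, s, x):
        """Walk the corpus, counting emails that contain `word` (n) and
        how many of those are spam (y), then return (s*x + y) / (s + n)."""
        n = y = 0
        for text, is_spam in messages:
            if word in text.split():   # count each email at most once
                n += 1
                if is_spam:
                    y += 1
        return (s * x + y) / (s + n)

With an empty corpus this returns x; with lots of data it approaches y/n.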

As we get more data, the expression moves gradually farther away from x,
asymptotically approaching the actual probability that an email containing
"Nigeria" is a spam.

This is all well and good, except for one thing: we don't have the same
number of spams and hams in our training data.

The question is, do we want to give more weight to evidence in favour of
spamminess or hamminess because of the fact that the particular individual
using a system built on this technique might happen to receive more ham than
spam (or vice versa)?

Graham implicitly assumes not. And I tend to agree. And most importantly, in
testing, this technique, with Graham's assumption, seems to "work."

Here I will quote from Tim Peters' description of the Grahamian method (by
personal email):

> suppose Nigeria appears in 50 ham and 50 spam...  In [Graham's]
> scheme, it also depends on the total number of hams and of spams.  Say there
> are 100 spams and 200 hams.  Then Graham effectively says "OK, it appears in
> half of the spams, but only a quarter of the hams, so if I see Nigeria it's
> twice as likely to be a spam, so the spamprob is really 2/3"

(If you're familiar with Graham's technique, note that we're ignoring the
fact that he multiplies the ham counts by 2; testing is showing that this is
not helpful once certain other changes, which we have made, are in place.)
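
As a sketch of that calculation (with the ham doubling left out, as just
noted; the name graham_p and its arguments are only for illustration):

    def graham_p(spam_count, ham_count, nspam, nham):
        """Grahamian probability that an email containing the word is
        spam, normalizing for the sizes of the two corpora."""
        spam_freq = spam_count / nspam   # fraction of spams with the word
        ham_freq = ham_count / nham      # fraction of hams with the word
        return spam_freq / (spam_freq + ham_freq)

    # Tim's example: 50 of 100 spams, 50 of 200 hams.
    # graham_p(50, 50, 100, 200) == 0.5 / (0.5 + 0.25) == 2/3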

Now, this Grahamian calculation is different from the simple count
described above for the binomial experiment used to derive an expectation.
However:

Again, let p(w) be the result of the Grahamian calculation described by Tim
Peters. It is easy to see that the expression (n * p(w)), where n is the
number of "Nigeria" instances looked at, approximates the value the counter
y described above WOULD have had if we actually had the same number of
spams and hams (of course this approximation gets better the larger the
number of spams and hams we have).

For instance, in Tim's example, let's assume that instead of 100 spams and
200 hams, there were 150 spams and 150 hams, leaving the total number of
emails at the same 300. Then "Nigeria" would have appeared in approximately
75 spams and 37.5 hams. So, our counters would have arrived at approximately
the values:

n = 112.5
y = 75

(And of course 75/112.5 = 2/3 = .6666.)

Let p(w) be the Grahamian probability, .666666.

Then our formula would be:

       (s * x) + 75
f(w) = ------------
        s + 112.5  


  (s * x) + (112.5*.666666)
= -------------------------
        s + 112.5 


  (s * x) + (n * p(w))
= --------------------
        s + n     


And we have arrived at the desired expression.
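
A quick numeric check of the example, for an arbitrary choice of s and x
(a sketch, nothing more):

    s, x = 1.0, 0.5                 # arbitrary values for the check
    p_w = 2.0 / 3.0                 # Grahamian probability from Tim's example
    n, y = 112.5, 75.0              # the balanced-corpus counts above

    direct = (s * x + y) / (s + n)
    via_pw = (s * x + n * p_w) / (s + n)
    assert abs(direct - via_pw) < 1e-9   # the two forms agree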

Note 1:  We could be more rigorous about the derivation of our expression
from Graham's. The expectation of ham or spam for Nigeria could both be
separately calculated using a binomial random variable with a beta prior.
But for corpuses of any size at all, the s and x terms would be completely
overwhelmed by the actual data, so there is really no point in going through
the exercise of doing that calculation. However, I believe this procedure
could be used to derive a fairly rigorous justification of the Grahamian
assumption, which I am not taking the time to do here.

Note 2: In calculating p(w) Graham counts every instance of word w in every
email in which it appears. Obviously, if a word appears once in an email,
there is a greater probability that it will appear again in that same email
than if it hadn't already appeared at all. So, the random variable is not
really independent under Graham's technique, which is one reason why, in the
description above, we only count the first occurrence. However, we are
pragmatic and whatever works best in practice is what we should do. There is
some evidence at this point that using the Graham counting technique leads
to slightly better results than using the "pure" technique above. This may
be because it is not ignoring any of the data. So, p(w) and n should simply be
computed the way that gives the best results.
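
To make the difference concrete, here is a sketch of the two counting
styles for a single tokenized email (the tokens and the helper name are
hypothetical):

    def word_count(tokens, word, count_all=False):
        """Count `word` in one email: every instance (Graham-style) or at
        most one (the "pure" independent-trial version above)."""
        if count_all:
            return tokens.count(word)
        return 1 if word in tokens else 0

    tokens = "nigeria bank nigeria transfer".split()
    # word_count(tokens, "nigeria")                  -> 1
    # word_count(tokens, "nigeria", count_all=True)  -> 2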

Note 3: One could invoke the "naive Bayesian assumption" of independence
even where the variables are known not to be independent. That has been
proven to be an acceptable assumption in certain contexts, usually having to
do with classification. I don't think this application meets the
requirements for invoking those proofs but I haven't studied that question
in detail.