[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Tim Peters tim@zope.com
Thu, 22 Aug 2002 15:37:26 -0400


[Charles Cazabon]
> Is there an easy fix to this problem?

I don't know that there is "a problem".  The step is dubious, but other
steps are also dubious, and naive Bayes itself is dubious (the assumption
that word probabilities are independent is simply false in this application).
But outside of, perhaps, quantum chromodynamics, all models of reality are
more or less dubious, and it's darned hard to say whether the fudging needed
to make them appear to work is lucky or principled, robust or brittle.  The
more gross deviations there are from a model, though, the less one can
appeal to the model for guidance.  In the limit, you can end up with a pile
of arbitrary tricks, with no idea which gimmicks matter anymore (given
enough free parameters to fiddle, you can train even a horribly
inappropriate model to fit a specific blob of data exactly).

> I implemented this in Python after reading about it on the weekend, and it
> might explain why my results are not quite as fabulous as the author noted
> (I'm getting more false positives than he claimed he was).

How many lines of code do you have?  That's a gross lower bound on the
number of places it might have screwed up <wink>.

> Note that I'm not so good with the above notation; I'm more at home with
> plain algebraic stuff :).

It's all plain -- and simple -- algebra; it's just long-winded.  You may be
confusing yourself, e.g., by reading P(S|X) as if it's a complex expression
in its own right.  But it's not -- given an incarnation of the universe, it
denotes a fixed number.  Think of it as "w" instead <wink>.

Let's get concrete.  You have a spam corpus with 1000 messages.  100 of them
contain the word x, and 500 contain the word y.  Then

    P(X|S) = 100/1000 = 1/10
    P(Y|S) = 500/1000 = 1/2

You also have a non-spam corpus with 2000 messages.  100 of them contain x
too, and 500 contain y.  Then

    P(X|not-S) = 100/2000 = 1/20
    P(Y|not-S) = 500/2000 = 1/4
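
If it helps to see the bookkeeping spelled out, here's a minimal Python
sketch of this step (the names are made up for this example -- they're not
from the sandbox code -- and I'm using the fractions module just to keep
the arithmetic exact):

    from fractions import Fraction

    # Counts from the example:  1000 spams and 2000 non-spams; x appears
    # in 100 messages of each corpus, y in 500 of each.
    n_spam, n_ham = 1000, 2000
    spam_count = {'x': 100, 'y': 500}
    ham_count = {'x': 100, 'y': 500}

    p_given_spam = {w: Fraction(c, n_spam) for w, c in spam_count.items()}
    p_given_ham = {w: Fraction(c, n_ham) for w, c in ham_count.items()}

    print(p_given_spam['x'], p_given_spam['y'])   # 1/10 1/2
    print(p_given_ham['x'], p_given_ham['y'])     # 1/20 1/4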

This is the entire universe, and it's all you know.  If you pick a message
at random, what's P(S) = the probability that it's from the spam corpus?
It's trivial:

    P(S) = 1000/(1000+2000) = 1/3
and
    P(not-S) = 2/3
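
In code, with the same made-up corpus sizes:

    from fractions import Fraction

    n_spam, n_ham = 1000, 2000
    p_spam = Fraction(n_spam, n_spam + n_ham)   # P(S) = 1/3
    p_ham = 1 - p_spam                          # P(not-S) = 2/3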

Now *given that* you've picked a message at random, and *know* it contains
x, but don't know anything else, what's the probability it's spam (== what's
P(S|X)?).  Well, it has to be one of the 100 spam messages that contains x,
or one of the 100 non-spam messages that contains x.  They're all equally
likely, so

    P(S|X) = 100/(100+100) = 1/2
and
    P(S|Y) = 500/(500+500) = 1/2

too by the same reasoning.  P(not-S|X) and P(not-S|Y) are also 1/2 each.
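
As a sketch, this is just counting -- no Bayes formula needed at this
point, only the made-up counts from above:

    from fractions import Fraction

    # Messages containing x:  100 from the spam corpus, 100 from the
    # non-spam corpus, all equally likely.
    p_spam_given_x = Fraction(100, 100 + 100)   # P(S|X) = 1/2
    # Messages containing y:  500 from each corpus.
    p_spam_given_y = Fraction(500, 500 + 500)   # P(S|Y) = 1/2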

So far, there's nothing a reasonable person can argue with.  Given that this
is our universe, these numbers fall directly out of what reasonable people
agree "probability" means.

When it comes to P(S|X and Y), life is more difficult.  If we *agree* to
assume that word probabilities are independent (which is itself dubious, but
has the virtue of appearing to work pretty well anyway), then the number of
messages in the spam corpus we can expect to contain both X and Y is

    P(X|S)*P(Y|S)*number_spams = (1/10)*(1/2)*1000 = 50

Similarly the # of non-spam messages we can expect to contain both X and Y
is

    (1/20)*(1/4)*2000 = 25

Since that's all the messages that contain both X and Y, the probability
that a message containing both X and Y is spam is

    P(S | X and Y) = 50/(50 + 25) = 2/3
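
Here's that computation as a sketch (again with made-up names, and under
the independence assumption spelled out above):

    from fractions import Fraction

    n_spam, n_ham = 1000, 2000
    p_x_given_spam, p_y_given_spam = Fraction(1, 10), Fraction(1, 2)
    p_x_given_ham, p_y_given_ham = Fraction(1, 20), Fraction(1, 4)

    # Expected number of messages containing both x and y, per corpus,
    # *if* word probabilities really were independent.
    both_spam = p_x_given_spam * p_y_given_spam * n_spam   # 50
    both_ham = p_x_given_ham * p_y_given_ham * n_ham       # 25

    p_spam_given_xy = both_spam / (both_spam + both_ham)
    print(p_spam_given_xy)   # 2/3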

Note that this agrees with the formula whose derivation I spelled out from
first principles:

                P(S|X)*P(S|Y)/P(S)
 --------------------------------------------------- =
 P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)

                (1/2)*(1/2)/(1/3)         2
 -------------------------------------- = -
 (1/2)*(1/2)/(1/3)  + (1/2)*(1/2)/(2/3)   3
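
Plugging the numbers in via code (variable names are just for this
example) gives the same thing:

    from fractions import Fraction

    p_s = Fraction(1, 3)                  # P(S)
    p_not_s = 1 - p_s                     # P(not-S)
    p_s_x = p_s_y = Fraction(1, 2)        # P(S|X), P(S|Y)
    p_not_s_x = p_not_s_y = Fraction(1, 2)

    num = p_s_x * p_s_y / p_s
    den = num + p_not_s_x * p_not_s_y / p_not_s
    print(num / den)                      # 2/3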


It's educational to work through Graham's formulation on the same example.
To start with, P(S|X) is approximated by different means, fudging the
"good count" by a factor of 2, giving

    P'(S|X) = (100/1000) / (100/1000 + 2*100/2000) = 1/2

and similarly for P'(S|Y).  These are the same probabilities I gave above,
but they're the same only because I deliberately picked a spam corpus
exactly half the size of the non-spam corpus, knowing in advance
that this factor-of-2 fudge would make the results the same in the end.  The
only difference in what's computed then is in the scoring step, where
Graham's formulation computes

    P'(S | X and Y) =  (1/2)*(1/2)/((1/2)*(1/2)+(1/2)*(1/2)) = 1/2

instead of the 2/3 that's actually true in this universe.  If the corpus
sizes diverge more, the discrepancies at the end grow too, and the way of
computing P(S|X) at the start also diverges.
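
For the record, here's a sketch of that formulation on the same example
(the function name is mine; this follows the recipe as described above,
with the factor-of-2 fudge but without Graham's min/max clamping):

    from fractions import Fraction

    n_spam, n_ham = 1000, 2000
    spam_count = {'x': 100, 'y': 500}
    ham_count = {'x': 100, 'y': 500}

    def graham_word_prob(word):
        bad = Fraction(spam_count[word], n_spam)
        good = Fraction(2 * ham_count[word], n_ham)   # the "good count" fudge
        return bad / (bad + good)

    p1, p2 = graham_word_prob('x'), graham_word_prob('y')
    # Graham's combining step for two words.
    score = p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))
    print(p1, p2, score)   # 1/2 1/2 1/2 -- vs. the "true" 2/3 above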

Is that good or bad?  I say no more now than that it's dubious <wink>.

> But the more interesting Python question:  I'm running into some
> performance problems with my implementation.  Details:

English never helps with these.  Whittle it down and post actual code to
comp.lang.python for help.  Or study the sandbox code in the CVS repository
(see the Subject line of this msg) -- it's not having any speed problems.