[spambayes-dev] Understanding Classifier Code

Tue Feb 17 22:35:29 EST 2004

I've been looking at ways of increasing the speed at which the classifier 
runs. I think that increasing the speed of the classifier would be one of 
the things required to make spam-bayes computationally worthwhile to run on 
a gateway for several thousand users. (others include keeping all word 
records in memory, and writing the really tight parts into C or assembler) 
No arguments yet please, I just need some help to understand what the 
classifier is doing and why. I'll paste the code and interject when i don't 
understand why something is happening that way. (I don't like to modify code 
without understanding it fully, you're asking for trouble if you do that.) I 
really appreciate any help you can give me on this, the more the merrier. 
And if my team manages to get spambayes working well on our server we'll be 
sure to share the modifications with you.

        H = S = 1.0
        Hexp = Sexp = 0

        clues = self._getclues(wordstream)
        for prob, word, record in clues:
            S *= 1.0 - prob
            H *= prob
            if S < 1e-200:  # prevent underflow
                S, e = frexp(S)
                Sexp += e
            if H < 1e-200:  # prevent underflow
                H, e = frexp(H)
                Hexp += e

Tell me, why a seperate spam/ham score at this point?

        # Compute the natural log of the product = sum of the logs:
        # ln(x * 2**i) = ln(x) + i * ln(2).
        S = ln(S) + Sexp * LN2
        H = ln(H) + Hexp * LN2

Okay i can see from the comment that this is equiv to  ln(x * 2**i)

But why take the logarithim of the final spam/ham prob?

        n = len(clues)
        if n:
            S = 1.0 - chi2Q(-2.0 * S, 2*n)
            H = 1.0 - chi2Q(-2.0 * H, 2*n)

Why multiply the score by -2? why double n?

def chi2Q(x2, v, exp=_math.exp, min=min):
    """Return prob(chisq >= x2, with v degrees of freedom).

    v must be even.
    """
    assert v & 1 == 0
    # XXX If x2 is very large, exp(-m) will underflow to 0.
    m = x2 / 2.0
    sum = term = exp(-m)
    for i in range(1, v//2):
        term *= m / i
        sum += term

What's going on here? Why do we take the exp() of -x2/2?

Why did we multiply x2 by 2 if we divide it by 2 again anyway?

What is the loop doing and why?

    # With small x2 and large v, accumulated roundoff error, plus error in
    # the platform exp(), can cause this to spill a few ULP above 1.0.  For
    # example, chi2Q(100, 300) on my box has sum == 1.0 + 2.0**-52 at this
    # point.  Returning a value even a teensy bit over 1.0 is no good.
    return min(sum, 1.0)

Thanks again!

_________________________________________________________________
Help STOP SPAM with the new MSN 8 and get 2 months FREE*   
http://join.msn.com/?page=dept/bcomm&pgmarket=en-ca&RU=http%3a%2f%2fjoin.msn.com%2f%3fpage%3dmisc%2fspecialoffers%26pgmarket%3den-ca