[Spambayes] z

Tim Peters tim.one@comcast.net
Tue Oct 15 21:05:33 2002


[Tim]
>> If Rob is feeling particularly adventurous, it would be interesting (in
>> connection with z-combining) to transform the database spamprobs into
>> unit-normalized zscores via his RMS black magic, as an extra
>> step at the end of update_probabilities().  This wouldn't require another

[Gary Robinson]
> I didn't realize that this wasn't already being done.

It's unclear to me what "this" means.  RMS transformations?  No, we're not
doing those here.

> Yes I would recommend that somebody do this because I don't think we're
> really testing the z approach completely fairly until it is.

You tell me whether this is this <wink>; this is the code people have been
using:

    def z_spamprob(self, wordstream, evidence=False):
        from math import sqrt

        clues = self._getclues(wordstream)
        zsum = 0.0
        for prob, word, record in clues:
            if record is not None:  # else wordinfo doesn't know about it
                record.killcount += 1
            zsum += normIP(prob)

        n = len(clues)
        if n:
            # We've added n zscores from a unit normal distribution.  By the
            # central limit theorem, their mean is normally distributed with
            # mean 0 and sdev 1/sqrt(n).  So the zscore of zsum/n is
            # (zsum/n - 0)/(1/sqrt(n)) = zsum/n/(1/sqrt(n)) = zsum/sqrt(n).
            prob = normP(zsum / sqrt(n))
        else:
            prob = 0.5

normIP() maps a probability p to the real z such that the area under the
unit Gaussian from -inf to z is p.  normP() is the inverse, mapping real z
to the area under the unit Gaussian from -inf to z.  Example:

>>> normIP(.9)
1.2815502653713151
>>> normP(_)
0.8999997718215671
>>> normIP(.1)
-1.2815502653713149
>>> normP(_)
0.10000022817843296
>>>

normP() is accurate to about 14 decimal digits; normIP() is accurate to
about 6 decimal digits.
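
For anyone who wants to play along without the project code, here is a
minimal stand-in for the two functions.  It is a sketch, not the
implementation used above: it assumes Python 3.8+ and borrows math.erf and
statistics.NormalDist from the standard library, and only the names normP
and normIP are taken from this message.

```python
from math import erf, sqrt
from statistics import NormalDist  # assumption: Python 3.8 or later

_UNIT = NormalDist(mu=0.0, sigma=1.0)  # the unit Gaussian

def normP(z):
    """Area under the unit Gaussian from -inf to z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normIP(p):
    """The real z such that normP(z) == p; requires 0 < p < 1."""
    return _UNIT.inv_cdf(p)
```

With this stand-in, normIP(.9) comes out near 1.2815516, which matches the
value shown above to about 6 digits, in line with the stated accuracy.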

The word "prob" values here are your f(w).
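
Stripped of the database bookkeeping, the combining step in z_spamprob()
boils down to the following.  This is an illustrative sketch only: the name
z_combine is made up here, the input is a bare list of f(w) values, and
every value is assumed to lie strictly between 0 and 1 (the inverse is
undefined at the endpoints).

```python
from math import sqrt
from statistics import NormalDist  # assumption: Python 3.8 or later

_UNIT = NormalDist()  # unit Gaussian: mean 0, sdev 1

def z_combine(probs):
    """Combine per-word probabilities the way z_spamprob() does.

    Hypothetical helper: each p is mapped to a zscore, the zscores are
    summed, and the sum is rescaled by sqrt(n) before mapping back.
    """
    n = len(probs)
    if not n:
        return 0.5  # no clues, so no opinion
    zsum = sum(_UNIT.inv_cdf(p) for p in probs)
    return _UNIT.cdf(zsum / sqrt(n))
```

Two clues of .9 combine to something more extreme than .9, while a .9 and a
.1 cancel each other and give .5.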

> I'm not saying I believe that the z approach will turn out to be
> better -- I just don't know -- but it seems worth trying.

Happy to try, but really don't know how to proceed.  There seems no reason
to believe that the f(w) values lead to normIP() values that are *in fact*
unit-normal distributed on a random collection of words, and I don't
actually see a reason to believe that this would get closer to being true if
the f(w) were ranked first.

If we can define precisely what we mean by "a random collection of words",
the idea that the resulting normIP() values are or aren't unit-normal
distributed seems easily testable, though.
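
One way such a test might look, as a sketch under loud assumptions: the
helper names (ks_distance, summarize) are invented here, and the two inputs
are synthetic, generated only to show what the numbers look like in a case
that is unit-normal by construction and a case that clearly is not.  Real
f(w) values from a database would take their place.

```python
import random
from statistics import NormalDist, fmean, pstdev  # assumption: Python 3.8+

UNIT = NormalDist()  # unit Gaussian: mean 0, sdev 1

def ks_distance(zs):
    """Largest gap between the empirical CDF of zs and the unit-normal CDF
    (the Kolmogorov-Smirnov statistic)."""
    zs = sorted(zs)
    n = len(zs)
    d = 0.0
    for i, z in enumerate(zs):
        cdf = UNIT.cdf(z)
        # The empirical CDF steps from i/n to (i+1)/n at z; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def summarize(probs):
    """Return (mean, sdev, KS distance) of the zscores of probs."""
    zs = [UNIT.inv_cdf(p) for p in probs]
    return fmean(zs), pstdev(zs), ks_distance(zs)

# Synthetic inputs, NOT real spamprobs.
rng = random.Random(12345)
uniform = [rng.uniform(1e-9, 1 - 1e-9) for _ in range(5000)]
extreme = [rng.choice((0.01, 0.99)) for _ in range(5000)]

print("uniform:", summarize(uniform))
print("extreme:", summarize(extreme))
```

If the zscores really are unit-normal, the mean should be near 0, the sdev
near 1, and the KS distance small (the usual large-sample 5% cutoff is about
1.36/sqrt(n)).  Probabilities piled up near the extremes fail all three.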