[Spambayes] spamprob combining

Jim Bublitz jbublitz@nwinternet.com
Wed, 09 Oct 2002 15:55:47 -0700 (PDT)


On 08-Oct-02 Tim Peters wrote:
> It's hard to know what to make of this, especially in light of
> the claim that Gary-combining has been proven to be the most
> sensitive possible test for rejecting the hypothesis that a
> collection of probs is uniformly distributed.  At least in this
> test, Paul-combining seemed far more sensitive (even when the
> data is random <wink>).
 
> Intuitively, it *seems* like it would be good to get something
> not so insanely sensitive to random input as Paul-combining, but
> more sensitive to overwhelming amounts of evidence than
> Gary-combining.  Even forcing 50 spamprobs of 0.99, the latter
> only moves up to an average of 0.7:

Since my last msg was incomprehensible, I'm just going to attach my
code at the bottom and refer to it.

Graham's original score calculation - 

    product/(product + inverseProduct)

does give the kind of score distribution you described. If you
substitute Gary Robinson's suggestion (see below - last few lines),
the score distribution does spread out to the center a little bit.
You can get Robinson's scoring calculation (as below) to produce a
normal distribution around the mean ham or spam score if you
either:

a. Increase VECTOR_SIZE (max_discriminators??) - a value
of around 100 seems to do pretty well

b. Instead of selecting the most extreme N word probabilities
from the msg being tested, select the words randomly from the list
of words in the msg (not shown in code below). You immediately
(VECTOR_SIZE = 15) get a normal distribution around the means, but
accuracy sucks until you select 75 to 100 words/msg randomly.
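
Something along these lines (not in the code below - just to show
what I mean; it leans on the Database class and UNKNOWN_PROB defined
at the bottom):

    import random

    def random_vector(db, corpus, size=75):
        # draw 'size' tokens at random instead of taking the most
        # extreme probabilities; db is a Database, corpus a msg Corpus
        tokens = list(corpus.keys())
        chosen = random.sample(tokens, min(size, len(tokens)))
        return [(t, db.get(t, UNKNOWN_PROB)) for t in chosen]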

Neither (a) nor (b) works as well as the 15 most extreme words on
my test data. Also, Robinson's calculation doesn't produce ham scoring
0.99 or spam scoring 0.01 - in fact, the msgs that I had a hard time
classifying manually are (mostly) the ones that fall near the
cutoff.
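
To make the difference concrete, here is a toy comparison of the two
combining formulas (same math as in scan() below) on one set of 15
extreme word probabilities:

    def graham_score(probs):
        product = inverseProduct = 1.0
        for p in probs:
            product *= p
            inverseProduct *= 1.0 - p
        return product/(product + inverseProduct)

    def robinson_score(probs):
        n = float(len(probs))
        product = inverseProduct = 1.0
        for p in probs:
            product *= p
            inverseProduct *= 1.0 - p
        P = 1 - inverseProduct ** (1/n)
        Q = 1 - product ** (1/n)
        return (1 + (P - Q)/(P + Q))/2

    probs = [0.99]*12 + [0.01]*3      # a fairly spammy msg
    print(graham_score(probs))        # ~1.0, pinned to the extreme
    print(robinson_score(probs))      # ~0.62, pulled toward the middle

Graham saturates almost immediately; Robinson leaves room for the
score to reflect how mixed the evidence is.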

Note also that the code below will produce an unpredictable score
if the msg contains only 30 .01 words and 30 .99 words. It depends
on how pairs.sort (...) handles ties. Making the limits
asymmetrical (eg .989 and .01 instead of .99/.01) doesn't seem to
work very well.
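
Here's a tiny illustration of the tie problem (same ordering rule as
scan() below, just rewritten with a key function):

    # 60 equally extreme probs - every comparison is a tie, so the
    # stable sort keeps the input order (in scan(), whatever order
    # corpus.keys() happens to produce), and the surviving 15 - and
    # hence the score - depend on that arbitrary order
    pairs = [("ham%d" % i, 0.01) for i in range(30)] + \
            [("spam%d" % i, 0.99) for i in range(30)]
    pairs.sort(key=lambda pair: -abs(pair[1] - 0.5))
    significant = pairs[:15]    # all .01 here; reverse the input
                                # list and you get all .99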

The other thing that helps make the scores extreme in actual use is
that the distribution of word probabilities is extreme. For my
corpora using the code below I get 169378 unique tokens (from 24000
msgs, 50% spam):

Probability    Number of Tokens  % of Total
[0.00, 0.01)       46329           27.4%  (never in spam)
[0.99, 1.00)      104367           61.7%  (never in ham)
                                   -----
                                   89.1%
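
(For anyone who wants to reproduce this kind of tally on their own
data, something like the following over a built Database instance is
all it takes - buckets as in the table above:)

    def probability_histogram(db):
        # db is a Database: a dict of token -> probability
        lo  = sum(1 for p in db.values() if p < 0.01)   # [0.00, 0.01)
        hi  = sum(1 for p in db.values() if p >= 0.99)  # [0.99, 1.00)
        mid = len(db) - lo - hi                         # in between
        total = float(len(db))
        return lo/total, mid/total, hi/total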

From looking at failures (and assuming passes behave similarly), the
10.9% (~17000 tokens) in between 0.01 and 0.99 still do a lot of the
work, which makes sense, since those are the most commonly used
words.

My experience has been that the tail tips of the score distribution
maintain about the same distance from the mean score no matter what
you do. If you improve the shape of the distribution (make it look
more normal), the tails move outward by about as much as the
distribution has spread, so the ham and spam tails overlap more and
more, increasing the fp/fn rates. The little testing I did
on Spambayes (last week's CVS) seemed to show the same effect.

For the code below, if I train on 8000 msgs (50% spam) and then
test 200, retrain on those 200, and repeat for 16000 msgs, I get 4
fns (3 are identical msgs from the same sender with different dates,
all are Klez msgs) and 1 fp (an ISP msg "Routine Service
Maintenance"), which are fn and fp rates of 0.05% and 0.01%. The
failures all scored in the range [0.495, 0.511] (cutoff at 0.50).
I ran the SA Corpus today also and don't get any failures if I
train on 8K of my msgs and 50/100 of their msgs (worse results under
other conditions), but the sample sizes there are too small to build
an adequate training sample and still have enough test data to have
confidence in the results. I can post those results if anyone is
interested.
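
If it helps anyone replicate the setup, the run above amounts to
roughly this sketch (message loading/labelling is whatever you have
on hand; labeled_msgs is assumed to be a list of (text, is_spam)
pairs, first 8000 used for the initial training):

    def incremental_run(labeled_msgs, initial=8000, batch=200,
                        cutoff=0.50):
        ham, spam = Corpus(), Corpus()
        for text, is_spam in labeled_msgs[:initial]:
            (spam if is_spam else ham).process(text)

        fp = fn = 0
        rest = labeled_msgs[initial:]
        for start in range(0, len(rest), batch):
            db = Database(ham, spam)    # rebuild after each retraining
            for text, is_spam in rest[start:start + batch]:
                score = db.scan(Corpus(text))[2]
                if score >= cutoff and not is_spam:
                    fp += 1
                elif score < cutoff and is_spam:
                    fn += 1
                # retrain on every msg just tested
                (spam if is_spam else ham).process(text)
        return fp, fn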

Graham's method was basically designed to produce extreme scores,
and the distribution of words in the data seems to reinforce that.

If it's of any use to anybody (it's certainly beyond me), both the
distribution of msg scores and distribution of word probabilities
look like exponential or Weibull distributions. (They're "bathtub"
curves, if anyone is familiar with reliability statistics).

This is all based on my data, which is not the same as your data.
YMMV.

Jim


# classes posted to c.l.p by Erik Max Francis
# algorithm from Paul Graham ("A Plan for Spam")

import re

# was TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
# changed to catch Asian charsets
TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)
FREQUENCY_THRESHHOLD = 1 # was 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
# changed to improve distribution 'width' because
# of smaller token count in training data
GOOD_PROB = 0.0001 # was 0.01
BAD_PROB = 0.9999  # was 0.99
VECTOR_SIZE = 15
UNKNOWN_PROB = 0.5 # was 0.4 or 0.2

# remove mixed alphanumerics or strictly numeric:
#  eg: HM6116, 555N, 1234 (also Windows98, 133t, h4X0r)
pn1_re = re.compile (r"[a-zA-Z]+[0-9]+")
pn2_re = re.compile (r"[0-9]+[a-zA-Z]+")
num_re = re.compile (r"^[0-9]+")


class Corpus(dict):
    # instantiate one training Corpus for spam, one for ham,
    # and then one Corpus for each test msg as msgs are tested
    # (the msg Corpus instance is destroyed after
    # testing the msg)

    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    # process is used to extract tokens from msg,
    # either in building the training sample or
    # when testing a msg (can process entire msg
    # or one part of msg at a time)
    # 'data' is a string

    def process(self, data):
        tokens = TOKEN_RE.findall(str (data))
        if not len (tokens): return

        # added the first 'if' in the loop to reduce
        # total # of tokens by >75%
        deletes = 0
        for token in tokens:
            if (len (token) > 20)\
                or (pn1_re.search (token) != None)\
                or (pn2_re.search (token) != None)\
                or (num_re.search (token) != None):
                deletes += 1
                continue

            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1

        # count tokens, not msgs
        self.count += len (tokens) - deletes


class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    # 'build' constructs the dict of token: probability
    # run once after training from the ham/spam Corpus
    # instances; the ham/spam Corpus instances can be
    # destroyed (after saving?) after 'build' is run

    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
#        print ngood, nbad, float(nbad)/float(ngood)

        for token in good.keys() + bad.keys(): # doubles up, but
                                               # works
            if not self.has_key(token):
                g = GOOD_BIAS*good.get(token, 0)
                b = BAD_BIAS*bad.get(token, 0)

                if g + b >= FREQUENCY_THRESHHOLD:
                    # the 'min's are leftovers from counting
                    # msgs instead of tokens for ngood, nbad
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB,\
                        min(BAD_PROB,badMetric/total))

                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB)) \
            for token in corpus.keys()]

        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1]\
                         - 0.5)))
        significant = pairs[:VECTOR_SIZE]

        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob

# Graham scoring - was:
#        return pairs, significant, product/(product +\
#                                inverseProduct)
# 'pairs' and 'significant' added to assist data logging, evaluation

# Robinson scoring - don't know why, but this works great
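# (P is 1 minus the geometric mean of the (1 - prob) values, Q is
#  1 minus the geometric mean of the probs themselves; S maps
#  (P - Q)/(P + Q) onto [0, 1] with 0.5 as the neutral point)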

        n = float (len (significant)) # n could be < VECTOR_SIZE

        # div by zero possible if no headers (and msg has no body)
        try:
            P = 1 - inverseProduct ** (1/n)
            Q = 1 - product ** (1/n)
            S = (1 + (P - Q)/(P + Q))/2
        except ZeroDivisionError:
            S = 0.99

        return pairs, significant, S