[Spambayes] Seeking a giant idle machine w/ a miserable corpus

Tim Peters tim.one@comcast.net
Sun Nov 17 05:19:55 2002


[Tim]
> ...
> The "missing test" here is exact bigrams (no hash convolutions).  I'll
> try that later; may not have enough RAM for that, but should.

I didn't, but by cutting 6,000 ham out of my test data managed to complete
the test in < 3 hours with lots of disk thrashing.  cv is baseline, bi21 the
bigram gimmick w/ 2**21 hash buckets, and bix the exact bigram run:

filename:           cv         bi21          bix
ham:spam:  20000:14000  20000:14000  14000:14000
fp total:        3       0       0
fp %:         0.01    0.00    0.00
fn total:        0       0       0
fn %:         0.00    0.00    0.00
unsure t:       91     300      98
unsure %:     0.27    0.88    0.35
real cost:  $48.20  $60.00  $19.60
best cost:  $17.80  $23.20   $6.80
h mean:       0.24    0.25    0.25
h sdev:       2.73    2.61    2.82
s mean:      99.95   99.41   99.91
s sdev:       1.40    4.67    1.94
mean diff:   99.71   99.16   99.66
k:           24.14   13.62   20.94

There are simply no surprises in the bix output; under the covers it's
beautiful:

-> <stat> Ham scores for all runs: 14000 items; mean 0.25; sdev 2.82
-> <stat> min 0; median 3.88578e-014; max 69.8223
-> <stat> percentiles: 5% 0; 25% 0; 75% 7.29026e-009; 95% 0.00817516

3 ham scored between 50 and 50.5; 1 ham scored 69.8; all other ham scored
below 50.

-> <stat> Spam scores for all runs: 14000 items; mean 99.91; sdev 1.94
-> <stat> min 30.4227; median 100; max 100
-> <stat> percentiles: 5% 100; 25% 100; 75% 100; 95% 100

2 spam scored between 49.5 and 50.5; 1 spam scored 30.4; all other spam
scored above 50.

The best-cost cutoffs reflect this sharp separation:

-> best cost for all runs: $6.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 9 cutoff pairs
-> smallest ham & spam cutoffs 0.495 & 0.7
->     fp 0; fn 1; unsure ham 10; unsure spam 19
->     fp rate 0%; fn rate 0.00714%; unsure rate 0.104%
-> largest ham & spam cutoffs 0.495 & 0.74
->     fp 0; fn 1; unsure ham 10; unsure spam 19
->     fp rate 0%; fn rate 0.00714%; unsure rate 0.104%


The highest-scoring ham is the only reason for the high suggested spam
cutoff (else .51 would have worked fine and eliminated almost all the
unsures), and was the 2nd-worst in the baseline test:  poor Vickie Mills
suffering from her obnoxious employer-generated sig.  Bigrams helped her,
but it's unclear why(!).  Her three strongest ham clues:

prob('noheader:return-path noheader:abuse-reports-to') = 0.000364471
prob('subject:Python') = 0.00116829
prob('header:From:1 header:MIME-Version:1') = 0.00129855

Now the order in which noheader metatokens get generated is an accident
inherited from the order in which a Set happens to enumerate its elements.
What I fear here is that I've stumbled into a too-good systematic difference
between c.l.py headers and BruceG's spam headers:  not in the *set* of
header lines they contain (I'm ignoring all headers that might matter), but
something much subtler having to do with how the set of all header lines in
existence may affect dict iteration order for the few headers I actually
look at.  Bigrams involving header lines are striking in the test output;
e.g., this one is a strong spam clue, and is almost as mysterious:

    prob('header:Subject:1 noheader:errors-to') = 0.96321

If we pursue this approach, this will take much thought.
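
For concreteness, here's a toy sketch (not the tokenizer's actual code) of
how enumeration order decides which noheader bigrams can exist at all:  the
adjacent pairs are a function of hash order, not of anything in the message.

    from sets import Set   # the builtin set() on modern Pythons
    missing = Set(['return-path', 'abuse-reports-to', 'x-face'])
    tokens = ['noheader:' + h for h in missing]   # arbitrary order
    pairs = [tokens[i] + ' ' + tokens[i+1]
             for i in range(len(tokens) - 1)]
    print(pairs)   # which pairs exist depends on enumeration order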

The lowest-scoring spam was the one with a uuencoded text body we throw away
unlooked-at.  A header bigram boosted its spam score in a clearly helpful
way:

    prob('from:addr:hotmail.com from:no real name:2**0') = 0.985839

Indeed, something sent from hotmail without a real name reeks to heaven of
spam, much more so than hotmail alone or no-real-name alone.

OTOH, check out the 6 strongest ham clues for the same spam:

prob('header:X-Complaints-To:1') = 4.00324e-005
prob('header:From:1 header:Date:1') = 0.000279524
prob('header:Subject:1 noheader:received') = 0.000382997
prob('noheader:x-face noheader:return-path') = 0.00045433
prob('header:Organization:1 header:Message-ID:1') = 0.00137332
prob('noheader:in-reply-to noheader:reply-to') = 0.0110694

The X-Complaints-To header has always helped this guy a lot, but why the 5
new header-bigram combos here are hammish remains a mystery.

The memory burden of this run is also a mystery.  I played with mixing
unigrams and bigrams before, and recall the c.l.py test topping out at about
120MB.  This run was over 256MB (hence the massive swapping on my 256MB
box), and it wasn't even a full run.  A difference is that, in my previous
runs, the *tokenizer* generated the unigrams and bigrams, and only for the
body.  In this run the classifier generated them, and header tokens got into
the mix too.  I suppose header bigrams are large (as strings), and that
there are a lot of them -- heck, for lots of msgs, even just the text of the
headers I look at here is larger than the msg bodies.
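
For a rough sense of scale, each distinct bigram adds a new dict key (the
two unigrams joined by a space) plus a WordInfo record.  A back-of-envelope
sketch, using tokens from the clues above:

    uni1 = 'header:X-Complaints-To:1'
    uni2 = 'noheader:received'
    bi = uni1 + ' ' + uni2
    print(len(bi))   # 42 characters, before any per-object overhead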

So while this scheme may have real promise, mysteries and practical problems
abound.  I'm out of time for looking at this.  If someone wants to pursue
it, I'll attach the classifier I used.  Based on everything I've done here,
two suggestions:

1. Every time we've tried them, hash schemes have been unsatisfying,
   due to the human-incomprehensible mistakes they make.  You're
   unlikely to see mistakes on a small run, though -- this is a
   percentage game that *eventually* loses big.

2. I've got some evidence to believe that exact bigrams may help, but
   saw nothing in the exact trigram runs to suggest they buy
   anything over that.  Trigrams helped on-topic c.l.py conference
   announcements, but they also hurt them, and *that* class of
   problem msg is already solidly Unsure under the unigram scheme.
-------------- next part --------------
# An implementation of a Bayes-like spam classifier.
#
# Paul Graham's original description:
#
#     http://www.paulgraham.com/spam.html
#
# A highly fiddled version of that can be retrieved from our CVS repository,
# via tag Last-Graham.  This made many demonstrated improvements in error
# rates over Paul's original description.
#
# This code implements Gary Robinson's suggestions, the core of which are
# well explained on his webpage:
#
#    http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
#
# This is theoretically cleaner, and in testing has performed at least as
# well as our highly tuned Graham scheme did, often slightly better, and
# sometimes much better.  It also has "a middle ground", which people like:
# the scores under Paul's scheme were almost always very near 0 or very near
# 1, whether or not the classification was correct.  The false positives
# and false negatives under Gary's basic scheme (use_gary_combining) generally
# score in a narrow range around the corpus's best spam_cutoff value.
# However, it doesn't appear possible to guess the best spam_cutoff value in
# advance, and it's touchy.
#
# The chi-combining scheme used by default here gets closer to the theoretical
# basis of Gary's combining scheme, and does give extreme scores, but also
# has a very useful middle ground (small # of msgs spread across a large range
# of scores, and good cutoff values aren't touchy).
#
# This implementation is due to Tim Peters et alia.

from __future__ import generators

import math
import time
from sets import Set

from Options import options
from chi2 import chi2Q

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0


LN2 = math.log(2)       # used frequently by chi-combining

PICKLE_VERSION = 1

class WordInfo(object):
    __slots__ = ('atime',     # when this record was last used by scoring(*)
                 'spamcount', # # of spams in which this word appears
                 'hamcount',  # # of hams in which this word appears
                 'killcount', # # of times this made it to spamprob()'s nbest
                 'spamprob',  # prob(spam | msg contains this word)
                )

    # Invariant:  For use in a classifier database, at least one of
    # spamcount and hamcount must be non-zero.
    #
    # (*)atime is the last access time, a UTC time.time() value.  It's the
    # most recent time this word was used by scoring (i.e., by spamprob(),
    # not by training via learn()); or, if the word has never been used by
    # scoring, the time the word record was created (i.e., by learn()).
    # One good criterion for identifying junk (word records that have no
    # value) is to delete words that haven't been used for a long time.
    # Perhaps they were typos, or unique identifiers, or relevant to a
    # once-hot topic or scam that's fallen out of favor.  Whatever, if
    # a word is no longer being used, it's just wasting space.

    def __init__(self, atime, spamprob=options.unknown_word_prob):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

    def __repr__(self):
        return "WordInfo%r" % repr((self.atime, self.spamcount,
                                    self.hamcount, self.killcount,
                                    self.spamprob))

    def __getstate__(self):
        return (self.atime, self.spamcount, self.hamcount, self.killcount,
                self.spamprob)

    def __setstate__(self, t):
        (self.atime, self.spamcount, self.hamcount, self.killcount,
         self.spamprob) = t

class Bayes:
    # Defining __slots__ here made Jeremy's life needlessly difficult when
    # trying to hook this all up to ZODB as a persistent object.  There's
    # no space benefit worth getting from slots in this class; slots were
    # used solely to help catch errors earlier, when this code was changing
    # rapidly.

    #__slots__ = ('wordinfo',  # map word to WordInfo record
    #             'nspam',     # number of spam messages learn() has seen
    #             'nham',      # number of non-spam messages learn() has seen
    #            )

    # allow a subclass to use a different class for WordInfo
    WordInfoClass = WordInfo

    def __init__(self):
        self.wordinfo = {}
        self.nspam = self.nham = 0

    def __getstate__(self):
        return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham

    def __setstate__(self, t):
        if t[0] != PICKLE_VERSION:
            raise ValueError("Can't unpickle -- version %s unknown" % t[0])
        self.wordinfo, self.nspam, self.nham = t[1:]

    # spamprob() implementations.  One of the following is aliased to
    # spamprob, depending on option settings.

    def gary_spamprob(self, wordstream, evidence=False):
        """Return best-guess probability that wordstream is spam.

        wordstream is an iterable object producing words.
        The return value is a float in [0.0, 1.0].

        If optional arg evidence is True, the return value is a pair
            probability, evidence
        where evidence is a list of (word, probability) pairs.
        """

        from math import frexp

        # This combination method is due to Gary Robinson; see
        # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

        # The real P = this P times 2**Pexp.  Likewise for Q.  We're
        # simulating unbounded dynamic float range by hand.  If this pans
        # out, *maybe* we should store logarithms in the database instead
        # and just add them here.  But I like keeping raw counts in the
        # database (they're easy to understand, manipulate and combine),
        # and there's no evidence that this simulation is a significant
        # expense.
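        # (math.frexp(x) returns (m, e) with x == m * 2**e and
        # 0.5 <= abs(m) < 1; e.g., frexp(0.001) ~= (0.512, -9).)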
        P = Q = 1.0
        Pexp = Qexp = 0
        clues = self._getclues(wordstream)
        for prob, word, record in clues:
            if record is not None:  # else wordinfo doesn't know about it
                record.killcount += 1
            P *= 1.0 - prob
            Q *= prob
            if P < 1e-200:  # move back into range
                P, e = frexp(P)
                Pexp += e
            if Q < 1e-200:  # move back into range
                Q, e = frexp(Q)
                Qexp += e

        P, e = frexp(P)
        Pexp += e
        Q, e = frexp(Q)
        Qexp += e

        num_clues = len(clues)
        if num_clues:
            #P = 1.0 - P**(1./num_clues)
            #Q = 1.0 - Q**(1./num_clues)
            #
            # (x*2**e)**n = x**n * 2**(e*n)
            n = 1.0 / num_clues
            P = 1.0 - P**n * 2.0**(Pexp * n)
            Q = 1.0 - Q**n * 2.0**(Qexp * n)

            # (P-Q)/(P+Q) is in -1 .. 1; scaling into 0 .. 1 gives
            #     ((P-Q)/(P+Q) + 1) / 2 =
            #     (P-Q+P+Q) / (P+Q) / 2 =
            #     2*P / (P+Q) / 2 =
            #     P / (P+Q)
            prob = P/(P+Q)
        else:
            prob = 0.5

        if evidence:
            clues = [(w, p) for p, w, r in clues]
            clues.sort(lambda a, b: cmp(a[1], b[1]))
            return prob, clues
        else:
            return prob

    if options.use_gary_combining:
        spamprob = gary_spamprob

    # Across vectors of length n, containing random uniformly-distributed
    # probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution
    # with 2*n degrees of freedom.  This has been proven (in some
    # appropriate sense) to be the most sensitive possible test for
    # rejecting the hypothesis that a vector of probabilities is uniformly
    # distributed.  Gary Robinson's original scheme was monotonic *with*
    # this test, but skipped the details.  Turns out that getting closer
    # to the theoretical roots gives a much sharper classification, with
    # a very small (in # of msgs), but also very broad (in range of scores),
    # "middle ground", where most of the mistakes live.  In particular,
    # this scheme seems immune to all forms of "cancellation disease":  if
    # there are many strong ham *and* spam clues, this reliably scores
    # close to 0.5.  Most other schemes are extremely certain then -- and
    # often wrong.
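    # (For even degrees of freedom the chi-squared survival function has
    # a simple closed form, presumably what chi2.py's chi2Q computes:
    #     chi2Q(x, 2n) = exp(-x/2) * sum_{i < n} (x/2)**i / i!
    # so no general incomplete-gamma machinery is needed.)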
    def chi2_spamprob(self, wordstream, evidence=False):
        """Return best-guess probability that wordstream is spam.

        wordstream is an iterable object producing words.
        The return value is a float in [0.0, 1.0].

        If optional arg evidence is True, the return value is a pair
            probability, evidence
        where evidence is a list of (word, probability) pairs.
        """

        from math import frexp, log as ln

        # We compute two chi-squared statistics, one for ham and one for
        # spam.  The sum-of-the-logs business is more sensitive to probs
        # near 0 than to probs near 1, so the spam measure uses 1-p (so
        # that high-spamprob words have greatest effect), and the ham
        # measure uses p directly (so that lo-spamprob words have greatest
        # effect).
        #
        # For optimization, sum-of-logs == log-of-product, and f.p.
        # multiplication is a lot cheaper than calling ln().  It's easy
        # to underflow to 0.0, though, so we simulate unbounded dynamic
        # range via frexp.  The real product H = this H * 2**Hexp, and
        # likewise the real product S = this S * 2**Sexp.
        H = S = 1.0
        Hexp = Sexp = 0

        clues = self._getclues(wordstream)
        for prob, word, record in clues:
            if record is not None:  # else wordinfo doesn't know about it
                record.killcount += 1
            S *= 1.0 - prob
            H *= prob
            if S < 1e-200:  # prevent underflow
                S, e = frexp(S)
                Sexp += e
            if H < 1e-200:  # prevent underflow
                H, e = frexp(H)
                Hexp += e

        # Compute the natural log of the product = sum of the logs:
        # ln(x * 2**i) = ln(x) + i * ln(2).
        S = ln(S) + Sexp * LN2
        H = ln(H) + Hexp * LN2

        n = len(clues)
        if n:
            S = 1.0 - chi2Q(-2.0 * S, 2*n)
            H = 1.0 - chi2Q(-2.0 * H, 2*n)

            # How to combine these into a single spam score?  We originally
            # used (S-H)/(S+H) scaled into [0., 1.], which equals S/(S+H).  A
            # systematic problem is that we could end up being near-certain
            # a thing was (for example) spam, even if S was small, provided
            # that H was much smaller.
            # Rob Hooft stared at these problems and invented the measure
            # we use now, the simpler S-H, scaled into [0., 1.].
            prob = (S-H + 1.0) / 2.0
        else:
            prob = 0.5

        if evidence:
            clues = [(w, p) for p, w, r in clues]
            clues.sort(lambda a, b: cmp(a[1], b[1]))
            clues.insert(0, ('*S*', S))
            clues.insert(0, ('*H*', H))
            return prob, clues
        else:
            return prob

    if options.use_chi_squared_combining:
        spamprob = chi2_spamprob

    def learn(self, wordstream, is_spam, update_probabilities=True):
        """Teach the classifier by example.

        wordstream is a word stream representing a message.  If is_spam is
        True, you're telling the classifier this message is definitely spam,
        else that it's definitely not spam.

        If optional arg update_probabilities is False (the default is True),
        don't update word probabilities.  Updating them is expensive, and if
        you're going to pass many messages to learn(), it's more efficient
        to pass False here and call update_probabilities() once when you're
        done -- or to call learn() with update_probabilities=True when
        passing the last new example.  The important thing is that the
        probabilities get updated before calling spamprob() again.
        """

        self._add_msg(wordstream, is_spam)
        if update_probabilities:
            self.update_probabilities()

    def unlearn(self, wordstream, is_spam, update_probabilities=True):
        """In case of pilot error, call unlearn ASAP after screwing up.

        Pass the same arguments you passed to learn().
        """

        self._remove_msg(wordstream, is_spam)
        if update_probabilities:
            self.update_probabilities()

    def update_probabilities(self):
        """Update the word probabilities in the spam database.

        This computes a new probability for every word in the database,
        so can be expensive.  learn() and unlearn() update the probabilities
        each time by default.  They have an optional argument that lets
        you skip this step when feeding in many messages, and in that case
        you should call update_probabilities() after feeding the last
        message and before calling spamprob().
        """

        nham = float(self.nham or 1)
        nspam = float(self.nspam or 1)

        S = options.unknown_word_strength
        StimesX = S * options.unknown_word_prob

        for word, record in self.wordinfo.iteritems():
            # Compute prob(msg is spam | msg contains word).
            # This is the Graham calculation, but stripped of biases, and
            # stripped of clamping into 0.01 thru 0.99.  The Bayesian
            # adjustment following keeps them in a sane range, and one
            # that naturally grows the more evidence there is to back up
            # a probability.
            hamcount = record.hamcount
            assert hamcount <= nham
            hamratio = hamcount / nham

            spamcount = record.spamcount
            assert spamcount <= nspam
            spamratio = spamcount / nspam

            prob = spamratio / (hamratio + spamratio)

            # Now do Robinson's Bayesian adjustment.
            #
            #         s*x + n*p(w)
            # f(w) = --------------
            #           s + n
            #
            # I find this easier to reason about like so (equivalent when
            # s != 0):
            #
            #        x - p
            #  p +  -------
            #       1 + n/s
            #
            # IOW, it moves p a fraction of the distance from p to x, and
            # less so the larger n is, or the smaller s is.
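            #
            # Worked example (assuming the defaults s = 0.45 and x = 0.5):
            # a word seen in exactly one spam and no ham has raw prob 1.0
            # and n = 1, so
            #
            #     f(w) = (0.45*0.5 + 1*1.0) / (0.45 + 1) ~= 0.845
            #
            # strongly spammy, but not absurdly certain on one example.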

            n = hamcount + spamcount
            prob = (StimesX + n * prob) / (S + n)

            if record.spamprob != prob:
                record.spamprob = prob
                # The next seemingly pointless line appears to be a hack
                # to allow a persistent db to realize the record has changed.
                self.wordinfo[word] = record

    def clearjunk(self, oldesttime):
        """Forget useless wordinfo records.  This can shrink the database size.

        A record for a word will be retained only if the word was accessed
        at or after oldesttime.
        """

        wordinfo = self.wordinfo
        tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
        for w in tonuke:
            del wordinfo[w]

    # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
    # n>1 times in a single message, training added n to the word's hamcount
    # or spamcount, but predicting scored words only once.  Tests showed
    # that adding only 1 in training, or scoring more than once when
    # predicting, hurt under the Graham scheme.
    # This isn't so under Robinson's scheme, though:  results improve
    # if training also counts a word only once.  The mean ham score decreases
    # significantly and consistently, ham score variance decreases likewise,
    # mean spam score decreases (but less than mean ham score, so the spread
    # increases), and spam score variance increases.
    # I (Tim) speculate that adding n times under the Graham scheme helped
    # because it acted against the various ham biases, giving frequently
    # repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
    # adding only once in training, a word like that was simply ignored until
    # it appeared in 5 distinct training hams.  Without the ham-favoring
    # biases, though, and never ignoring words, counting n times introduces
    # a subtle and unhelpful bias.
    # There does appear to be some useful info in how many times a word
    # appears in a msg, but distorting spamprob doesn't appear a correct way
    # to exploit it.
    def _add_msg(self, wordstream, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

        wordinfo = self.wordinfo
        wordinfoget = wordinfo.get
        now = time.time()
        for word in self._get_all_tokens(wordstream):
            record = wordinfoget(word)
            if record is None:
                record = self.WordInfoClass(now)

            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1
            # Needed to tell a persistent DB that the content changed.
            wordinfo[word] = record

    def _remove_msg(self, wordstream, is_spam):
        if is_spam:
            if self.nspam <= 0:
                raise ValueError("spam count would go negative!")
            self.nspam -= 1
        else:
            if self.nham <= 0:
                raise ValueError("non-spam count would go negative!")
            self.nham -= 1

        wordinfo = self.wordinfo
        wordinfoget = wordinfo.get
        for word in self._get_all_tokens(wordstream):
            record = wordinfoget(word)
            if record is not None:
                if is_spam:
                    if record.spamcount > 0:
                        record.spamcount -= 1
                else:
                    if record.hamcount > 0:
                        record.hamcount -= 1
                if record.hamcount == 0 == record.spamcount:
                    del wordinfo[word]
                else:
                    # Needed to tell a persistent DB that the content changed.
                    wordinfo[word] = record

    def _getclues(self, wordstream):
        mindist = options.minimum_prob_strength
        unknown = options.unknown_word_prob

        rawclues = []
        pushclue = rawclues.append

        wordinfoget = self.wordinfo.get
        now = time.time()
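        # Each candidate clue records the token positions it covers as a
        # half-open range [pos, endpos):  a unigram covers one position,
        # a bigram two.  The selection pass below uses these ranges to
        # keep only non-overlapping clues, so a token can't be scored
        # both on its own and again as half of a bigram.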
        w1 = 'BOM'
        pos = 0
        for w2 in self._wrap_wordstream(wordstream):
            pos += 1
            first2 = w1 + " " + w2
            endpos = pos
            for word in w1, first2:
                endpos += 1
                record = wordinfoget(word)
                if record is None:
                    prob = unknown
                else:
                    record.atime = now
                    prob = record.spamprob
                distance = abs(prob - 0.5)
                if distance >= mindist:
                    pushclue((-distance, prob, word, record, pos, endpos))
            w1 = w2

        rawclues.sort()
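        # rawclues is now strongest-clue-first (the leading tuple element
        # is -distance).  Greedily accept clues, skipping any word seen
        # before and any clue whose positions overlap an already-accepted
        # clue, until max_discriminators clues are in hand.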
        clues = []
        pushclue = clues.append
        atmost = options.max_discriminators
        wordseen = {}
        posseen = [0] * (pos + 4)
        for junk, prob, word, record, pos, endpos in rawclues:
            if word in wordseen:
                continue
            skip = 0
            for i in range(pos, endpos):
                if posseen[i]:
                    skip = 1
                    break
            if skip:
                continue
            pushclue((prob, word, record))
            wordseen[word] = 1
            for i in range(pos, endpos):
                posseen[i] = 1
            if len(clues) >= atmost:
                break
        # Return (prob, word, record).
        return clues

    def _wrap_wordstream(self, wordstream):
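        # 'BOM' and 'EOM' are sentinel tokens bracketing the message:
        # callers seed w1 with 'BOM' and this wrapper appends 'EOM', so
        # the first and last real tokens also participate in bigrams.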
        for w in wordstream:
            yield w
        yield "EOM"

    def _get_all_tokens(self, wordstream):
        seen = {}
        w1 = 'BOM'
        for w2 in self._wrap_wordstream(wordstream):
            first2 = w1 + " " + w2
            for word in w1, first2:
                if word not in seen:
                    seen[word] = 1
                    yield word
            w1 = w2
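
def _demo():
    # A hypothetical demo, not part of the original module:  train on two
    # toy "messages" and score a third.  str.split stands in for the real
    # spambayes tokenizer, and chi2_spamprob is called directly to avoid
    # depending on which spamprob alias the options selected.
    b = Bayes()
    b.learn("cheap pills online".split(), True,
            update_probabilities=False)
    b.learn("meeting notes attached".split(), False,
            update_probabilities=False)
    b.update_probabilities()
    print b.chi2_spamprob("cheap meeting pills".split())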

