[Spambayes] Seeking a giant idle machine w/ a miserable corpus
Tim Peters
tim.one@comcast.net
Sun Nov 17 05:19:55 2002
[Tim]
> ...
> The "missing test" here is exact bigrams (no hash convolutions). I'll
> try that later; may not have enough RAM for that, but should.
I didn't, but by cutting 6,000 ham out of my test data I managed to complete
the test in < 3 hours with lots of disk thrashing. cv is the baseline, bi21 the
bigram gimmick w/ 2**21 hash buckets, and bix the exact bigram run:
filename:          cv         bi21          bix
ham:spam: 20000:14000  20000:14000  14000:14000
fp total:           3            0            0
fp %:            0.01         0.00         0.00
fn total:           0            0            0
fn %:            0.00         0.00         0.00
unsure t:          91          300           98
unsure %:        0.27         0.88         0.35
real cost:     $48.20       $60.00       $19.60
best cost:     $17.80       $23.20        $6.80
h mean:          0.24         0.25         0.25
h sdev:          2.73         2.61         2.82
s mean:         99.95        99.41        99.91
s sdev:          1.40         4.67         1.94
mean diff:      99.71        99.16        99.66
k:              24.14        13.62        20.94
There are simply no surprises in the bix output; under the covers it's
beautiful:
-> <stat> Ham scores for all runs: 14000 items; mean 0.25; sdev 2.82
-> <stat> min 0; median 3.88578e-014; max 69.8223
-> <stat> percentiles: 5% 0; 25% 0; 75% 7.29026e-009; 95% 0.00817516
3 ham scored between 50 and 50.5; 1 ham scored 69.8; all other ham scored
below 50.
-> <stat> Spam scores for all runs: 14000 items; mean 99.91; sdev 1.94
-> <stat> min 30.4227; median 100; max 100
-> <stat> percentiles: 5% 100; 25% 100; 75% 100; 95% 100
2 spam scored between 49.5 and 50.5; 1 spam scored 30.4; all other spam
scored above 50.
The best-cost cutoffs reflect this sharp separation:
-> best cost for all runs: $6.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 9 cutoff pairs
-> smallest ham & spam cutoffs 0.495 & 0.7
-> fp 0; fn 1; unsure ham 10; unsure spam 19
-> fp rate 0%; fn rate 0.00714%; unsure rate 0.104%
-> largest ham & spam cutoffs 0.495 & 0.74
-> fp 0; fn 1; unsure ham 10; unsure spam 19
-> fp rate 0%; fn rate 0.00714%; unsure rate 0.104%
The highest-scoring ham is the only reason for the high suggested spam
cutoff (else .51 would have worked fine and eliminated almost all the
unsures), and was the 2nd-worst in the baseline test: poor Vickie Mills
suffering from her obnoxious employer-generated sig. Bigrams helped her,
but it's unclear why(!). Her three strongest ham clues:
prob('noheader:return-path noheader:abuse-reports-to') = 0.000364471
prob('subject:Python') = 0.00116829
prob('header:From:1 header:MIME-Version:1') = 0.00129855
Now the order in which noheader metatokens get generated is an accident
inherited from the order in which a Set happens to enumerate its elements.
What I fear here is that I've stumbled into a too-good systematic difference
between c.l.py headers and BruceG's spam headers: not in the *set* of
header lines they contain (I'm ignoring all headers that might matter), but
something much subtler having to do with how the set of all header lines in
existence may affect dict iteration order for the few headers I actually
look at. Bigrams involving header lines are striking in the test output;
e.g., this one is a strong spam clue, and is almost as mysterious:
prob('header:Subject:1 noheader:errors-to') = 0.96321
If we pursue this approach, this will take much thought.
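To make the worry concrete, here's a small sketch (modern Python; the 2002
code used sets.Set, whose enumeration order was arbitrary) of how the order
in which the missing-header container yields its elements determines which
noheader bigrams even exist. The header names here are just illustrative:

```python
def noheader_bigrams(missing_headers):
    """Pair up adjacent noheader metatokens in whatever order the
    container happens to yield them -- order, not content, decides
    which bigrams get generated."""
    toks = ['noheader:%s' % h for h in missing_headers]
    return [a + ' ' + b for a, b in zip(toks, toks[1:])]

# Same *set* of missing headers, two enumeration orders, different bigrams:
print(noheader_bigrams(['return-path', 'abuse-reports-to', 'x-face']))
print(noheader_bigrams(['x-face', 'return-path', 'abuse-reports-to']))
```

A systematic difference in enumeration order between the ham and spam
corpora would then show up as spuriously strong bigram clues.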
The lowest-scoring spam was the one with a uuencoded text body we throw away
unlooked-at. A header bigram boosted its spam score in a clearly helpful
way:
prob('from:addr:hotmail.com from:no real name:2**0') = 0.985839
Indeed, something sent from hotmail without a real name reeks to heaven of
spam, much more so than hotmail alone or no-real-name alone.
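The arithmetic behind that: under the biasless Graham ratio plus Robinson's
adjustment (as in update_probabilities() in the attached code), a bigram seen
almost exclusively in spam earns a far more extreme spamprob than either
moderately spammy unigram. The counts below are made up purely for
illustration; s=0.45 and x=0.5 are the usual option defaults:

```python
def spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Biasless Graham ratio followed by Robinson's s/x adjustment,
    mirroring Bayes.update_probabilities()."""
    spamratio = spamcount / float(nspam)
    hamratio = hamcount / float(nham)
    p = spamratio / (hamratio + spamratio)
    n = spamcount + hamcount
    return (s * x + n * p) / (s + n)

# Hypothetical counts out of 14000 spam and 14000 ham:
print(spamprob(2000, 600, 14000, 14000))   # hotmail alone: ~0.77
print(spamprob(3000, 1100, 14000, 14000))  # no-real-name alone: ~0.73
print(spamprob(1500, 20, 14000, 14000))    # the bigram: ~0.99
```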
OTOH, check out the 6 strongest ham clues for the same spam:
prob('header:X-Complaints-To:1') = 4.00324e-005
prob('header:From:1 header:Date:1') = 0.000279524
prob('header:Subject:1 noheader:received') = 0.000382997
prob('noheader:x-face noheader:return-path') = 0.00045433
prob('header:Organization:1 header:Message-ID:1') = 0.00137332
prob('noheader:in-reply-to noheader:reply-to') = 0.0110694
The X-Complaints-To header has always helped this guy a lot, but why the 5
new header-bigram combos here are hammish remains a mystery.
The memory burden of this run is also a mystery. I played with mixing
unigrams and bigrams before, and recall the c.l.py test topping out at about
120MB. This run was over 256MB (hence the massive swapping on my 256MB
box), and it wasn't even a full run. A difference is that, in my previous
runs, the *tokenizer* generated the unigrams and bigrams, and only for the
body. In this run the classifier generated them, and header tokens got into
the mix too. I suppose header bigrams are large (as strings), and that
there are a lot of them -- heck, for lots of msgs, even just the text of the
headers I look at here is larger than the msg bodies.
So while this scheme may have real promise, mysteries and practical problems
abound. I'm out of time for looking at this. If someone wants to pursue
it, I'll attach the classifier I used. Based on everything I've done here,
two suggestions:
1. Every time we've tried them, hash schemes have been unsatisfying,
due to the human-incomprehensible mistakes they make. You're
unlikely to see mistakes on a small run, though -- this is a
percentage game that *eventually* loses big.
2. I've got some evidence to believe that exact bigrams may help, but
saw nothing in the exact trigram runs to suggest they buy
anything over that. Trigrams helped on-topic c.l.py conference
announcements, but they also hurt them, and *that* class of
problem msg is already solidly Unsure under the unigram scheme.
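For anyone picking this up, the mixed unigram+bigram token stream the
attached classifier generates can be summarized in a few lines (modern
Python; _get_all_tokens() in the attachment is the 2.2-era original):

```python
def all_tokens(words):
    """Yield each unigram once, plus each adjacent-pair bigram once,
    bracketing the stream with BOM/EOM sentinels -- the same token
    stream the attached classifier's _get_all_tokens() produces.
    Note the EOM sentinel appears only inside the final bigram, never
    as a unigram of its own."""
    seen = set()
    w1 = 'BOM'
    for w2 in list(words) + ['EOM']:
        for tok in (w1, w1 + ' ' + w2):
            if tok not in seen:
                seen.add(tok)
                yield tok
        w1 = w2

print(list(all_tokens(['free', 'viagra'])))
```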
-------------- next part --------------
# An implementation of a Bayes-like spam classifier.
#
# Paul Graham's original description:
#
# http://www.paulgraham.com/spam.html
#
# A highly fiddled version of that can be retrieved from our CVS repository,
# via tag Last-Graham. This made many demonstrated improvements in error
# rates over Paul's original description.
#
# This code implements Gary Robinson's suggestions, the core of which are
# well explained on his webpage:
#
# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
#
# This is theoretically cleaner, and in testing has performed at least as
# well as our highly tuned Graham scheme did, often slightly better, and
# sometimes much better. It also has "a middle ground", which people like:
# the scores under Paul's scheme were almost always very near 0 or very near
# 1, whether or not the classification was correct. The false positives
# and false negatives under Gary's basic scheme (use_gary_combining) generally
# score in a narrow range around the corpus's best spam_cutoff value.
# However, it doesn't appear possible to guess the best spam_cutoff value in
# advance, and it's touchy.
#
# The chi-combining scheme used by default here gets closer to the theoretical
# basis of Gary's combining scheme, and does give extreme scores, but also
# has a very useful middle ground (small # of msgs spread across a large range
# of scores, and good cutoff values aren't touchy).
#
# This implementation is due to Tim Peters et alia.
from __future__ import generators
import math
import time
from sets import Set
from Options import options
from chi2 import chi2Q
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
LN2 = math.log(2) # used frequently by chi-combining
PICKLE_VERSION = 1
class WordInfo(object):
__slots__ = ('atime', # when this record was last used by scoring(*)
'spamcount', # # of spams in which this word appears
'hamcount', # # of hams in which this word appears
'killcount', # # of times this made it to spamprob()'s nbest
'spamprob', # prob(spam | msg contains this word)
)
# Invariant: For use in a classifier database, at least one of
# spamcount and hamcount must be non-zero.
#
# (*)atime is the last access time, a UTC time.time() value. It's the
# most recent time this word was used by scoring (i.e., by spamprob(),
# not by training via learn()); or, if the word has never been used by
# scoring, the time the word record was created (i.e., by learn()).
# One good criterion for identifying junk (word records that have no
# value) is to delete words that haven't been used for a long time.
# Perhaps they were typos, or unique identifiers, or relevant to a
# once-hot topic or scam that's fallen out of favor. Whatever, if
# a word is no longer being used, it's just wasting space.
def __init__(self, atime, spamprob=options.unknown_word_prob):
self.atime = atime
self.spamcount = self.hamcount = self.killcount = 0
self.spamprob = spamprob
def __repr__(self):
return "WordInfo%r" % repr((self.atime, self.spamcount,
self.hamcount, self.killcount,
self.spamprob))
def __getstate__(self):
return (self.atime, self.spamcount, self.hamcount, self.killcount,
self.spamprob)
def __setstate__(self, t):
(self.atime, self.spamcount, self.hamcount, self.killcount,
self.spamprob) = t
class Bayes:
# Defining __slots__ here made Jeremy's life needlessly difficult when
# trying to hook this all up to ZODB as a persistent object. There's
# no space benefit worth getting from slots in this class; slots were
# used solely to help catch errors earlier, when this code was changing
# rapidly.
#__slots__ = ('wordinfo', # map word to WordInfo record
# 'nspam', # number of spam messages learn() has seen
# 'nham', # number of non-spam messages learn() has seen
# )
# allow a subclass to use a different class for WordInfo
WordInfoClass = WordInfo
def __init__(self):
self.wordinfo = {}
self.nspam = self.nham = 0
def __getstate__(self):
return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham
def __setstate__(self, t):
if t[0] != PICKLE_VERSION:
raise ValueError("Can't unpickle -- version %s unknown" % t[0])
self.wordinfo, self.nspam, self.nham = t[1:]
# spamprob() implementations. One of the following is aliased to
# spamprob, depending on option settings.
def gary_spamprob(self, wordstream, evidence=False):
"""Return best-guess probability that wordstream is spam.
wordstream is an iterable object producing words.
The return value is a float in [0.0, 1.0].
If optional arg evidence is True, the return value is a pair
probability, evidence
where evidence is a list of (word, probability) pairs.
"""
from math import frexp
# This combination method is due to Gary Robinson; see
# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
# The real P = this P times 2**Pexp. Likewise for Q. We're
# simulating unbounded dynamic float range by hand. If this pans
# out, *maybe* we should store logarithms in the database instead
# and just add them here. But I like keeping raw counts in the
# database (they're easy to understand, manipulate and combine),
# and there's no evidence that this simulation is a significant
# expense.
P = Q = 1.0
Pexp = Qexp = 0
clues = self._getclues(wordstream)
for prob, word, record in clues:
if record is not None: # else wordinfo doesn't know about it
record.killcount += 1
P *= 1.0 - prob
Q *= prob
if P < 1e-200: # move back into range
P, e = frexp(P)
Pexp += e
if Q < 1e-200: # move back into range
Q, e = frexp(Q)
Qexp += e
P, e = frexp(P)
Pexp += e
Q, e = frexp(Q)
Qexp += e
num_clues = len(clues)
if num_clues:
#P = 1.0 - P**(1./num_clues)
#Q = 1.0 - Q**(1./num_clues)
#
# (x*2**e)**n = x**n * 2**(e*n)
n = 1.0 / num_clues
P = 1.0 - P**n * 2.0**(Pexp * n)
Q = 1.0 - Q**n * 2.0**(Qexp * n)
# (P-Q)/(P+Q) is in -1 .. 1; scaling into 0 .. 1 gives
# ((P-Q)/(P+Q) + 1)/2 = (P-Q+P+Q)/(P+Q)/2 = 2*P/(P+Q)/2 = P/(P+Q)
prob = P/(P+Q)
else:
prob = 0.5
if evidence:
clues = [(w, p) for p, w, r in clues]
clues.sort(lambda a, b: cmp(a[1], b[1]))
return prob, clues
else:
return prob
if options.use_gary_combining:
spamprob = gary_spamprob
# Across vectors of length n, containing random uniformly-distributed
# probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution
# with 2*n degrees of freedom. This has been proven (in some
# appropriate sense) to be the most sensitive possible test for
# rejecting the hypothesis that a vector of probabilities is uniformly
# distributed. Gary Robinson's original scheme was monotonic *with*
# this test, but skipped the details. Turns out that getting closer
# to the theoretical roots gives a much sharper classification, with
# a very small (in # of msgs), but also very broad (in range of scores),
# "middle ground", where most of the mistakes live. In particular,
# this scheme seems immune to all forms of "cancellation disease": if
# there are many strong ham *and* spam clues, this reliably scores
# close to 0.5. Most other schemes are extremely certain then -- and
# often wrong.
def chi2_spamprob(self, wordstream, evidence=False):
"""Return best-guess probability that wordstream is spam.
wordstream is an iterable object producing words.
The return value is a float in [0.0, 1.0].
If optional arg evidence is True, the return value is a pair
probability, evidence
where evidence is a list of (word, probability) pairs.
"""
from math import frexp, log as ln
# We compute two chi-squared statistics, one for ham and one for
# spam. The sum-of-the-logs business is more sensitive to probs
# near 0 than to probs near 1, so the spam measure uses 1-p (so
# that high-spamprob words have greatest effect), and the ham
# measure uses p directly (so that lo-spamprob words have greatest
# effect).
#
# For optimization, sum-of-logs == log-of-product, and f.p.
# multiplication is a lot cheaper than calling ln(). It's easy
# to underflow to 0.0, though, so we simulate unbounded dynamic
# range via frexp. The real product H = this H * 2**Hexp, and
# likewise the real product S = this S * 2**Sexp.
H = S = 1.0
Hexp = Sexp = 0
clues = self._getclues(wordstream)
for prob, word, record in clues:
if record is not None: # else wordinfo doesn't know about it
record.killcount += 1
S *= 1.0 - prob
H *= prob
if S < 1e-200: # prevent underflow
S, e = frexp(S)
Sexp += e
if H < 1e-200: # prevent underflow
H, e = frexp(H)
Hexp += e
# Compute the natural log of the product = sum of the logs:
# ln(x * 2**i) = ln(x) + i * ln(2).
S = ln(S) + Sexp * LN2
H = ln(H) + Hexp * LN2
n = len(clues)
if n:
S = 1.0 - chi2Q(-2.0 * S, 2*n)
H = 1.0 - chi2Q(-2.0 * H, 2*n)
# How to combine these into a single spam score? We originally
# used (S-H)/(S+H) scaled into [0., 1.], which equals S/(S+H). A
# systematic problem is that we could end up being near-certain
# a thing was (for example) spam, even if S was small, provided
# that H was much smaller.
# Rob Hooft stared at these problems and invented the measure
# we use now, the simpler S-H, scaled into [0., 1.].
prob = (S-H + 1.0) / 2.0
else:
prob = 0.5
if evidence:
clues = [(w, p) for p, w, r in clues]
clues.sort(lambda a, b: cmp(a[1], b[1]))
clues.insert(0, ('*S*', S))
clues.insert(0, ('*H*', H))
return prob, clues
else:
return prob
if options.use_chi_squared_combining:
spamprob = chi2_spamprob
def learn(self, wordstream, is_spam, update_probabilities=True):
"""Teach the classifier by example.
wordstream is a word stream representing a message. If is_spam is
True, you're telling the classifier this message is definitely spam,
else that it's definitely not spam.
If optional arg update_probabilities is False (the default is True),
don't update word probabilities. Updating them is expensive, and if
you're going to pass many messages to learn(), it's more efficient
to pass False here and call update_probabilities() once when you're
done -- or to call learn() with update_probabilities=True when
passing the last new example. The important thing is that the
probabilities get updated before calling spamprob() again.
"""
self._add_msg(wordstream, is_spam)
if update_probabilities:
self.update_probabilities()
def unlearn(self, wordstream, is_spam, update_probabilities=True):
"""In case of pilot error, call unlearn ASAP after screwing up.
Pass the same arguments you passed to learn().
"""
self._remove_msg(wordstream, is_spam)
if update_probabilities:
self.update_probabilities()
def update_probabilities(self):
"""Update the word probabilities in the spam database.
This computes a new probability for every word in the database,
so can be expensive. learn() and unlearn() update the probabilities
each time by default. They have an optional argument that allows you
to skip this step when feeding in many messages; in that case
you should call update_probabilities() after feeding the last
message and before calling spamprob().
"""
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
S = options.unknown_word_strength
StimesX = S * options.unknown_word_prob
for word, record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
# This is the Graham calculation, but stripped of biases, and
# stripped of clamping into 0.01 thru 0.99. The Bayesian
# adjustment following keeps them in a sane range, and one
# that naturally grows the more evidence there is to back up
# a probability.
hamcount = record.hamcount
assert hamcount <= nham
hamratio = hamcount / nham
spamcount = record.spamcount
assert spamcount <= nspam
spamratio = spamcount / nspam
prob = spamratio / (hamratio + spamratio)
# Now do Robinson's Bayesian adjustment.
#
# s*x + n*p(w)
# f(w) = --------------
# s + n
#
# I find this easier to reason about like so (equivalent when
# s != 0):
#
# x - p
# p + -------
# 1 + n/s
#
# IOW, it moves p a fraction of the distance from p to x, and
# less so the larger n is, or the smaller s is.
n = hamcount + spamcount
prob = (StimesX + n * prob) / (S + n)
if record.spamprob != prob:
record.spamprob = prob
# The next seemingly pointless line appears to be a hack
# to allow a persistent db to realize the record has changed.
self.wordinfo[word] = record
def clearjunk(self, oldesttime):
"""Forget useless wordinfo records. This can shrink the database size.
A record for a word will be retained only if the word was accessed
at or after oldesttime.
"""
wordinfo = self.wordinfo
tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
for w in tonuke:
del wordinfo[w]
# NOTE: Graham's scheme had a strange asymmetry: when a word appeared
# n>1 times in a single message, training added n to the word's hamcount
# or spamcount, but predicting scored words only once. Tests showed
# that adding only 1 in training, or scoring more than once when
# predicting, hurt under the Graham scheme.
# This isn't so under Robinson's scheme, though: results improve
# if training also counts a word only once. The mean ham score decreases
# significantly and consistently, ham score variance decreases likewise,
# mean spam score decreases (but less than mean ham score, so the spread
# increases), and spam score variance increases.
# I (Tim) speculate that adding n times under the Graham scheme helped
# because it acted against the various ham biases, giving frequently
# repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
# adding only once in training, a word like that was simply ignored until
# it appeared in 5 distinct training hams. Without the ham-favoring
# biases, though, and never ignoring words, counting n times introduces
# a subtle and unhelpful bias.
# There does appear to be some useful info in how many times a word
# appears in a msg, but distorting spamprob doesn't appear a correct way
# to exploit it.
def _add_msg(self, wordstream, is_spam):
if is_spam:
self.nspam += 1
else:
self.nham += 1
wordinfo = self.wordinfo
wordinfoget = wordinfo.get
now = time.time()
for word in self._get_all_tokens(wordstream):
record = wordinfoget(word)
if record is None:
record = self.WordInfoClass(now)
if is_spam:
record.spamcount += 1
else:
record.hamcount += 1
# Needed to tell a persistent DB that the content changed.
wordinfo[word] = record
def _remove_msg(self, wordstream, is_spam):
if is_spam:
if self.nspam <= 0:
raise ValueError("spam count would go negative!")
self.nspam -= 1
else:
if self.nham <= 0:
raise ValueError("non-spam count would go negative!")
self.nham -= 1
wordinfo = self.wordinfo
wordinfoget = wordinfo.get
for word in self._get_all_tokens(wordstream):
record = wordinfoget(word)
if record is not None:
if is_spam:
if record.spamcount > 0:
record.spamcount -= 1
else:
if record.hamcount > 0:
record.hamcount -= 1
if record.hamcount == 0 == record.spamcount:
del wordinfo[word]
else:
# Needed to tell a persistent DB that the content changed.
wordinfo[word] = record
def _getclues(self, wordstream):
mindist = options.minimum_prob_strength
unknown = options.unknown_word_prob
rawclues = []
pushclue = rawclues.append
wordinfoget = self.wordinfo.get
now = time.time()
w1 = 'BOM'
pos = 0
for w2 in self._wrap_wordstream(wordstream):
pos += 1
first2 = w1 + " " + w2
endpos = pos
for word in w1, first2:
endpos += 1
record = wordinfoget(word)
if record is None:
prob = unknown
else:
record.atime = now
prob = record.spamprob
distance = abs(prob - 0.5)
if distance >= mindist:
pushclue((-distance, prob, word, record, pos, endpos))
w1 = w2
rawclues.sort()
clues = []
pushclue = clues.append
atmost = options.max_discriminators
wordseen = {}
posseen = [0] * (pos + 4)
for junk, prob, word, record, pos, endpos in rawclues:
if word in wordseen:
continue
skip = 0
for i in range(pos, endpos):
if posseen[i]:
skip = 1
break
if skip:
continue
pushclue((prob, word, record))
wordseen[word] = 1
for i in range(pos, endpos):
posseen[i] = 1
if len(clues) >= atmost:
break
# Return (prob, word, record).
return clues
def _wrap_wordstream(self, wordstream):
for w in wordstream:
yield w
yield "EOM"
def _get_all_tokens(self, wordstream):
seen = {}
w1 = 'BOM'
for w2 in self._wrap_wordstream(wordstream):
first2 = w1 + " " + w2
for word in w1, first2:
if word not in seen:
seen[word] = 1
yield word
w1 = w2