From tim_one@users.sourceforge.net Thu Sep 5 21:17:34 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 13:17:34 -0700
Subject: [Spambayes-checkins] spambayes README.txt,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25267
Added Files:
README.txt
Log Message:
Some sorely needed clues.
--- NEW FILE: README.txt ---
Assorted clues.
What's Here?
============
Lots of mondo cool undocumented code. What else could there be?
The focus of this project so far has not been to produce the fastest or
smallest filters, but to set up a flexible pure-Python implementation
for doing algorithm research. Lots of people are making fast/small
implementations, and it takes an entirely different kind of effort to
make genuine algorithm improvements. I think we've done quite well at
that so far. The focus of this codebase may change to small/fast
later -- as is, the false positive rate has gotten too small to measure
reliably across test sets with 4000 hams + 2750 spams, but the false
negative rate is still over 1%.
Primary Files
=============
classifier.py
    An implementation of a Graham-like classifier.

Tester.py
    A test-driver class that feeds streams of msgs to a classifier
    instance, and keeps track of right/wrong percentages, and lists
    of false positives and false negatives.

timtest.py
    A concrete test driver and tokenizer that uses Tester and
    classifier (above). This assumes "a standard" test data setup
    (see below). Could stand massive refactoring.

GBayes.py
    A number of tokenizers and a partial test driver. This assumes
    an mbox format. Could stand massive refactoring. I don't think
    it's been kept up to date.
Test Data Utilities
===================
rebal.py
    Evens out the number of messages in "standard" test data folders
    (see below).

cleanarch
    A script to repair mbox archives by finding "From" lines that
    should have been escaped, and escaping them.

mboxcount.py
    Counts the number of messages (both parseable and unparseable)
    in mbox archives.

split.py
splitn.py
    Split an mbox into random pieces in various ways. Tim recommends
    using "the standard" test data setup instead (see below).
Standard Test Data Setup
========================
Barry gave me mboxes, but the spam corpus I got off the web had one spam
per file, and it only took two days of extreme pain to realize that one msg
per file is enormously easier to work with when testing: you want to split
these at random into random collections, you may need to replace some at
random when testing reveals spam mistakenly called ham (and vice versa),
etc -- even pasting examples into email is much easier when it's one msg
per file (and the test driver makes it easy to print a msg's file path).
The directory structure under my spambayes directory looks like so:
Data/
    Spam/
        Set1/ (contains 2750 spam .txt files)
        Set2/ ""
        Set3/ ""
        Set4/ ""
        Set5/ ""
    Ham/
        Set1/ (contains 4000 ham .txt files)
        Set2/ ""
        Set3/ ""
        Set4/ ""
        Set5/ ""
        reservoir/ (contains "backup ham")
If you use the same names and structure, huge mounds of the tedious testing
code will work as-is. The more Set directories the merrier, although
you'll hit a point of diminishing returns if you exceed 10. The "reservoir"
directory contains a few thousand other random hams. When a ham is found
that's really spam, I delete it, and then the rebal.py utility moves in a
message at random from the reservoir to replace it. If I had it to do over
again, I think I'd move such spam into a Spam set (chosen at random),
instead of deleting it.
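The replacement step could be sketched like this (a simplified stand-in for
what rebal.py does; the function name and default path are made up for
illustration):

```python
import os
import random
import shutil

def replace_from_reservoir(ham_set_dir, reservoir_dir="Data/Ham/reservoir"):
    """Move one randomly chosen reservoir message into ham_set_dir,
    replacing a ham that was deleted because it was really spam."""
    candidates = os.listdir(reservoir_dir)
    if not candidates:
        raise RuntimeError("reservoir is empty")
    pick = random.choice(candidates)
    shutil.move(os.path.join(reservoir_dir, pick),
                os.path.join(ham_set_dir, pick))
    return pick
```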
The hams are 20,000 msgs selected at random from a python-list archive.
The spams are essentially all of Bruce Guenter's 2002 spam archive:
The sets are grouped into 5 pairs in the obvious way: Spam/Set1 with
Ham/Set1, and so on. For each such pair, timtest trains a classifier on
that pair, then runs predictions on each of the other 4 pairs. In effect,
it's a 5x5 test grid, skipping the diagonal. There's no particular reason
to avoid predicting against the same set trained on, except that it
takes more time and seems the least interesting thing to try.
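The grid scheme is simple to sketch (hypothetical train/predict callables
standing in for the Tester and classifier machinery):

```python
def run_grid(pairs, train, predict):
    """Train on each (ham, spam) pair and predict against every other
    pair: an NxN grid of runs that skips the diagonal."""
    results = []
    for i, train_pair in enumerate(pairs):
        model = train(train_pair)
        for j, test_pair in enumerate(pairs):
            if i == j:
                continue  # skip predicting against the training set
            results.append((i, j, predict(model, test_pair)))
    return results
```

With 5 pairs this yields 5*4 = 20 runs, which is where the 20-row result
tables later in this archive come from.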
From tim_one@users.sourceforge.net Thu Sep 5 21:55:04 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 13:55:04 -0700
Subject: [Spambayes-checkins] spambayes TESTING.txt,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7401
Added Files:
TESTING.txt
Log Message:
Adapted more python-dev msgs into clue form.
--- NEW FILE: TESTING.txt ---
[Clues about the practice of statistical testing, adapted from Tim's
comments on python-dev.]
Combining pairs of words is called "word bigrams". My intuition at the
start was that it would do better. OTOH, my intuition also was that
character n-grams for a relatively large n would do better still. The
latter may be so for "foreign" languages, but for this particular task, using
Graham's scheme on the c.l.py tests, it turns out they sucked. A comment block
in timtest.py explains why.
I didn't try word bigrams because the f-p rate is already supernaturally
low, so there doesn't seem to be anything left to be gained there. This echoes
what Graham sez on his web page:
One idea that I haven't tried yet is to filter based on word pairs, or
even triples, rather than individual words. This should yield a much
sharper estimate of the probability.
My comment with benefit of hindsight: it doesn't. Because the scoring
scheme throws away everything except about a dozen extremes, the
"probabilities" that come out are almost always very near 0 or very near 1;
only very short or (or especially "and") very bland msgs come out in
between. This outcome is largely independent of the tokenization scheme --
the scoring scheme forces it, provided only that the tokenization scheme
produces stuff *some* of which *does* vary in frequency between spam and
ham.
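To see why the scoring forces extremes, here is a stripped-down Graham-style
combiner (a hypothetical sketch, not the classifier.py code; the 15-token
cutoff follows Graham's essay, hence "about a dozen extremes"):

```python
def spamprob(word_probs, max_discriminators=15):
    """Combine per-token spam probabilities Graham-style: keep only the
    max_discriminators tokens whose probs are farthest from 0.5, then
    combine those extremes with the naive-Bayes-like product formula."""
    extremes = sorted(word_probs, key=lambda p: abs(p - 0.5),
                      reverse=True)[:max_discriminators]
    prod = inv = 1.0
    for p in extremes:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)
```

A handful of 0.99 tokens drives the result to essentially 1.0 no matter how
many neutral 0.5 tokens the msg also contains: everything but the extremes
is thrown away.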
For example, in my current database, the word "offers" has a
probability of .96. If you based the probabilities on word pairs, you'd
end up with "special offers" and "valuable offers" having probabilities
of .99 and, say, "approach offers" (as in "this approach offers")
having a probability of .1 or less.
The theory is indeed appealing.
The reason I haven't done this is that filtering based on individual
words already works so well.
Which is also the reason I didn't pursue it.
But it does mean that there is room to tighten the filters if spam gets
harder to detect.
I expect it would also need a different scoring scheme then.
OK, I ran a full test using word bigrams. It gets one strike against it at
the start because the database size grows by a factor between 2 and 3.
That's only justified if the results are better. Before-and-after f-p
(false positive) percentages:
before bigrams
0.000 0.025
0.000 0.025
0.050 0.050
0.000 0.025
0.025 0.050
0.025 0.100
0.050 0.075
0.025 0.025
0.025 0.050
0.000 0.025
0.075 0.050
0.050 0.000
0.025 0.050
0.000 0.025
0.050 0.075
0.025 0.025
0.025 0.025
0.000 0.000
0.025 0.050
0.050 0.025
Lost on 12 runs
Tied on 5 runs
Won on 3 runs
total # of unique fps across all runs rose from 8 to 17
The f-n percentages on the same runs:
before bigrams
1.236 1.091
1.164 1.091
1.454 1.708
1.599 1.563
1.527 1.491
1.236 1.127
1.163 1.345
1.309 1.309
1.891 1.927
1.418 1.382
1.745 1.927
1.708 1.963
1.491 1.782
0.836 0.800
1.091 1.127
1.309 1.309
1.491 1.709
1.127 1.018
1.309 1.018
1.636 1.672
Lost on 9 runs
Tied on 2 runs
Won on 9 runs
total # of unique fns across all runs rose from 336 to 350
This doesn't need deep analysis: it costs more, and on the face of it
either doesn't help, or helps so little it's not worth the cost.
Now I'll tell you in confidence that the way to make a scheme like
this excellent is to keep your ego out of it and let the data *tell* you
what works: getting the best test setup you can is the most important thing
you can possibly do. It must include multiple training and test corpora
(e.g., if I had used only one pair, I would have had a 3/20 chance of
erroneously concluding that bigrams might help the f-p rate, when running
across 20 pairs shows that they almost certainly do it harm; while I would
have had an even chance of drawing a wrong conclusion, in either direction,
about the effect on the f-n rate).
The second most important thing is to run a fat test all the way to the end
before concluding anything. A subtler point is that you should never keep
a change that doesn't *prove* itself a winner: neutral changes bloat your
code with proven irrelevancies that will come back to make your life harder
later, in part because they'll randomly interfere with future changes in
ways that make it harder to recognize a significant change when you stumble
into one.
Most things you try won't help -- indeed, many of them will deliver worse
results. I dare say my intuition for this kind of classification task is
better than most programmers' (in part because I had years of professional
experience in a related field), and most of the things I tried I had to
throw away. BFD -- then you try something else. When I find something
that works I can rationalize it, but when I try something that doesn't, no
amount of argument can change that the data said it sucked.
Two things about *this* task have fooled me repeatedly:
1. The "only look at smoking guns" nature of the scoring step makes many
kinds of "on average" intuitions worthless: "on average" almost
everything is thrown away! For example, you're not going to find bad
results reported for n-grams (neither character- nor word-based) in the
literature, in part because most scoring schemes throw much less away.
Graham's scheme strikes me as brilliant in this specific respect: it's
worth enduring the ego humiliation to get such a spectacularly
low f-p rate from such simple and fast code. Graham's assumption
that the spam-vs-ham distinction should be *easy* pays off big.
2. Most mailing-list messages are much shorter than this one. This
systematically frustrates "well, averaged over enough words" intuitions
too.
Cute: In particular, word bigrams systematically hate conference
announcements. The current word one-gram scheme hated them too, until I
started folding case. Then their SCREAMING stopped acting against them.
But they're still using the language of advertisement, and word bigrams
can't help but notice that more strongly than individual words do.
Here from the TOOLS Europe '99 announcement:
prob('more information') = 0.916003
prob('web site') = 0.895518
prob('please write') = 0.99
prob('you wish') = 0.984494
prob('our web') = 0.985578
prob('visit our') = 0.99
Here from the XP2001 - FINAL CALL FOR PAPERS:
prob('web site:') = 0.926174
prob('receive this') = 0.945813
prob('you receive') = 0.987542
prob('most exciting') = 0.99
prob('alberta, canada') = 0.99
prob('e-mail to:') = 0.99
Here from the XP2002 - CALL FOR PRACTITIONER'S REPORTS ('BOM' is an
artificial token I made up for "beginning of message", to give something
for the first word in the message to pair up with):
prob('web site:') = 0.926174
prob('this announcement') = 0.94359
prob('receive this') = 0.945813
prob('forward this') = 0.99
prob('e-mail to:') = 0.99
prob('BOM *****') = 0.99
prob('you receive') = 0.987542
Here from the TOOLS Europe 2000 announcement:
prob('visit the') = 0.96
prob('you receive') = 0.967805
prob('accept our') = 0.99
prob('our apologies') = 0.99
prob('quality and') = 0.99
prob('receive more') = 0.99
prob('asia and') = 0.99
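A word-bigram tokenizer of the sort behind these probabilities can be
sketched in a few lines (a simplified stand-in for the experimental code;
"BOM" is the artificial beginning-of-message token described above):

```python
def word_bigrams(text):
    """Yield word-pair tokens, pairing the first word with an
    artificial BOM ("beginning of message") token."""
    words = ["BOM"] + text.lower().split()
    for w1, w2 in zip(words, words[1:]):
        yield w1 + " " + w2
```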
A vanilla f-p showing where bigrams can hurt was a short msg about setting
up a Python user's group. Bigrams gave it large penalties for phrases like
"fully functional" (most often seen in spams for bootleg software, but here
applied to the proposed user group's web site -- and "web site" is also a
strong spam indicator!). OTOH, the poster also said "Aahz rocks". As a
bigram, that neither helped nor hurt (that 2-word phrase is unique in the
corpus); but as an individual word, "Aahz" is a strong non-spam indicator
on c.l.py (and will probably remain so until he starts spamming).
It did find one spam hiding in a ham corpus:
"""
NNTP-Posting-Host: 212.64.45.236
Newsgroups: comp.lang.python,comp.lang.rexx
Date: Thu, 21 Oct 1999 10:18:52 -0700
Message-ID: <67821AB23987D311ADB100A0241979E5396955@news.ykm.com>
From: znblrn@hetronet.com
Subject: Rudolph The Rednose Hooters Here
Lines: 4
Path: news!uunet!ffx.uu.net!newsfeed.fast.net!howland.erols.net!newsfeed.cwix.com!news.cfw.com!paxfeed.eni.net!DAIPUB.DataAssociatesInc..com
Xref: news comp.lang.python:74468 comp.lang.rexx:31946
To: python-list@python.org
THis IS it: The site where they talk about when you are 50 years old.
http://huizen.dds.nl/~jansen20
"""
there's-no-substitute-for-experiment-except-drugs-ly y'rs - tim
Other points:
+ Something I didn't do but should have: keep a detailed log of every
experiment run, and of the results you got. The only clues about dozens
of experiments with the current code are in brief "XXX" comment blocks,
and a bunch of test results were lost when we dropped the old checkin
comments on the way to moving this code to SourceForge.
+ Every time you check in an algorithmic change that proved to be a
winner, in theory you should also reconsider every previous change.
You really can't guess whether, e.g., tokenization changes are all
independent of each other, or whether some reinforce others in
helpful ways. In practice there's not enough time to reconsider
everything every time, but do make a habit of reconsidering *something*
each time you've had a success. Nothing is sacred except the results
in the end, and heresy can pay; every decision remains suspect forever.
+ Any sufficiently general scheme with enough free parameters can eventually
be trained to recognize any specific dataset exactly. It's wonderful
if other people test your changes against other datasets too. That's
hard to arrange, so at least change your own data periodically. I'm
suspicious that some of the weirder "proven winner" changes I've made
are really specific to statistical anomalies in my test data; and as
the error rates get closer to 0%, the chance that a winning change helped
only a few specific msgs zooms (of course sometimes that's intentional!
I haven't been shy about adding changes specifically geared toward
squashing very narrow classes of false positives).
From tim_one@users.sourceforge.net Fri Sep 6 00:34:43 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 16:34:43 -0700
Subject: [Spambayes-checkins] spambayes rates.py,NONE,1.1 README.txt,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8648
Modified Files:
README.txt
Added Files:
rates.py
Log Message:
Checking in one of the helper scripts I use to analyze test output.
--- NEW FILE: rates.py ---
"""
rates.py basename
Assuming that file
basename + '.txt'
contains output from timtest.py, scans that file for summary statistics,
displays them to stdout, and also writes them to file
basename + 's.txt'
(where the 's' means 'summary'). This doesn't need a full output file, and
will display stuff for as far as the output file has gotten so far.
Two of these summary files can later be fed to cmp.py.
"""
import re
import sys
"""
Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
false positive: 0.025
false negative: 1.34545454545
new false positives: ['Data/Ham/Set2/66645.txt']
"""
pat1 = re.compile(r'\s*Training on Data/').match
pat2 = re.compile(r'\s+false (positive|negative): (.*)').match
pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match
def doit(basename):
    ifile = file(basename + '.txt')
    oname = basename + 's.txt'
    ofile = file(oname, 'w')
    print basename, '->', oname

    def dump(*stuff):
        msg = ' '.join(map(str, stuff))
        print msg
        print >> ofile, msg

    nfn = nfp = 0
    ntrainedham = ntrainedspam = 0
    for line in ifile:
        "Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams"
        m = pat1(line)
        if m:
            dump(line[:-1])
            fields = line.split()
            ntrainedham += int(fields[-5])
            ntrainedspam += int(fields[-2])
            continue

        "false positive: 0.025"
        "false negative: 1.34545454545"
        m = pat2(line)
        if m:
            kind, guts = m.groups()
            guts = float(guts)
            if kind == 'positive':
                lastval = guts
            else:
                dump('    %7.3f %7.3f' % (lastval, guts))
            continue

        "new false positives: ['Data/Ham/Set2/66645.txt']"
        m = pat3(line)
        if m:   # note that it doesn't match at all if the list is "[]"
            kind, guts = m.groups()
            n = len(guts.split())
            if kind == 'positives':
                nfp += n
            else:
                nfn += n

    dump('total false pos', nfp, nfp * 1e2 / ntrainedham)
    dump('total false neg', nfn, nfn * 1e2 / ntrainedspam)

for name in sys.argv[1:]:
    doit(name)
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** README.txt 5 Sep 2002 20:17:31 -0000 1.1
--- README.txt 5 Sep 2002 23:34:41 -0000 1.2
***************
*** 38,41 ****
--- 38,48 ----
+ Test Utilities
+ ==============
+ rates.py
+ Scans the output (so far) from timtest.py, and captures summary
+ statistics.
+
+
Test Data Utilities
===================
From tim_one@users.sourceforge.net Fri Sep 6 00:42:55 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 16:42:55 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,NONE,1.1 README.txt,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11162
Modified Files:
README.txt
Added Files:
cmp.py
Log Message:
Checking in the script I use to produce listings of changes in f-p
and f-n rates between two test runs.
--- NEW FILE: cmp.py ---
"""
cmp.py sbase1 sbase2
Combines output from sbase1.txt and sbase2.txt, which are created by
rates.py from timtest.py output, and displays comparison statistics to
stdout.
"""
import sys
f1n, f2n = sys.argv[1:3]
NSETS = 5
# Return
# (list of all f-p rates,
# list of all f-n rates,
# total f-p,
# total f-n)
# from summary file f.
def suck(f):
    fns = []
    fps = []
    for block in range(NSETS):
        # Skip, e.g.,
        # Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
        f.readline()
        for inner in range(NSETS - 1):
            # A line with an f-p rate and an f-n rate.
            p, n = map(float, f.readline().split())
            fps.append(p)
            fns.append(n)

    # "total false pos 8 0.04"
    # "total false neg 249 1.81090909091"
    fptot = int(f.readline().split()[-2])
    fntot = int(f.readline().split()[-2])
    return fps, fns, fptot, fntot

def dump(p1s, p2s):
    alltags = ""
    for p1, p2 in zip(p1s, p2s):
        if p1 < p2:
            tag = "lost"
        elif p1 > p2:
            tag = "won"
        else:
            tag = "tied"
        print "    %5.3f %5.3f %s" % (p1, p2, tag)
        alltags += tag + " "
    print
    for tag in "won", "tied", "lost":
        print "%-4s %2d %s" % (tag, alltags.count(tag), "times")
    print
fp1, fn1, fptot1, fntot1 = suck(file(f1n + '.txt'))
fp2, fn2, fptot2, fntot2 = suck(file(f2n + '.txt'))
print f1n, '->', f2n
print
print "false positive percentages"
dump(fp1, fp2)
print "total unique fp went from", fptot1, "to", fptot2
print
print "false negative percentages"
dump(fn1, fn2)
print "total unique fn went from", fntot1, "to", fntot2
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** README.txt 5 Sep 2002 23:34:41 -0000 1.2
--- README.txt 5 Sep 2002 23:42:52 -0000 1.3
***************
*** 44,47 ****
--- 44,52 ----
statistics.
+ cmp.py
+ Given two summary files produced by rates.py, displays an account
+ of all the f-p and f-n rates side-by-side, along with who won which
+ (etc), and the change in total # of f-ps and f-n.
+
Test Data Utilities
From tim_one@users.sourceforge.net Fri Sep 6 00:51:34 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 16:51:34 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13855
Modified Files:
timtest.py
Log Message:
Pure win for the f-n rate: take X-Mailer into account.
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.075 lost
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.075 0.075 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 0 times
tied 19 times
lost 1 times
total unique fp went from 8 to 8
false negative percentages
0.691 0.582 won
0.655 0.618 won
0.945 0.836 won
1.309 1.236 won
1.164 1.018 won
0.800 0.764 won
0.763 0.691 won
1.163 1.054 won
1.345 1.236 won
1.127 1.018 won
1.345 1.236 won
1.490 1.418 won
0.909 0.764 won
0.582 0.473 won
0.691 0.509 won
1.163 0.945 won
1.018 0.945 won
0.873 0.727 won
0.909 0.764 won
1.127 0.981 won
won 20 times
tied 0 times
lost 0 times
total unique fn went from 249 to 226
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** timtest.py 5 Sep 2002 16:16:43 -0000 1.1
--- timtest.py 5 Sep 2002 23:51:32 -0000 1.2
***************
*** 508,513 ****
# From:
# Reply-To:
! # X-Mailer:
! for field in ('from',):# 'reply-to', 'x-mailer',):
prefix = field + ':'
subj = msg.get(field, '-None-')
--- 508,512 ----
# From:
# Reply-To:
! for field in ('from',):# 'reply-to',):
prefix = field + ':'
subj = msg.get(field, '-None-')
***************
*** 515,518 ****
--- 514,526 ----
          for t in tokenize_word(w):
              yield prefix + t
+
+     # These headers seem to work best if they're not tokenized: just
+     # normalize case and whitespace.
+     # X-Mailer: This is a pure and significant win for the f-n rate; f-p
+     #           rate isn't affected.
+     for field in ('x-mailer',):
+         prefix = field + ':'
+         subj = msg.get(field, '-None-')
+         yield prefix + ' '.join(subj.lower().split())
# Organization:
From tim_one@users.sourceforge.net Fri Sep 6 01:10:53 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 17:10:53 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv21652
Modified Files:
timtest.py
Log Message:
Added a note about why User-Agent is skipped.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** timtest.py 5 Sep 2002 23:51:32 -0000 1.2
--- timtest.py 6 Sep 2002 00:10:51 -0000 1.3
***************
*** 519,522 ****
--- 519,527 ----
# X-Mailer: This is a pure and significant win for the f-n rate; f-p
# rate isn't affected.
+ # User-Agent: Skipping it, as it made no difference. Very few spams
+ # had a User-Agent field, but lots of hams didn't either,
+ # and the spam probability of User-Agent was very close to
+ # 0.5 (== not a valuable discriminator) across all training
+ # sets.
for field in ('x-mailer',):
prefix = field + ':'
From tim_one@users.sourceforge.net Fri Sep 6 05:25:47 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 21:25:47 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29991
Modified Files:
cmp.py
Log Message:
Added a %-changed column.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** cmp.py 5 Sep 2002 23:42:52 -0000 1.1
--- cmp.py 6 Sep 2002 04:25:45 -0000 1.2
***************
*** 37,54 ****
      return fps, fns, fptot, fntot

  def dump(p1s, p2s):
      alltags = ""
      for p1, p2 in zip(p1s, p2s):
!         if p1 < p2:
!             tag = "lost"
!         elif p1 > p2:
!             tag = "won"
!         else:
!             tag = "tied"
!         print "    %5.3f %5.3f %s" % (p1, p2, tag)
!         alltags += tag + " "
      print
!     for tag in "won", "tied", "lost":
!         print "%-4s %2d %s" % (tag, alltags.count(tag), "times")
      print
--- 37,61 ----
      return fps, fns, fptot, fntot

+ def tag(p1, p2):
+     if p1 == p2:
+         t = "tied"
+     else:
+         t = p1 < p2 and "lost " or "won "
+         if p1:
+             p = (p2 - p1) * 100.0 / p1
+             t += " %+7.2f%%" % p
+         else:
+             t += " +(was 0)"
+     return t
+
  def dump(p1s, p2s):
      alltags = ""
      for p1, p2 in zip(p1s, p2s):
!         t = tag(p1, p2)
!         print "    %5.3f %5.3f %s" % (p1, p2, t)
!         alltags += t + " "
      print
!     for t in "won", "tied", "lost":
!         print "%-4s %2d %s" % (t, alltags.count(t), "times")
      print
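Outside the diff markup, the new tag() helper reads as below, with a couple
of checks of the %-changed arithmetic (the "+(was 0)" branch exists because
a percent change from a zero rate is undefined):

```python
def tag(p1, p2):
    """Label a before/after rate pair, annotating non-ties with the
    percent change relative to the 'before' rate."""
    if p1 == p2:
        return "tied"
    t = "lost " if p1 < p2 else "won "   # a higher error rate is a loss
    if p1:
        t += " %+7.2f%%" % ((p2 - p1) * 100.0 / p1)
    else:
        t += " +(was 0)"   # percent change from a 0 rate is undefined
    return t
```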
From tim_one@users.sourceforge.net Fri Sep 6 05:41:16 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 21:41:16 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv685
Modified Files:
timtest.py
Log Message:
Generated tokens for:
Content-Type
and its type= param
Content-Disposition
and its filename= param
Content-Transfer-Encoding
all the charsets
This has huge benefit for the f-n rate, and virtually none on the f-p rate,
although it does reduce the variance of the f-p rate across different
training sets (really marginal msgs, like a brief HTML msg saying just
"unsubscribe me", are almost always tagged as spam now; before they were
right on the edge, and now the multipart/alternative pushes them over it
more consistently).
XXX I put all of this in as one chunk. I don't know which parts are
XXX most effective; it could be that some parts don't help at all. But
XXX given the nature of the c.l.py tests, it's not surprising that the
XXX 'content-type:text/html'
XXX token is now the single most powerful spam indicator (== makes it
XXX into the nbest list most often). What *is* a little surprising is
XXX that this doesn't push more mixed-type msgs into the f-p camp --
XXX unlike looking at *all* HTML tags, this is just one spam indicator
XXX instead of dozens, so relevant msg content can cancel it out.
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.075 0.100 lost +33.33%
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.100 lost +100.00%
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.075 0.100 lost +33.33%
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.100 lost +100.00%
won 0 times
tied 16 times
lost 4 times
total unique fp went from 8 to 9
false negative percentages
0.582 0.364 won -37.46%
0.618 0.400 won -35.28%
0.836 0.400 won -52.15%
1.236 0.909 won -26.46%
1.018 0.836 won -17.88%
0.764 0.618 won -19.11%
0.691 0.291 won -57.89%
1.054 1.018 won -3.42%
1.236 0.982 won -20.55%
1.018 0.727 won -28.59%
1.236 0.800 won -35.28%
1.418 1.163 won -17.98%
0.764 0.764 tied
0.473 0.473 tied
0.509 0.473 won -7.07%
0.945 0.727 won -23.07%
0.945 0.655 won -30.69%
0.727 0.509 won -29.99%
0.764 0.545 won -28.66%
0.981 0.509 won -48.11%
won 18 times
tied 2 times
lost 0 times
total unique fn went from 226 to 168
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** timtest.py 6 Sep 2002 00:10:51 -0000 1.3
--- timtest.py 6 Sep 2002 04:41:13 -0000 1.4
***************
*** 477,480 ****
--- 477,531 ----
yield "skip:%c %d" % (word[0], n // 10 * 10)
+ # Generate tokens for:
+ # Content-Type
+ # and its type= param
+ # Content-Disposition
+ # and its filename= param
+ # Content-Transfer-Encoding
+ # all the charsets
+ #
+ # This has huge benefit for the f-n rate, and virtually none on the f-p rate,
+ # although it does reduce the variance of the f-p rate across different
+ # training sets (really marginal msgs, like a brief HTML msg saying just
+ # "unsubscribe me", are almost always tagged as spam now; before they were
+ # right on the edge, and now the multipart/alternative pushes them over it
+ # more consistently).
+ #
+ # XXX I put all of this in as one chunk. I don't know which parts are
+ # XXX most effective; it could be that some parts don't help at all. But
+ # XXX given the nature of the c.l.py tests, it's not surprising that the
+ # XXX 'content-type:text/html'
+ # XXX token is now the single most powerful spam indicator (== makes it
+ # XXX into the nbest list most often). What *is* a little surprising is
+ # XXX that this doesn't push more mixed-type msgs into the f-p camp --
+ # XXX unlike looking at *all* HTML tags, this is just one spam indicator
+ # XXX instead of dozens, so relevant msg content can cancel it out.
+ def crack_content_xyz(msg):
+     x = msg.get_type()
+     if x is not None:
+         yield 'content-type:' + x.lower()
+
+     x = msg.get_param('type')
+     if x is not None:
+         yield 'content-type/type:' + x.lower()
+
+     for x in msg.get_charsets(None):
+         if x is not None:
+             yield 'charset:' + x.lower()
+
+     x = msg.get('content-disposition')
+     if x is not None:
+         yield 'content-disposition:' + x.lower()
+
+     fname = msg.get_filename()
+     if fname is not None:
+         for x in fname.lower().split('/'):
+             for y in x.split('.'):
+                 yield 'filename:' + y
+
+     x = msg.get('content-transfer-encoding:')
+     if x is not None:
+         yield 'content-transfer-encoding:' + x.lower()
+
def tokenize(string):
# Create an email Message object.
***************
*** 493,502 ****
# XXX where "safe" is specific to my sorry corpora.
# Subject:
# Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
# especially significant in this context. Experiment showed a small
# but real benefit to keeping case intact in this specific context.
! subj = msg.get('subject', '')
! for w in subject_word_re.findall(subj):
for t in tokenize_word(w):
yield 'subject:' + t
--- 544,560 ----
# XXX where "safe" is specific to my sorry corpora.
+     # Content-{Transfer-Encoding, Type, Disposition} and their params.
+     t = ''
+     for x in msg.walk():
+         for w in crack_content_xyz(x):
+             yield t + w
+         t = '>'
+
# Subject:
# Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
# especially significant in this context. Experiment showed a small
# but real benefit to keeping case intact in this specific context.
! x = msg.get('subject', '')
! for w in subject_word_re.findall(x):
for t in tokenize_word(w):
yield 'subject:' + t
***************
*** 510,515 ****
for field in ('from',):# 'reply-to',):
prefix = field + ':'
! subj = msg.get(field, '-None-')
! for w in subj.lower().split():
for t in tokenize_word(w):
yield prefix + t
--- 568,573 ----
for field in ('from',):# 'reply-to',):
prefix = field + ':'
! x = msg.get(field, 'none').lower()
! for w in x.split():
for t in tokenize_word(w):
yield prefix + t
***************
*** 526,531 ****
for field in ('x-mailer',):
prefix = field + ':'
! subj = msg.get(field, '-None-')
! yield prefix + ' '.join(subj.lower().split())
# Organization:
--- 584,589 ----
for field in ('x-mailer',):
prefix = field + ':'
! x = msg.get(field, 'none').lower()
! yield prefix + ' '.join(x.split())
# Organization:
From tim_one@users.sourceforge.net Fri Sep 6 18:12:51 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 10:12:51 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4345
Modified Files:
timtest.py
Log Message:
Included commented-out code for Anthony Baxter's mondo cool "count the
# of headers" idea.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** timtest.py 6 Sep 2002 04:41:13 -0000 1.4
--- timtest.py 6 Sep 2002 17:12:49 -0000 1.5
***************
*** 595,598 ****
--- 595,618 ----
yield "bool:noorg"
+ # XXX Following is a great idea due to Anthony Baxter. I can't use it
+ # XXX on my test data because the header lines are so different between
+ # XXX my ham and spam that it makes a large improvement for bogus
+ # XXX reasons. So it's commented out. But it's clearly a good thing
+ # XXX to do on "normal" data, and subsumes the Organization trick above
+ # XXX in a much more general way, yet at comparable cost.
+ ### X-UIDL:
+ ### Anthony Baxter's idea. This has spamprob 0.99! The value is clearly
+ ### irrelevant, just the presence or absence matters. However, it's
+ ### extremely rare in my spam sets, so doesn't have much value.
+ ###
+ ### As also suggested by Anthony, we can capture all such header oddities
+ ### just by generating tags for the count of how many times each header
+ ### field appears.
+ ##x2n = {}
+ ##for x in msg.keys():
+ ## x2n[x] = x2n.get(x, 0) + 1
+ ##for x in x2n.items():
+ ## yield "header:%s:%d" % x
+
# Find, decode (base64, qp), and tokenize the textual parts of the body.
for part in textparts(msg):
From tim_one@users.sourceforge.net Fri Sep 6 18:33:28 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 10:33:28 -0700
Subject: [Spambayes-checkins]
spambayes timtoken.py,NONE,1.1 README.txt,1.3,1.4 timtest.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10727
Modified Files:
README.txt timtest.py
Added Files:
timtoken.py
Log Message:
Split all knowledge of tokenization out of timtest.py and into a new
timtoken.py. You can use any tokenize() function you like now.
--- NEW FILE: timtoken.py ---
import re
import email
from email import message_from_string
from sets import Set
__all__ = ['tokenize']
# Find all the text components of the msg. There's no point decoding
# binary blobs (like images). If a multipart/alternative has both plain
# text and HTML versions of a msg, ignore the HTML part: HTML decorations
# have monster-high spam probabilities, and innocent newbies often post
# using HTML.
def textparts(msg):
text = Set()
redundant_html = Set()
for part in msg.walk():
if part.get_content_type() == 'multipart/alternative':
# Descend this part of the tree, adding any redundant HTML text
# part to redundant_html.
htmlpart = textpart = None
stack = part.get_payload()
while stack:
subpart = stack.pop()
ctype = subpart.get_content_type()
if ctype == 'text/plain':
textpart = subpart
elif ctype == 'text/html':
htmlpart = subpart
elif ctype == 'multipart/related':
stack.extend(subpart.get_payload())
if textpart is not None:
text.add(textpart)
if htmlpart is not None:
redundant_html.add(htmlpart)
elif htmlpart is not None:
text.add(htmlpart)
elif part.get_content_maintype() == 'text':
text.add(part)
return text - redundant_html
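For concreteness, here's a standalone sketch of the plain-over-HTML preference textparts() implements. The sample message and the simplified keep/drop logic are illustrative only (they don't handle nesting or multipart/related the way the real function does):

```python
from email import message_from_string

# Hypothetical multipart/alternative message: a plain-text part plus a
# redundant HTML rendering of the same content.
raw = ("Content-Type: multipart/alternative; boundary=BND\n"
       "\n"
       "--BND\n"
       "Content-Type: text/plain\n"
       "\n"
       "plain version\n"
       "--BND\n"
       "Content-Type: text/html\n"
       "\n"
       "<b>html version</b>\n"
       "--BND--\n")
msg = message_from_string(raw)

# Simplified preference: keep text parts, but drop a text/html sibling
# when it lives under a multipart/alternative container.
kept = []
for part in msg.walk():
    if part.get_content_maintype() != 'text':
        continue
    if (part.get_content_type() == 'text/html'
            and msg.get_content_type() == 'multipart/alternative'):
        continue
    kept.append(part)

print([p.get_content_type() for p in kept])  # ['text/plain']
```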
##############################################################################
# To fold case or not to fold case? I didn't want to fold case, because
# it hides information in English, and I have no idea what .lower() does
# to other languages; and, indeed, 'FREE' (all caps) turned out to be one
# of the strongest spam indicators in my content-only tests (== one with
# prob 0.99 *and* made it into spamprob's nbest list very often).
#
# Against preserving case, it makes the database size larger, and requires
# more training data to get enough "representative" mixed-case examples.
#
# Running my c.l.py tests didn't support my intuition that case was
# valuable, so it's getting folded away now. Folding or not made no
# significant difference to the false positive rate, and folding made a
# small (but statistically significant all the same) reduction in the
# false negative rate. There is one obvious difference: after folding
# case, conference announcements no longer got high spam scores. Their
# content was usually fine, but they were highly penalized for VISIT OUR
# WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING. That is
# indeed the language of advertising, and I halfway regret that folding
# away case no longer picks on them.
#
# Since the f-p rate didn't change, but conference announcements escaped
# that category, something else took their place. It seems to be highly
# off-topic messages, like debates about Microsoft's place in the world.
# Talk about "money" and "lucrative" is indistinguishable now from talk
# about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.
##############################################################################
# Character n-grams or words?
#
# With careful multiple-corpora c.l.py tests sticking to case-folded decoded
# text-only portions, and ignoring headers, and with identical special
# parsing & tagging of embedded URLs:
#
# Character 3-grams gave 5x as many false positives as split-on-whitespace
# (s-o-w). The f-n rate was also significantly worse, but within a factor
# of 2. So character 3-grams lost across the board.
#
# Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
# s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
# difference between 23 and 34 f-ps. There aren't enough there to say that's
# significantly more with killer-high confidence. There were plenty of f-ns,
# though, and the f-n rate with character 5-grams was substantially *worse*
# than with character 3-grams (which in turn was substantially worse than
# with s-o-w).
#
# Training on character 5-grams creates many more unique tokens than s-o-w:
# a typical run bloated to 150MB process size. It also ran a lot slower than
# s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
# dict. I rarely noticed disk activity when running s-o-w, so rarely bothered
# to look at process size; it was under 30MB last time I looked.
#
# Figuring out *why* a msg scored as it did proved much more mysterious when
# working with character n-grams: they often had no obvious "meaning". In
# contrast, it was always easy to figure out what s-o-w was picking up on.
# 5-grams flagged a msg from Christian Tismer as spam, where he was discussing
# the speed of tasklets under his new implementation of stackless:
#
# prob = 0.99999998959
# prob('ed sw') = 0.01
# prob('http0:pgp') = 0.01
# prob('http0:python') = 0.01
# prob('hlon ') = 0.99
# prob('http0:wwwkeys') = 0.01
# prob('http0:starship') = 0.01
# prob('http0:stackless') = 0.01
# prob('n xp ') = 0.99
# prob('on xp') = 0.99
# prob('p 150') = 0.99
# prob('lon x') = 0.99
# prob(' amd ') = 0.99
# prob(' xp 1') = 0.99
# prob(' athl') = 0.99
# prob('1500+') = 0.99
# prob('xp 15') = 0.99
#
# The spam decision was baffling until I realized that *all* the high-
# probability spam 5-grams there came out of a single phrase:
#
# AMD Athlon XP 1500+
#
# So Christian was punished for using a machine lots of spam tries to sell.
# In a classic Bayesian classifier, this probably wouldn't have
# mattered, but Graham's throws away almost all the 5-grams from a msg,
# saving only the about-a-dozen farthest from a neutral 0.5. So one bad
# phrase can kill you! This appears to happen very rarely, but happened
# more than once.
#
# The conclusion is that character n-grams have almost nothing to recommend
# them under Graham's scheme: harder to work with, slower, much larger
# database, worse results, and prone to rare mysterious disasters.
#
# There's one area they won hands-down: detecting spam in what I assume are
# Asian languages. The s-o-w scheme sometimes finds only line-ends to split
# on then, and then a "hey, this 'word' is way too big! let's ignore it"
# gimmick kicks in, and produces no tokens at all.
#
# [Later: we produce character 5-grams then under the s-o-w scheme, instead
# of ignoring the blob, but only if there are high-bit characters in the blob;
# e.g., there's no point 5-gramming uuencoded lines, and doing so would
# bloat the database size.]
#
# Interesting: despite that odd example above, the *kinds* of f-p mistakes
# 5-grams made were very much like s-o-w made -- I recognized almost all of
# the 5-gram f-p messages from previous s-o-w runs. For example, both
# schemes have a particular hatred for conference announcements, although
# s-o-w stopped hating them after folding case. But 5-grams still hate them.
# Both schemes also hate msgs discussing HTML with examples, with about equal
# passion. Both schemes hate brief "please subscribe [unsubscribe] me"
# msgs, although 5-grams seems to hate them more.
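The two token schemes compared above are easy to see side by side. This is an illustrative sketch (the helper names are made up, not from the checkin); note how every 5-gram of "AMD Athlon XP 1500+" becomes its own token, which is how one phrase can dominate a score:

```python
def sow_tokens(text):
    # split-on-whitespace (s-o-w), after case folding
    return text.lower().split()

def char_ngrams(text, n=5):
    # overlapping character n-grams over the case-folded text
    t = text.lower()
    return [t[i:i+n] for i in range(len(t) - n + 1)]

sample = "AMD Athlon XP 1500+"
print(sow_tokens(sample))       # ['amd', 'athlon', 'xp', '1500+']
print(char_ngrams(sample)[:3])  # ['amd a', 'md at', 'd ath']
```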
##############################################################################
# How to tokenize?
#
# I started with string.split() merely for speed. Over time I realized it
# was making interesting context distinctions qualitatively akin to n-gram
# schemes; e.g., "free!!" is a much stronger spam indicator than "free". But
# unlike n-grams (whether word- or character- based) under Graham's scoring
# scheme, this mild context dependence never seems to go over the edge in
# giving "too much" credence to an unlucky phrase.
#
# OTOH, compared to "searching for words", it increases the size of the
# database substantially, less than but close to a factor of 2. This is very
# much less than a word bigram scheme bloats it, but as always an increase
# isn't justified unless the results are better.
#
# Following are stats comparing
#
# for token in text.split(): # left column
#
# to
#
# for token in re.findall(r"[\w$\-\x80-\xff]+", text): # right column
#
# text is case-normalized (text.lower()) in both cases, and the runs were
# identical in all other respects. The results clearly favor the split()
# gimmick, although they vaguely suggest that some sort of compromise
# may do as well with less database burden; e.g., *perhaps* folding runs of
# "punctuation" characters into a canonical representative could do that.
# But the database size is reasonable without that, and plain split() avoids
# having to worry about how to "fold punctuation" in languages other than
# English.
#
# false positive percentages
# 0.000 0.000 tied
# 0.000 0.050 lost
# 0.050 0.150 lost
# 0.000 0.025 lost
# 0.025 0.050 lost
# 0.025 0.075 lost
# 0.050 0.150 lost
# 0.025 0.000 won
# 0.025 0.075 lost
# 0.000 0.025 lost
# 0.075 0.150 lost
# 0.050 0.050 tied
# 0.025 0.050 lost
# 0.000 0.025 lost
# 0.050 0.025 won
# 0.025 0.000 won
# 0.025 0.025 tied
# 0.000 0.025 lost
# 0.025 0.075 lost
# 0.050 0.175 lost
#
# won 3 times
# tied 3 times
# lost 14 times
#
# total unique fp went from 8 to 20
#
# false negative percentages
# 0.945 1.200 lost
# 0.836 1.018 lost
# 1.200 1.200 tied
# 1.418 1.636 lost
# 1.455 1.418 won
# 1.091 1.309 lost
# 1.091 1.272 lost
# 1.236 1.563 lost
# 1.564 1.855 lost
# 1.236 1.491 lost
# 1.563 1.599 lost
# 1.563 1.781 lost
# 1.236 1.709 lost
# 0.836 0.982 lost
# 0.873 1.382 lost
# 1.236 1.527 lost
# 1.273 1.418 lost
# 1.018 1.273 lost
# 1.091 1.091 tied
# 1.490 1.454 won
#
# won 2 times
# tied 2 times
# lost 16 times
#
# total unique fn went from 292 to 302
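The two tokenizer lines compared in the stats above can be run side by side on a made-up sample; this shows the context distinction the text describes ("free!!" survives split() but the regexp reduces it to "free"):

```python
import re

text = "free!! now only $29.95 -- visit python-dev".lower()

split_tokens = text.split()                         # left column above
re_tokens = re.findall(r"[\w$\-\x80-\xff]+", text)  # right column above

print(split_tokens)  # keeps 'free!!' and '$29.95' intact
print(re_tokens)     # strips '!' and splits '$29.95' on the '.'
```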
##############################################################################
# What about HTML?
#
# Computer geeks seem to view use of HTML in mailing lists and newsgroups as
# a mortal sin. Normal people don't, but so it goes: in a technical list/
# group, every HTML decoration has spamprob 0.99, there are lots of unique
# HTML decorations, and lots of them appear at the very start of the message
# so that Graham's scoring scheme latches on to them tight. As a result,
# any plain text message just containing an HTML example is likely to be
# judged spam (every HTML decoration is an extreme).
#
# So if a message is multipart/alternative with both text/plain and text/html
# branches, we ignore the latter, else newbies would never get a message
# through. If a message is just HTML, it has virtually no chance of getting
# through.
#
# In an effort to let normal people use mailing lists too, and to
# alleviate the woes of messages merely *discussing* HTML practice, I
# added a gimmick to strip HTML tags after case-normalization and after
# special tagging of embedded URLs. This consisted of a regexp sub pattern,
# where instances got replaced by single blanks:
#
# html_re = re.compile(r"""
# <
# [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
# [^>]{0,128} # search for the end '>', but don't chew up the world
# >
# """, re.VERBOSE)
#
# and then
#
# text = html_re.sub(' ', text)
#
# Alas, little good came of this:
#
# false positive percentages
# 0.000 0.000 tied
# 0.000 0.000 tied
# 0.050 0.075 lost
# 0.000 0.000 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.050 0.050 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.000 0.050 lost
# 0.075 0.100 lost
# 0.050 0.050 tied
# 0.025 0.025 tied
# 0.000 0.025 lost
# 0.050 0.050 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.000 0.000 tied
# 0.025 0.050 lost
# 0.050 0.050 tied
#
# won 0 times
# tied 15 times
# lost 5 times
#
# total unique fp went from 8 to 12
#
# false negative percentages
# 0.945 1.164 lost
# 0.836 1.418 lost
# 1.200 1.272 lost
# 1.418 1.272 won
# 1.455 1.273 won
# 1.091 1.382 lost
# 1.091 1.309 lost
# 1.236 1.381 lost
# 1.564 1.745 lost
# 1.236 1.564 lost
# 1.563 1.781 lost
# 1.563 1.745 lost
# 1.236 1.455 lost
# 0.836 0.982 lost
# 0.873 1.309 lost
# 1.236 1.381 lost
# 1.273 1.273 tied
# 1.018 1.273 lost
# 1.091 1.200 lost
# 1.490 1.599 lost
#
# won 2 times
# tied 1 times
# lost 17 times
#
# total unique fn went from 292 to 327
#
# The messages merely discussing HTML were no longer fps, so it did what it
# intended there. But the f-n rate nearly doubled on at least one run -- so
# strong a set of spam indicators is the mere presence of HTML. The increase
# in the number of fps, despite the HTML-discussing msgs leaving that
# category, remains mysterious to me, but it wasn't a significant increase,
# so I let it drop.
#
# Later: If I simply give up on making mailing lists friendly to my sisters
# (they're not nerds, and create wonderfully attractive HTML msgs), a
# compromise is to strip HTML tags from only text/plain msgs. That's
# principled enough so far as it goes, and eliminates the HTML-discussing
# false positives. It remains disturbing that the f-n rate on pure HTML
# msgs increases significantly when stripping tags, so the code here doesn't
# do that part. However, even after stripping tags, the rates above show that
# at least 98% of spams are still correctly identified as spam.
# XXX So, if another way is found to slash the f-n rate, the decision here
# XXX not to strip HTML from HTML-only msgs should be revisited.
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
:// # skip the boilerplate
# Do a reasonable attempt at detecting the end. It may or may not
# be in HTML, may or may not be in quotes, etc. If it's full of %
# escapes, cool -- that's a clue too.
([^\s<>'"\x7f-\xff]+) # capture the guts
""", re.VERBOSE)
urlsep_re = re.compile(r"[;?:@&=+,$.]")
has_highbit_char = re.compile(r"[\x80-\xff]").search
# Cheap-ass gimmick to probabilistically find HTML/XML tags.
html_re = re.compile(r"""
<
[^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
[^>]{0,128} # search for the end '>', but don't run wild
>
""", re.VERBOSE)
# I'm usually just splitting on whitespace, but for subject lines I want to
# break things like "Python/Perl comparison?" up. OTOH, I don't want to
# break up the unitized numbers in spammish subject phrases like "Increase
# size 79%" or "Now only $29.95!". Then again, I do want to break up
# "Python-Dev".
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
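The behaviors described in the comment fall straight out of the pattern; a few illustrative subject lines:

```python
import re

# Same pattern as above, reproduced so this snippet runs standalone.
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")

print(subject_word_re.findall("Python/Perl comparison?"))  # breaks on '/' and '?'
print(subject_word_re.findall("Now only $29.95!"))         # keeps '$29.95' whole
print(subject_word_re.findall("Increase size 79%"))        # keeps '79%' whole
print(subject_word_re.findall("Python-Dev"))               # breaks on '-'
```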
def tokenize_word(word, _len=len):
n = _len(word)
# XXX How big should "a word" be?
# XXX I expect 12 is fine -- a test run boosting to 13 had no effect
# XXX on f-p rate, and did a little better or worse than 12 across
# XXX runs -- overall, no significant difference. It's only "common
# XXX sense" so far driving the exclusion of lengths 1 and 2.
# Make sure this range matches in tokenize().
if 3 <= n <= 12:
yield word
elif n >= 3:
# A long word.
# Don't want to skip embedded email addresses.
if n < 40 and '.' in word and word.count('@') == 1:
p1, p2 = word.split('@')
yield 'email name:' + p1
for piece in p2.split('.'):
yield 'email addr:' + piece
# If there are any high-bit chars,
# tokenize it as byte 5-grams.
# XXX This really won't work for high-bit languages -- the scoring
# XXX scheme throws almost everything away, and one bad phrase can
# XXX generate enough bad 5-grams to dominate the final score.
# XXX This also increases the database size substantially.
elif has_highbit_char(word):
for i in xrange(n-4):
yield "5g:" + word[i : i+5]
else:
# It's a long string of "normal" chars. Ignore it.
# For example, it may be an embedded URL (which we already
# tagged), or a uuencoded line.
# There's value in generating a token indicating roughly how
# many chars were skipped. This has real benefit for the f-n
# rate, but is neutral for the f-p rate. I don't know why!
# XXX Figure out why, and/or see if some other way of summarizing
# XXX this info has greater benefit.
yield "skip:%c %d" % (word[0], n // 10 * 10)
# Generate tokens for:
# Content-Type
# and its type= param
# Content-Disposition
# and its filename= param
# Content-Transfer-Encoding
# all the charsets
#
# This has huge benefit for the f-n rate, and virtually none on the f-p rate,
# although it does reduce the variance of the f-p rate across different
# training sets (really marginal msgs, like a brief HTML msg saying just
# "unsubscribe me", are almost always tagged as spam now; before they were
# right on the edge, and now the multipart/alternative pushes them over it
# more consistently).
#
# XXX I put all of this in as one chunk. I don't know which parts are
# XXX most effective; it could be that some parts don't help at all. But
# XXX given the nature of the c.l.py tests, it's not surprising that the
# XXX 'content-type:text/html'
# XXX token is now the single most powerful spam indicator (== makes it
# XXX into the nbest list most often). What *is* a little surprising is
# XXX that this doesn't push more mixed-type msgs into the f-p camp --
# XXX unlike looking at *all* HTML tags, this is just one spam indicator
# XXX instead of dozens, so relevant msg content can cancel it out.
def crack_content_xyz(msg):
x = msg.get_type()
if x is not None:
yield 'content-type:' + x.lower()
x = msg.get_param('type')
if x is not None:
yield 'content-type/type:' + x.lower()
for x in msg.get_charsets(None):
if x is not None:
yield 'charset:' + x.lower()
x = msg.get('content-disposition')
if x is not None:
yield 'content-disposition:' + x.lower()
fname = msg.get_filename()
if fname is not None:
for x in fname.lower().split('/'):
for y in x.split('.'):
yield 'filename:' + y
x = msg.get('content-transfer-encoding')  # header name has no trailing colon
if x is not None:
yield 'content-transfer-encoding:' + x.lower()
def tokenize(string):
# Create an email Message object.
try:
msg = message_from_string(string)
except email.Errors.MessageParseError:
yield 'control: MessageParseError'
# XXX Fall back to the raw body text?
return
# Special tagging of header lines.
# XXX TODO Neil Schemenauer has gotten a good start on this (pvt email).
# XXX The headers in my spam and ham corpora are so different (they came
# XXX from different sources) that if I include them the classifier's
# XXX job is trivial. Only some "safe" header lines are included here,
# XXX where "safe" is specific to my sorry corpora.
# Content-{Transfer-Encoding, Type, Disposition} and their params.
t = ''
for x in msg.walk():
for w in crack_content_xyz(x):
yield t + w
t = '>'
# Subject:
# Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
# especially significant in this context. Experiment showed a small
# but real benefit to keeping case intact in this specific context.
x = msg.get('subject', '')
for w in subject_word_re.findall(x):
for t in tokenize_word(w):
yield 'subject:' + t
# Dang -- I can't use Sender:. If I do,
# 'sender:email name:python-list-admin'
# becomes the most powerful indicator in the whole database.
#
# From:
# Reply-To:
for field in ('from',):# 'reply-to',):
prefix = field + ':'
x = msg.get(field, 'none').lower()
for w in x.split():
for t in tokenize_word(w):
yield prefix + t
# These headers seem to work best if they're not tokenized: just
# normalize case and whitespace.
# X-Mailer: This is a pure and significant win for the f-n rate; f-p
# rate isn't affected.
# User-Agent: Skipping it, as it made no difference. Very few spams
# had a User-Agent field, but lots of hams didn't either,
# and the spam probability of User-Agent was very close to
# 0.5 (== not a valuable discriminator) across all training
# sets.
for field in ('x-mailer',):
prefix = field + ':'
x = msg.get(field, 'none').lower()
yield prefix + ' '.join(x.split())
# Organization:
# Oddly enough, tokenizing this doesn't make any difference to results.
# However, noting its mere absence is strong enough to give a tiny
# improvement in the f-n rate, and since recording that requires only
# one token across the whole database, the cost is also tiny.
if msg.get('organization', None) is None:
yield "bool:noorg"
# XXX Following is a great idea due to Anthony Baxter. I can't use it
# XXX on my test data because the header lines are so different between
# XXX my ham and spam that it makes a large improvement for bogus
# XXX reasons. So it's commented out. But it's clearly a good thing
# XXX to do on "normal" data, and subsumes the Organization trick above
# XXX in a much more general way, yet at comparable cost.
### X-UIDL:
### Anthony Baxter's idea. This has spamprob 0.99! The value is clearly
### irrelevant, just the presence or absence matters. However, it's
### extremely rare in my spam sets, so doesn't have much value.
###
### As also suggested by Anthony, we can capture all such header oddities
### just by generating tags for the count of how many times each header
### field appears.
##x2n = {}
##for x in msg.keys():
## x2n[x] = x2n.get(x, 0) + 1
##for x in x2n.items():
## yield "header:%s:%d" % x
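The commented-out header-count idea above is easy to try standalone; the sample message (with a deliberately repeated header) is hypothetical:

```python
from email import message_from_string

# Hypothetical message with a duplicated Received: header.
raw = ("Received: from a\n"
       "Received: from b\n"
       "Subject: hello\n"
       "\n"
       "body\n")
msg = message_from_string(raw)

x2n = {}
for name in msg.keys():  # keys() repeats duplicated header fields
    x2n[name] = x2n.get(name, 0) + 1

tokens = sorted("header:%s:%d" % kv for kv in x2n.items())
print(tokens)  # ['header:Received:2', 'header:Subject:1']
```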
# Find, decode (base64, qp), and tokenize the textual parts of the body.
for part in textparts(msg):
# Decode, or take it as-is if decoding fails.
try:
text = part.get_payload(decode=True)
except:
yield "control: couldn't decode"
text = part.get_payload(decode=False)
if text is None:
yield 'control: payload is None'
continue
# Normalize case.
text = text.lower()
# Special tagging of embedded URLs.
for proto, guts in url_re.findall(text):
yield "proto:" + proto
# Lose the trailing punctuation for casual embedding, like:
# The code is at http://mystuff.org/here? Didn't resolve.
# or
# I found it at http://mystuff.org/there/. Thanks!
assert guts
while guts and guts[-1] in '.:?!/':
guts = guts[:-1]
for i, piece in enumerate(guts.split('/')):
prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
for chunk in urlsep_re.split(piece):
yield prefix + chunk
# Remove HTML/XML tags if it's a plain text message.
if part.get_content_type() == "text/plain":
text = html_re.sub(' ', text)
# Tokenize everything.
for w in text.split():
n = len(w)
# Make sure this range matches in tokenize_word().
if 3 <= n <= 12:
yield w
elif n >= 3:
for t in tokenize_word(w):
yield t
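The URL tagging inside tokenize() above (the proto: token plus positional httpN: pieces) can be exercised standalone. The comment-only parts of the verbose pattern are dropped here, and the sample URL is illustrative; note it reproduces tokens like 'http0:python' seen in the probability dump earlier:

```python
import re

# Same patterns as in timtoken.py, repeated so this snippet runs standalone.
url_re = re.compile(r"""
    (https? | ftp)          # capture the protocol
    ://                     # skip the boilerplate
    ([^\s<>'"\x7f-\xff]+)   # capture the guts
""", re.VERBOSE)
urlsep_re = re.compile(r"[;?:@&=+,$.]")

text = "the code is at http://python.org/sigs/doc-sig/ now"
tokens = []
for proto, guts in url_re.findall(text):
    tokens.append("proto:" + proto)
    # Lose trailing punctuation from casual embedding.
    while guts and guts[-1] in '.:?!/':
        guts = guts[:-1]
    for i, piece in enumerate(guts.split('/')):
        # First two path positions get their index; the rest share '>1'.
        prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
        for chunk in urlsep_re.split(piece):
            tokens.append(prefix + chunk)
print(tokens)
# ['proto:http', 'http0:python', 'http0:org', 'http1:sigs', 'http>1:doc-sig']
```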
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** README.txt 5 Sep 2002 23:42:52 -0000 1.3
--- README.txt 6 Sep 2002 17:33:25 -0000 1.4
***************
*** 27,34 ****
of false positives and false negatives.
timtest.py
! A concrete test driver and tokenizer that uses Tester and
! classifier (above). This assumes "a standard" test data setup
! (see below). Could stand massive refactoring.
GBayes.py
--- 27,39 ----
of false positives and false negatives.
+ timtoken.py
+ An implementation of tokenize() that Tim can't seem to help but keep
+ working on.
+
timtest.py
! A concrete test driver that uses Tester and classifier (above). This
! assumes "a standard" test data setup (see below). Could stand massive
! refactoring. You need to fiddle a line near the top to import a
! tokenize() function of your choosing.
GBayes.py
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** timtest.py 6 Sep 2002 17:12:49 -0000 1.5
--- timtest.py 6 Sep 2002 17:33:26 -0000 1.6
***************
*** 7,18 ****
import os
- import re
from sets import Set
- import email
- from email import message_from_string
import cPickle as pickle
import Tester
import classifier
class Hist:
--- 7,16 ----
import os
from sets import Set
import cPickle as pickle
import Tester
import classifier
+ from timtoken import tokenize
class Hist:
***************
*** 57,663 ****
print "Spam distribution for", tag
spam.display()
-
- # Find all the text components of the msg. There's no point decoding
- # binary blobs (like images). If a multipart/alternative has both plain
- # text and HTML versions of a msg, ignore the HTML part: HTML decorations
- # have monster-high spam probabilities, and innocent newbies often post
- # using HTML.
- def textparts(msg):
- text = Set()
- redundant_html = Set()
- for part in msg.walk():
- if part.get_content_type() == 'multipart/alternative':
- # Descend this part of the tree, adding any redundant HTML text
- # part to redundant_html.
- htmlpart = textpart = None
- stack = part.get_payload()
- while stack:
- subpart = stack.pop()
- ctype = subpart.get_content_type()
- if ctype == 'text/plain':
- textpart = subpart
- elif ctype == 'text/html':
- htmlpart = subpart
- elif ctype == 'multipart/related':
- stack.extend(subpart.get_payload())
-
- if textpart is not None:
- text.add(textpart)
- if htmlpart is not None:
- redundant_html.add(htmlpart)
- elif htmlpart is not None:
- text.add(htmlpart)
-
- elif part.get_content_maintype() == 'text':
- text.add(part)
-
- return text - redundant_html
-
- ##############################################################################
- # To fold case or not to fold case? I didn't want to fold case, because
- # it hides information in English, and I have no idea what .lower() does
- # to other languages; and, indeed, 'FREE' (all caps) turned out to be one
- # of the strongest spam indicators in my content-only tests (== one with
- # prob 0.99 *and* made it into spamprob's nbest list very often).
- #
- # Against preserving case, it makes the database size larger, and requires
- # more training data to get enough "representative" mixed-case examples.
- #
- # Running my c.l.py tests didn't support my intuition that case was
- # valuable, so it's getting folded away now. Folding or not made no
- # significant difference to the false positive rate, and folding made a
- # small (but statistically significant all the same) reduction in the
- # false negative rate. There is one obvious difference: after folding
- # case, conference announcements no longer got high spam scores. Their
- # content was usually fine, but they were highly penalized for VISIT OUR
- # WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING. That is
- # indeed the language of advertising, and I halfway regret that folding
- # away case no longer picks on them.
- #
- # Since the f-p rate didn't change, but conference announcements escaped
- # that category, something else took their place. It seems to be highly
- # off-topic messages, like debates about Microsoft's place in the world.
- # Talk about "money" and "lucrative" is indistinguishable now from talk
- # about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.
-
-
- ##############################################################################
- # Character n-grams or words?
- #
- # With careful multiple-corpora c.l.py tests sticking to case-folded decoded
- # text-only portions, and ignoring headers, and with identical special
- # parsing & tagging of embedded URLs:
- #
- # Character 3-grams gave 5x as many false positives as split-on-whitespace
- # (s-o-w). The f-n rate was also significantly worse, but within a factor
- # of 2. So character 3-grams lost across the board.
- #
- # Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
- # s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
- # difference between 23 and 34 f-ps. There aren't enough there to say that's
- # significantly more with killer-high confidence. There were plenty of f-ns,
- # though, and the f-n rate with character 5-grams was substantially *worse*
- # than with character 3-grams (which in turn was substantially worse than
- # with s-o-w).
- #
- # Training on character 5-grams creates many more unique tokens than s-o-w:
- # a typical run bloated to 150MB process size. It also ran a lot slower than
- # s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
- # dict. I rarely noticed disk activity when running s-o-w, so rarely bothered
- # to look at process size; it was under 30MB last time I looked.
- #
- # Figuring out *why* a msg scored as it did proved much more mysterious when
- # working with character n-grams: they often had no obvious "meaning". In
- # contrast, it was always easy to figure out what s-o-w was picking up on.
- # 5-grams flagged a msg from Christian Tismer as spam, where he was discussing
- # the speed of tasklets under his new implementation of stackless:
- #
- # prob = 0.99999998959
- # prob('ed sw') = 0.01
- # prob('http0:pgp') = 0.01
- # prob('http0:python') = 0.01
- # prob('hlon ') = 0.99
- # prob('http0:wwwkeys') = 0.01
- # prob('http0:starship') = 0.01
- # prob('http0:stackless') = 0.01
- # prob('n xp ') = 0.99
- # prob('on xp') = 0.99
- # prob('p 150') = 0.99
- # prob('lon x') = 0.99
- # prob(' amd ') = 0.99
- # prob(' xp 1') = 0.99
- # prob(' athl') = 0.99
- # prob('1500+') = 0.99
- # prob('xp 15') = 0.99
- #
- # The spam decision was baffling until I realized that *all* the high-
- # probability spam 5-grams there came out of a single phrase:
- #
- # AMD Athlon XP 1500+
- #
- # So Christian was punished for using a machine lots of spam tries to sell.
- # In a classic Bayesian classifier, this probably wouldn't have
- # mattered, but Graham's throws away almost all the 5-grams from a msg,
- # saving only the about-a-dozen farthest from a neutral 0.5. So one bad
- # phrase can kill you! This appears to happen very rarely, but happened
- # more than once.
- #
- # The conclusion is that character n-grams have almost nothing to recommend
- # them under Graham's scheme: harder to work with, slower, much larger
- # database, worse results, and prone to rare mysterious disasters.
- #
- # There's one area they won hands-down: detecting spam in what I assume are
- # Asian languages. The s-o-w scheme sometimes finds only line-ends to split
- # on then, and then a "hey, this 'word' is way too big! let's ignore it"
- # gimmick kicks in, and produces no tokens at all.
- #
- # [Later: we produce character 5-grams then under the s-o-w scheme, instead
- # of ignoring the blob, but only if there are high-bit characters in the blob;
- # e.g., there's no point 5-gramming uuencoded lines, and doing so would
- # bloat the database size.]
- #
- # Interesting: despite that odd example above, the *kinds* of f-p mistakes
- # 5-grams made were very much like s-o-w made -- I recognized almost all of
- # the 5-gram f-p messages from previous s-o-w runs. For example, both
- # schemes have a particular hatred for conference announcements, although
- # s-o-w stopped hating them after folding case. But 5-grams still hate them.
- # Both schemes also hate msgs discussing HTML with examples, with about equal
- # passion. Both schemes hate brief "please subscribe [unsubscribe] me"
- # msgs, although 5-grams seems to hate them more.
-
-
- ##############################################################################
- # How to tokenize?
- #
- # I started with string.split() merely for speed. Over time I realized it
- # was making interesting context distinctions qualitatively akin to n-gram
- # schemes; e.g., "free!!" is a much stronger spam indicator than "free". But
- # unlike n-grams (whether word- or character- based) under Graham's scoring
- # scheme, this mild context dependence never seems to go over the edge in
- # giving "too much" credence to an unlucky phrase.
- #
- # OTOH, compared to "searching for words", it increases the size of the
- # database substantially, less than but close to a factor of 2. This is very
- # much less than a word bigram scheme bloats it, but as always an increase
- # isn't justified unless the results are better.
- #
- # Following are stats comparing
- #
- # for token in text.split(): # left column
- #
- # to
- #
- # for token in re.findall(r"[\w$\-\x80-\xff]+", text): # right column
- #
- # text is case-normalized (text.lower()) in both cases, and the runs were
- # identical in all other respects. The results clearly favor the split()
- # gimmick, although they vaguely suggest that some sort of compromise
- # may do as well with less database burden; e.g., *perhaps* folding runs of
- # "punctuation" characters into a canonical representative could do that.
- # But the database size is reasonable without that, and plain split() avoids
- # having to worry about how to "fold punctuation" in languages other than
- # English.
- #
- # false positive percentages
- # 0.000 0.000 tied
- # 0.000 0.050 lost
- # 0.050 0.150 lost
- # 0.000 0.025 lost
- # 0.025 0.050 lost
- # 0.025 0.075 lost
- # 0.050 0.150 lost
- # 0.025 0.000 won
- # 0.025 0.075 lost
- # 0.000 0.025 lost
- # 0.075 0.150 lost
- # 0.050 0.050 tied
- # 0.025 0.050 lost
- # 0.000 0.025 lost
- # 0.050 0.025 won
- # 0.025 0.000 won
- # 0.025 0.025 tied
- # 0.000 0.025 lost
- # 0.025 0.075 lost
- # 0.050 0.175 lost
- #
- # won 3 times
- # tied 3 times
- # lost 14 times
- #
- # total unique fp went from 8 to 20
- #
- # false negative percentages
- # 0.945 1.200 lost
- # 0.836 1.018 lost
- # 1.200 1.200 tied
- # 1.418 1.636 lost
- # 1.455 1.418 won
- # 1.091 1.309 lost
- # 1.091 1.272 lost
- # 1.236 1.563 lost
- # 1.564 1.855 lost
- # 1.236 1.491 lost
- # 1.563 1.599 lost
- # 1.563 1.781 lost
- # 1.236 1.709 lost
- # 0.836 0.982 lost
- # 0.873 1.382 lost
- # 1.236 1.527 lost
- # 1.273 1.418 lost
- # 1.018 1.273 lost
- # 1.091 1.091 tied
- # 1.490 1.454 won
- #
- # won 2 times
- # tied 2 times
- # lost 16 times
- #
- # total unique fn went from 292 to 302
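The difference between the two tokenizers above is easy to see on a sample line (the text below is invented for illustration):

```python
import re

# made-up sample line; case-normalized as in the runs above
text = "FREE!! Increase size 79% -- visit http://example.com now".lower()

split_tokens = text.split()                            # left-column scheme
regex_tokens = re.findall(r"[\w$\-\x80-\xff]+", text)  # right-column scheme

print(split_tokens)   # keeps 'free!!' and '79%' intact
print(regex_tokens)   # strips the punctuation: plain 'free' and '79'
```

The split() scheme preserves exactly the punctuation-laden tokens ("free!!", "79%") that turned out to be the stronger spam clues.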
-
-
- ##############################################################################
- # What about HTML?
- #
- # Computer geeks seem to view use of HTML in mailing lists and newsgroups as
- # a mortal sin. Normal people don't, but so it goes: in a technical list/
- # group, every HTML decoration has spamprob 0.99, there are lots of unique
- # HTML decorations, and lots of them appear at the very start of the message
- # so that Graham's scoring scheme latches on to them tight. As a result,
- # any plain text message just containing an HTML example is likely to be
- # judged spam (every HTML decoration is an extreme).
- #
- # So if a message is multipart/alternative with both text/plain and text/html
- # branches, we ignore the latter, else newbies would never get a message
- # through. If a message is just HTML, it has virtually no chance of getting
- # through.
- #
- # In an effort to let normal people use mailing lists too, and to
- # alleviate the woes of messages merely *discussing* HTML practice, I
- # added a gimmick to strip HTML tags after case-normalization and after
- # special tagging of embedded URLs. This consisted of a regexp sub pattern,
- # where instances got replaced by single blanks:
- #
- # html_re = re.compile(r"""
- # <
- # [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
- # [^>]{0,128} # search for the end '>', but don't chew up the world
- # >
- # """, re.VERBOSE)
- #
- # and then
- #
- # text = html_re.sub(' ', text)
- #
- # Alas, little good came of this:
- #
- # false positive percentages
- # 0.000 0.000 tied
- # 0.000 0.000 tied
- # 0.050 0.075 lost
- # 0.000 0.000 tied
- # 0.025 0.025 tied
- # 0.025 0.025 tied
- # 0.050 0.050 tied
- # 0.025 0.025 tied
- # 0.025 0.025 tied
- # 0.000 0.050 lost
- # 0.075 0.100 lost
- # 0.050 0.050 tied
- # 0.025 0.025 tied
- # 0.000 0.025 lost
- # 0.050 0.050 tied
- # 0.025 0.025 tied
- # 0.025 0.025 tied
- # 0.000 0.000 tied
- # 0.025 0.050 lost
- # 0.050 0.050 tied
- #
- # won 0 times
- # tied 15 times
- # lost 5 times
- #
- # total unique fp went from 8 to 12
- #
- # false negative percentages
- # 0.945 1.164 lost
- # 0.836 1.418 lost
- # 1.200 1.272 lost
- # 1.418 1.272 won
- # 1.455 1.273 won
- # 1.091 1.382 lost
- # 1.091 1.309 lost
- # 1.236 1.381 lost
- # 1.564 1.745 lost
- # 1.236 1.564 lost
- # 1.563 1.781 lost
- # 1.563 1.745 lost
- # 1.236 1.455 lost
- # 0.836 0.982 lost
- # 0.873 1.309 lost
- # 1.236 1.381 lost
- # 1.273 1.273 tied
- # 1.018 1.273 lost
- # 1.091 1.200 lost
- # 1.490 1.599 lost
- #
- # won 2 times
- # tied 1 times
- # lost 17 times
- #
- # total unique fn went from 292 to 327
- #
- # The messages merely discussing HTML were no longer fps, so it did what it
- # intended there. But the f-n rate nearly doubled on at least one run -- so
- # strong a set of spam indicators is the mere presence of HTML. The increase
- # in the number of fps despite that the HTML-discussing msgs left that
- # category remains mysterious to me, but it wasn't a significant increase
- # so I let it drop.
- #
- # Later: If I simply give up on making mailing lists friendly to my sisters
- # (they're not nerds, and create wonderfully attractive HTML msgs), a
- # compromise is to strip HTML tags from only text/plain msgs. That's
- # principled enough so far as it goes, and eliminates the HTML-discussing
- # false positives. It remains disturbing that the f-n rate on pure HTML
- # msgs increases significantly when stripping tags, so the code here doesn't
- # do that part. However, even after stripping tags, the rates above show that
- # at least 98% of spams are still correctly identified as spam.
- # XXX So, if another way is found to slash the f-n rate, the decision here
- # XXX not to strip HTML from HTML-only msgs should be revisited.
-
- url_re = re.compile(r"""
- (https? | ftp) # capture the protocol
- :// # skip the boilerplate
- # Do a reasonable attempt at detecting the end. It may or may not
- # be in HTML, may or may not be in quotes, etc. If it's full of %
- # escapes, cool -- that's a clue too.
- ([^\s<>'"\x7f-\xff]+) # capture the guts
- """, re.VERBOSE)
-
- urlsep_re = re.compile(r"[;?:@&=+,$.]")
-
- has_highbit_char = re.compile(r"[\x80-\xff]").search
-
- # Cheap-ass gimmick to probabilistically find HTML/XML tags.
- html_re = re.compile(r"""
- <
- [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
- [^>]{0,128} # search for the end '>', but don't run wild
- >
- """, re.VERBOSE)
-
- # I'm usually just splitting on whitespace, but for subject lines I want to
- # break things like "Python/Perl comparison?" up. OTOH, I don't want to
- # break up the unitized numbers in spammish subject phrases like "Increase
- # size 79%" or "Now only $29.95!". Then again, I do want to break up
- # "Python-Dev".
- subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
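The subject regexp's behavior on the cases described above (subject line invented for illustration):

```python
import re

subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")

subj = "Python/Perl comparison? Now only $29.95! Increase size 79%"
tokens = subject_word_re.findall(subj)
print(tokens)   # 'Python/Perl' splits, but '$29.95' and '79%' stay unitized
```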
-
- def tokenize_word(word, _len=len):
- n = _len(word)
-
- # XXX How big should "a word" be?
- # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
- # XXX on f-p rate, and did a little better or worse than 12 across
- # XXX runs -- overall, no significant difference. It's only "common
- # XXX sense" so far driving the exclusion of lengths 1 and 2.
-
- # Make sure this range matches in tokenize().
- if 3 <= n <= 12:
- yield word
-
- elif n >= 3:
- # A long word.
-
- # Don't want to skip embedded email addresses.
- if n < 40 and '.' in word and word.count('@') == 1:
- p1, p2 = word.split('@')
- yield 'email name:' + p1
- for piece in p2.split('.'):
- yield 'email addr:' + piece
-
- # If there are any high-bit chars,
- # tokenize it as byte 5-grams.
- # XXX This really won't work for high-bit languages -- the scoring
- # XXX scheme throws almost everything away, and one bad phrase can
- # XXX generate enough bad 5-grams to dominate the final score.
- # XXX This also increases the database size substantially.
- elif has_highbit_char(word):
- for i in xrange(n-4):
- yield "5g:" + word[i : i+5]
-
- else:
- # It's a long string of "normal" chars. Ignore it.
- # For example, it may be an embedded URL (which we already
- # tagged), or a uuencoded line.
- # There's value in generating a token indicating roughly how
- # many chars were skipped. This has real benefit for the f-n
- # rate, but is neutral for the f-p rate. I don't know why!
- # XXX Figure out why, and/or see if some other way of summarizing
- # XXX this info has greater benefit.
- yield "skip:%c %d" % (word[0], n // 10 * 10)
-
- # Generate tokens for:
- # Content-Type
- # and its type= param
- # Content-Disposition
- # and its filename= param
- # Content-Transfer-Encoding
- # all the charsets
- #
- # This has huge benefit for the f-n rate, and virtually none on the f-p rate,
- # although it does reduce the variance of the f-p rate across different
- # training sets (really marginal msgs, like a brief HTML msg saying just
- # "unsubscribe me", are almost always tagged as spam now; before they were
- # right on the edge, and now the multipart/alternative pushes them over it
- # more consistently).
- #
- # XXX I put all of this in as one chunk. I don't know which parts are
- # XXX most effective; it could be that some parts don't help at all. But
- # XXX given the nature of the c.l.py tests, it's not surprising that the
- # XXX 'content-type:text/html'
- # XXX token is now the single most powerful spam indicator (== makes it
- # XXX into the nbest list most often). What *is* a little surprising is
- # XXX that this doesn't push more mixed-type msgs into the f-p camp --
- # XXX unlike looking at *all* HTML tags, this is just one spam indicator
- # XXX instead of dozens, so relevant msg content can cancel it out.
- def crack_content_xyz(msg):
- x = msg.get_type()
- if x is not None:
- yield 'content-type:' + x.lower()
-
- x = msg.get_param('type')
- if x is not None:
- yield 'content-type/type:' + x.lower()
-
- for x in msg.get_charsets(None):
- if x is not None:
- yield 'charset:' + x.lower()
-
- x = msg.get('content-disposition')
- if x is not None:
- yield 'content-disposition:' + x.lower()
-
- fname = msg.get_filename()
- if fname is not None:
- for x in fname.lower().split('/'):
- for y in x.split('.'):
- yield 'filename:' + y
-
- x = msg.get('content-transfer-encoding:')
- if x is not None:
- yield 'content-transfer-encoding:' + x.lower()
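The kind of tokens crack_content_xyz() yields can be seen on a tiny hand-built message (sketched with the modern get_content_type() spelling; the code above uses the older get_type()):

```python
import email

raw = ("Content-Type: text/html; charset=us-ascii\n"
       "\n"
       "<b>hello</b>\n")
msg = email.message_from_string(raw)

tokens = ['content-type:' + msg.get_content_type()]
tokens += ['charset:' + c.lower() for c in msg.get_charsets(None) if c]
print(tokens)   # ['content-type:text/html', 'charset:us-ascii']
```

The 'content-type:text/html' token is the one the comment block above singles out as the most powerful single spam indicator on the c.l.py test data.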
-
- def tokenize(string):
- # Create an email Message object.
- try:
- msg = message_from_string(string)
- except email.Errors.MessageParseError:
- yield 'control: MessageParseError'
- # XXX Fall back to the raw body text?
- return
-
- # Special tagging of header lines.
- # XXX TODO Neil Schemenauer has gotten a good start on this (pvt email).
- # XXX The headers in my spam and ham corpora are so different (they came
- # XXX from different sources) that if I include them the classifier's
- # XXX job is trivial. Only some "safe" header lines are included here,
- # XXX where "safe" is specific to my sorry corpora.
-
- # Content-{Transfer-Encoding, Type, Disposition} and their params.
- t = ''
- for x in msg.walk():
- for w in crack_content_xyz(x):
- yield t + w
- t = '>'
-
- # Subject:
- # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
- # especially significant in this context. Experiment showed a small
- # but real benefit to keeping case intact in this specific context.
- x = msg.get('subject', '')
- for w in subject_word_re.findall(x):
- for t in tokenize_word(w):
- yield 'subject:' + t
-
- # Dang -- I can't use Sender:. If I do,
- # 'sender:email name:python-list-admin'
- # becomes the most powerful indicator in the whole database.
- #
- # From:
- # Reply-To:
- for field in ('from',):# 'reply-to',):
- prefix = field + ':'
- x = msg.get(field, 'none').lower()
- for w in x.split():
- for t in tokenize_word(w):
- yield prefix + t
-
- # These headers seem to work best if they're not tokenized: just
- # normalize case and whitespace.
- # X-Mailer: This is a pure and significant win for the f-n rate; f-p
- # rate isn't affected.
- # User-Agent: Skipping it, as it made no difference. Very few spams
- # had a User-Agent field, but lots of hams didn't either,
- # and the spam probability of User-Agent was very close to
- # 0.5 (== not a valuable discriminator) across all training
- # sets.
- for field in ('x-mailer',):
- prefix = field + ':'
- x = msg.get(field, 'none').lower()
- yield prefix + ' '.join(x.split())
-
- # Organization:
- # Oddly enough, tokenizing this doesn't make any difference to results.
- # However, noting its mere absence is strong enough to give a tiny
- # improvement in the f-n rate, and since recording that requires only
- # one token across the whole database, the cost is also tiny.
- if msg.get('organization', None) is None:
- yield "bool:noorg"
-
- # XXX Following is a great idea due to Anthony Baxter. I can't use it
- # XXX on my test data because the header lines are so different between
- # XXX my ham and spam that it makes a large improvement for bogus
- # XXX reasons. So it's commented out. But it's clearly a good thing
- # XXX to do on "normal" data, and subsumes the Organization trick above
- # XXX in a much more general way, yet at comparable cost.
- ### X-UIDL:
- ### Anthony Baxter's idea. This has spamprob 0.99! The value is clearly
- ### irrelevant, just the presence or absence matters. However, it's
- ### extremely rare in my spam sets, so doesn't have much value.
- ###
- ### As also suggested by Anthony, we can capture all such header oddities
- ### just by generating tags for the count of how many times each header
- ### field appears.
- ##x2n = {}
- ##for x in msg.keys():
- ## x2n[x] = x2n.get(x, 0) + 1
- ##for x in x2n.items():
- ## yield "header:%s:%d" % x
-
- # Find, decode (base64, qp), and tokenize the textual parts of the body.
- for part in textparts(msg):
- # Decode, or take it as-is if decoding fails.
- try:
- text = part.get_payload(decode=True)
- except:
- yield "control: couldn't decode"
- text = part.get_payload(decode=False)
-
- if text is None:
- yield 'control: payload is None'
- continue
-
- # Normalize case.
- text = text.lower()
-
- # Special tagging of embedded URLs.
- for proto, guts in url_re.findall(text):
- yield "proto:" + proto
- # Lose the trailing punctuation for casual embedding, like:
- # The code is at http://mystuff.org/here? Didn't resolve.
- # or
- # I found it at http://mystuff.org/there/. Thanks!
- assert guts
- while guts and guts[-1] in '.:?!/':
- guts = guts[:-1]
- for i, piece in enumerate(guts.split('/')):
- prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
- for chunk in urlsep_re.split(piece):
- yield prefix + chunk
-
- # Remove HTML/XML tags if it's a plain text message.
- if part.get_content_type() == "text/plain":
- text = html_re.sub(' ', text)
-
- # Tokenize everything.
- for w in text.split():
- n = len(w)
- # Make sure this range matches in tokenize_word().
- if 3 <= n <= 12:
- yield w
-
- elif n >= 3:
- for t in tokenize_word(w):
- yield t
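The URL-cracking loop at the end of tokenize() can be pulled out into a stand-alone sketch (same regexps and logic, sample sentence invented here) to see the tokens it produces:

```python
import re

url_re = re.compile(r"""
    (https? | ftp)          # capture the protocol
    ://                     # skip the boilerplate
    ([^\s<>'"\x7f-\xff]+)   # capture the guts
""", re.VERBOSE)

urlsep_re = re.compile(r"[;?:@&=+,$.]")

def url_tokens(text):
    # mirror of the URL tagging in tokenize(), for illustration
    for proto, guts in url_re.findall(text.lower()):
        yield "proto:" + proto
        # lose the trailing punctuation of casual embedding
        while guts and guts[-1] in '.:?!/':
            guts = guts[:-1]
        for i, piece in enumerate(guts.split('/')):
            prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
            for chunk in urlsep_re.split(piece):
                yield prefix + chunk

toks = list(url_tokens("I found it at http://mystuff.org/there/. Thanks!"))
print(toks)
```

The host pieces get position-tagged prefixes (http0:, http1:, http>1:), so "mystuff" near the front of a URL is a different token than "mystuff" buried deep in a path.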
class Msg(object):
--- 55,58 ----
From tim_one@users.sourceforge.net Fri Sep 6 20:13:02 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 12:13:02 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.6,1.7 timtoken.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11785
Modified Files:
timtest.py timtoken.py
Log Message:
crack_content_xyz(): A bug prevented Content-Transfer-Encoding from
getting picked up. Fixed the bug, and then experiment showed it didn't
help, so disabled the corrected code and added a comment block explaining
why it's disabled.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** timtest.py 6 Sep 2002 17:33:26 -0000 1.6
--- timtest.py 6 Sep 2002 19:12:59 -0000 1.7
***************
*** 103,109 ****
trained_spam_hist = Hist(nbuckets)
! #fp = file('w.pik', 'wb')
! #pickle.dump(c, fp, 1)
! #fp.close()
for sd2, hd2 in SPAMHAMDIRS:
--- 103,109 ----
trained_spam_hist = Hist(nbuckets)
! fp = file('w.pik', 'wb')
! pickle.dump(c, fp, 1)
! fp.close()
for sd2, hd2 in SPAMHAMDIRS:
Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** timtoken.py 6 Sep 2002 17:33:26 -0000 1.1
--- timtoken.py 6 Sep 2002 19:12:59 -0000 1.2
***************
*** 433,437 ****
# Content-Disposition
# and its filename= param
- # Content-Transfer-Encoding
# all the charsets
#
--- 433,436 ----
***************
*** 452,455 ****
--- 451,516 ----
# XXX unlike looking at *all* HTML tags, this is just one spam indicator
# XXX instead of dozens, so relevant msg content can cancel it out.
+ #
+ # A bug in this code prevented Content-Transfer-Encoding from getting
+ # picked up. Fixing that bug showed that it didn't help, so the corrected
+ # code is disabled now (left column without Content-Transfer-Encoding,
+ # right column with it):
+ #
+ # false positive percentages
+ # 0.000 0.000 tied
+ # 0.000 0.000 tied
+ # 0.100 0.100 tied
+ # 0.000 0.000 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.100 0.100 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.050 0.050 tied
+ # 0.100 0.100 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.025 0.025 tied
+ # 0.000 0.025 lost +(was 0)
+ # 0.025 0.025 tied
+ # 0.100 0.100 tied
+ #
+ # won 0 times
+ # tied 19 times
+ # lost 1 times
+ #
+ # total unique fp went from 9 to 10
+ #
+ # false negative percentages
+ # 0.364 0.400 lost +9.89%
+ # 0.400 0.364 won -9.00%
+ # 0.400 0.436 lost +9.00%
+ # 0.909 0.872 won -4.07%
+ # 0.836 0.836 tied
+ # 0.618 0.618 tied
+ # 0.291 0.291 tied
+ # 1.018 0.981 won -3.63%
+ # 0.982 0.982 tied
+ # 0.727 0.727 tied
+ # 0.800 0.800 tied
+ # 1.163 1.127 won -3.10%
+ # 0.764 0.836 lost +9.42%
+ # 0.473 0.473 tied
+ # 0.473 0.618 lost +30.66%
+ # 0.727 0.763 lost +4.95%
+ # 0.655 0.618 won -5.65%
+ # 0.509 0.473 won -7.07%
+ # 0.545 0.582 lost +6.79%
+ # 0.509 0.509 tied
+ #
+ # won 6 times
+ # tied 8 times
+ # lost 6 times
+ #
+ # total unique fn went from 168 to 169
+
def crack_content_xyz(msg):
x = msg.get_type()
***************
*** 475,481 ****
yield 'filename:' + y
! x = msg.get('content-transfer-encoding:')
! if x is not None:
! yield 'content-transfer-encoding:' + x.lower()
def tokenize(string):
--- 536,543 ----
yield 'filename:' + y
! if 0: # disabled; see comment before function
! x = msg.get('content-transfer-encoding')
! if x is not None:
! yield 'content-transfer-encoding:' + x.lower()
def tokenize(string):
***************
*** 495,499 ****
# XXX where "safe" is specific to my sorry corpora.
! # Content-{Transfer-Encoding, Type, Disposition} and their params.
t = ''
for x in msg.walk():
--- 557,561 ----
# XXX where "safe" is specific to my sorry corpora.
! # Content-{Type, Disposition} and their params, and charsets.
t = ''
for x in msg.walk():
***************
*** 601,605 ****
text = html_re.sub(' ', text)
! # Tokenize everything.
for w in text.split():
n = len(w)
--- 663,667 ----
text = html_re.sub(' ', text)
! # Tokenize everything in the body.
for w in text.split():
n = len(w)
From jhylton@users.sourceforge.net Fri Sep 6 20:26:36 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Fri, 06 Sep 2002 12:26:36 -0700
Subject: [Spambayes-checkins] spambayes mboxtest.py,NONE,1.1
timtest.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16790
Modified Files:
timtest.py
Added Files:
mboxtest.py
Log Message:
Add a test driver that works with mboxes.
This is similar in spirit to timtest, but it works with any old kind
of mailbox recognized by the Python mailbox module.
One non-trivial difference from timtest: Rather than requiring that
the user split the mailbox into separate parts, it selects NSETS
different subsets of the mailbox to use for testing. It chooses an
arbitrary subset because my mailboxes are sorted by date, and I didn't
want to bias tests by choosing training data from a small period of
time.
The timtest module has grown a Driver() class that is intended to work
just like the drive() function, but with a bit more flexibility. The
jdrive() function might be able to replace drive(), but I can't test
it so I'm not going to replace it. Maybe Tim will try jdrive() and
report if it works correctly.
I didn't find the MsgStream() class useful outside of timtest, but
mailboxes are represented by the mbox class, which is an iterable
collection of Msg objects.
Renamed the path attribute of Msg to tag, since path doesn't make
sense with an mbox. The path was getting used as a human-readable tag
for messages, so I synthesized one for mbox messages.
--- NEW FILE: mboxtest.py ---
#! /usr/bin/env python
from timtoken import tokenize
from classifier import GrahamBayes
from Tester import Test
from timtest import Driver, Msg
import getopt
import mailbox
import random
from sets import Set
import sys
mbox_fmts = {"unix": mailbox.PortableUnixMailbox,
"mmdf": mailbox.MmdfMailbox,
"mh": mailbox.MHMailbox,
"qmail": mailbox.Maildir,
}
class MboxMsg(Msg):
def __init__(self, fp, path, index):
self.guts = fp.read()
self.tag = "%s:%s %s" % (path, index, subject(self.guts))
class mbox(object):
def __init__(self, path, indices=None):
self.path = path
self.indices = {}
self.key = ''
if indices is not None:
self.key = " %s" % indices[0]
for i in indices:
self.indices[i] = 1
def __repr__(self):
return "" % (self.path, self.key)
def __iter__(self):
# Use a simple factory that just produces a string.
mbox = mbox_fmts[FMT](open(self.path, "rb"),
lambda f: MboxMsg(f, self.path, i))
i = 0
while 1:
msg = mbox.next()
if msg is None:
return
i += 1
if self.indices.get(i-1) or not self.indices:
yield msg
def subject(buf):
buf = buf.lower()
i = buf.find('subject:')
j = buf.find("\n", i)
return buf[i:j]
def randindices(nelts, nresults):
L = range(nelts)
random.shuffle(L)
chunk = nelts / nresults
for i in range(nresults):
yield Set(L[:chunk])
del L[:chunk]
def sort(seq):
L = list(seq)
L.sort()
return L
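The randindices() scheme above deals the shuffled message indices out into disjoint chunks, so no message lands in two test subsets. A quick sketch of the same idea (builtin set swapped for the old sets.Set):

```python
import random

def randindices(nelts, nresults):
    # shuffle all indices, then deal out nresults disjoint chunks
    L = list(range(nelts))
    random.shuffle(L)
    chunk = nelts // nresults
    for i in range(nresults):
        yield set(L[:chunk])
        del L[:chunk]

random.seed(101)
subsets = list(randindices(10, 5))
print(subsets)   # five disjoint 2-element sets covering 0..9
```

Because the chunks are drawn from one shuffle of the whole mailbox, each subset spans the full date range, which is exactly the bias this driver is trying to avoid.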
def main(args):
global FMT
FMT = "unix"
NSETS = 5
SEED = 101
LIMIT = None
opts, args = getopt.getopt(args, "f:n:s:l:")
for k, v in opts:
if k == '-f':
FMT = v
if k == '-n':
NSETS = int(v)
if k == '-s':
SEED = int(v)
if k == '-l':
LIMIT = int(v)
ham, spam = args
random.seed(SEED)
nham = len(list(mbox(ham)))
nspam = len(list(mbox(spam)))
if LIMIT:
nham = min(nham, LIMIT)
nspam = min(nspam, LIMIT)
print "ham", ham, nham
print "spam", spam, nspam
testsets = []
for iham in randindices(nham, NSETS):
for ispam in randindices(nspam, NSETS):
testsets.append((sort(iham), sort(ispam)))
driver = Driver()
for iham, ispam in testsets:
driver.train(mbox(ham, iham), mbox(spam, ispam))
for ihtest, istest in testsets:
if (iham, ispam) == (ihtest, istest):
continue
driver.test(mbox(ham, ihtest), mbox(spam, istest))
driver.finish()
driver.alldone()
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** timtest.py 6 Sep 2002 19:12:59 -0000 1.7
--- timtest.py 6 Sep 2002 19:26:34 -0000 1.8
***************
*** 59,63 ****
def __init__(self, dir, name):
path = dir + "/" + name
! self.path = path
f = open(path, 'rb')
guts = f.read()
--- 59,63 ----
def __init__(self, dir, name):
path = dir + "/" + name
! self.tag = path
f = open(path, 'rb')
guts = f.read()
***************
*** 69,76 ****
def __hash__(self):
! return hash(self.path)
def __eq__(self, other):
! return self.path == other.path
class MsgStream(object):
--- 69,76 ----
def __hash__(self):
! return hash(self.tag)
def __eq__(self, other):
! return self.tag == other.tag
class MsgStream(object):
***************
*** 86,89 ****
--- 86,198 ----
return self.produce()
+ class Driver:
+
+ def __init__(self):
+ self.nbuckets = 40
+ self.falsepos = Set()
+ self.falseneg = Set()
+ self.global_ham_hist = Hist(self.nbuckets)
+ self.global_spam_hist = Hist(self.nbuckets)
+
+ def train(self, ham, spam):
+ self.classifier = classifier.GrahamBayes()
+ self.tester = Tester.Test(self.classifier)
+ print "Training on", ham, "&", spam, "..."
+ self.tester.train(ham, spam)
+
+ self.trained_ham_hist = Hist(self.nbuckets)
+ self.trained_spam_hist = Hist(self.nbuckets)
+
+ def finish(self):
+ printhist("all in this set:",
+ self.trained_ham_hist, self.trained_spam_hist)
+ self.global_ham_hist += self.trained_ham_hist
+ self.global_spam_hist += self.trained_spam_hist
+
+ def alldone(self):
+ printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
+
+ def test(self, ham, spam):
+ c = self.classifier
+ t = self.tester
+ local_ham_hist = Hist(self.nbuckets)
+ local_spam_hist = Hist(self.nbuckets)
+
+ def new_ham(msg, prob):
+ local_ham_hist.add(prob)
+
+ def new_spam(msg, prob):
+ local_spam_hist.add(prob)
+ if prob < 0.1:
+ print
+ print "Low prob spam!", prob
+ print msg.tag
+ prob, clues = c.spamprob(msg, True)
+ for clue in clues:
+ print "prob(%r) = %g" % clue
+ print
+ print msg.guts
+
+ t.reset_test_results()
+ print " testing against", ham, "&", spam, "...",
+ t.predict(spam, True, new_spam)
+ t.predict(ham, False, new_ham)
+ print t.nham_tested, "hams &", t.nspam_tested, "spams"
+
+ print " false positive:", t.false_positive_rate()
+ print " false negative:", t.false_negative_rate()
+
+ newfpos = Set(t.false_positives()) - self.falsepos
+ self.falsepos |= newfpos
+ print " new false positives:", [e.tag for e in newfpos]
+ for e in newfpos:
+ print '*' * 78
+ print e.tag
+ prob, clues = c.spamprob(e, True)
+ print "prob =", prob
+ for clue in clues:
+ print "prob(%r) = %g" % clue
+ print
+ print e.guts
+
+ newfneg = Set(t.false_negatives()) - self.falseneg
+ self.falseneg |= newfneg
+ print " new false negatives:", [e.tag for e in newfneg]
+ for e in []:#newfneg:
+ print '*' * 78
+ print e.tag
+ prob, clues = c.spamprob(e, True)
+ print "prob =", prob
+ for clue in clues:
+ print "prob(%r) = %g" % clue
+ print
+ print e.guts[:1000]
+
+ print
+ print " best discriminators:"
+ stats = [(r.killcount, w) for w, r in c.wordinfo.iteritems()]
+ stats.sort()
+ del stats[:-30]
+ for count, w in stats:
+ r = c.wordinfo[w]
+ print " %r %d %g" % (w, r.killcount, r.spamprob)
+
+
+ printhist("this pair:", local_ham_hist, local_spam_hist)
+
+ self.trained_ham_hist += local_ham_hist
+ self.trained_spam_hist += local_spam_hist
+
+ def jdrive():
+ d = Driver()
+
+ for spamdir, hamdir in SPAMHAMDIRS:
+ d.train(MsgStream(hamdir), MsgStream(spamdir))
+ for sd2, hd2 in SPAMHAMDIRS:
+ if (sd2, hd2) == (spamdir, hamdir):
+ continue
+ d.test(MsgStream(hd2), MsgStream(sd2))
+ d.finish()
+ d.alldone()
def drive():
***************
*** 185,187 ****
printhist("all runs:", global_ham_hist, global_spam_hist)
! drive()
--- 294,297 ----
printhist("all runs:", global_ham_hist, global_spam_hist)
! if __name__ == "__main__":
! drive()
From rubiconx@users.sourceforge.net Fri Sep 6 20:29:58 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Fri, 06 Sep 2002 12:29:58 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,NONE,1.1
classifier.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16269
Modified Files:
classifier.py
Added Files:
hammie.py
Log Message:
First stab at a procmail-ready application of all this great code.
The dbm method doesn't work for me yet, but you can use it as-is
with the pickle method, just invoke it from procmail with the -f
option.
I had to make a minor change to the classifier so it would write back
modified values to the database. I suppose I could have done this
by subclassing WordInfo's __setattr__ with a callback to the
containing PersistentGrahamBayes class, but this way is cleaner and
should incur no noticeable penalty for the original GrahamBayes
class. I hope this is okay with Tim :^)
--- NEW FILE: hammie.py ---
#! /usr/bin/env python
# A driver for the classifier module. Currently mostly a wrapper around
# existing stuff.
"""Usage: %(program)s [options]
Where:
-h
show usage and exit
-g PATH
mbox or directory of known good messages (non-spam)
-s PATH
mbox or directory of known spam messages
-p FILE
use file as the persistent store. loads data from this file if it
exists, and saves data to this file at the end. Default: hammie.db
-d
use the DBM store instead of cPickle. The file is larger and
creating it is slower, but checking against it is much faster,
especially for large word databases.
-f
run as a filter: read a single message from stdin, add an
X-Spam-Disposition header, and write it to stdout.
"""
import sys
import os
import stat
import getopt
import mailbox
import email
import classifier
import errno
import anydbm
import cPickle as pickle
program = sys.argv[0]
# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
from timtoken import tokenize
class DBDict:
"""Database Dictionary
This wraps an anydbm to make it look even more like a dictionary.
Call it with the name of your database file. Optionally, you can
specify a list of keys to skip when iterating. This only affects
iterators; things like .keys() still list everything. For instance:
>>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo'))
>>> d['skipme'] = 'booga'
>>> d['countme'] = 'wakka'
>>> print d.keys()
['skipme', 'countme']
>>> for k in d.iterkeys():
... print k
countme
"""
def __init__(self, dbname, iterskip=()):
self.hash = anydbm.open(dbname, 'c')
self.iterskip = iterskip
def __getitem__(self, key):
if self.hash.has_key(key):
return pickle.loads(self.hash[key])
else:
raise KeyError(key)
def __setitem__(self, key, val):
v = pickle.dumps(val, 1)
self.hash[key] = v
def __delitem__(self, key):
del(self.hash[key])
def __iter__(self, fn=None):
k = self.hash.first()
while k != None:
key = k[0]
val = pickle.loads(k[1])
if key not in self.iterskip:
if fn:
yield fn((key, val))
else:
yield (key, val)
try:
k = self.hash.next()
except KeyError:
break
def __contains__(self, name):
return self.has_key(name)
def __getattr__(self, name):
# Pass the buck
return getattr(self.hash, name)
def get(self, key, dfl=None):
if self.has_key(key):
return self[key]
else:
return dfl
def iteritems(self):
return self.__iter__()
def iterkeys(self):
return self.__iter__(lambda k: k[0])
def itervalues(self):
return self.__iter__(lambda k: k[1])
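The iterskip behavior is the part worth understanding: iteration hides the bookkeeping keys, but direct lookup still reaches them. A minimal in-memory analogue (SkipDict is a made-up name; the real DBDict wraps an anydbm file):

```python
import pickle

class SkipDict:
    """In-memory sketch of DBDict's iterskip behavior."""

    def __init__(self, iterskip=()):
        self.hash = {}          # stands in for the anydbm file
        self.iterskip = iterskip

    def __setitem__(self, key, val):
        # values are pickled, as in DBDict
        self.hash[key] = pickle.dumps(val, 1)

    def __getitem__(self, key):
        return pickle.loads(self.hash[key])

    def __iter__(self):
        for key, val in self.hash.items():
            if key not in self.iterskip:
                yield key, pickle.loads(val)

d = SkipDict(iterskip=('saved state',))
d['saved state'] = (3, 4)   # hidden from iteration ...
d['free'] = 0.99
print(list(d))              # only ('free', 0.99)
print(d['saved state'])     # ... but still reachable by key
```

This is what lets PersistentGrahamBayes stash (nham, nspam) under a special key in the same database without that key ever looking like a word to the classifier.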
class PersistentGrahamBayes(classifier.GrahamBayes):
"""A persistent GrahamBayes classifier
This is just like classifier.GrahamBayes, except that the dictionary
is a database. You take less disk this way, I think, and you can
pretend it's persistent. It's much slower training, but much faster
checking, and takes less memory all around.
On destruction, an instantiation of this class will write its state
to a special key. When you instantiate a new one, it will attempt
to read these values out of that key again, so you can pick up where
you left off.
"""
# XXX: Would it be even faster to remember (in a list) which keys
# had been modified, and only recalculate those keys? No sense in
# going over the entire word database if only 100 words are
# affected.
# XXX: Another idea: cache stuff in memory. But by then maybe we
# should just use ZODB.
def __init__(self, dbname):
classifier.GrahamBayes.__init__(self)
self.statekey = "saved state"
self.wordinfo = DBDict(dbname, (self.statekey,))
self.restore_state()
def __del__(self):
#super.__del__(self)
self.save_state()
def save_state(self):
self.wordinfo[self.statekey] = (self.nham, self.nspam)
def restore_state(self):
if self.wordinfo.has_key(self.statekey):
self.nham, self.nspam = self.wordinfo[self.statekey]
def train(bayes, msgs, is_spam):
    """Train bayes with a message"""

    def _factory(fp):
        try:
            return email.message_from_file(fp)
        except email.Errors.MessageParseError:
            return ''

    if stat.S_ISDIR(os.stat(msgs)[stat.ST_MODE]):
        mbox = mailbox.MHMailbox(msgs, _factory)
    else:
        fp = open(msgs)
        mbox = mailbox.PortableUnixMailbox(fp, _factory)

    i = 0
    for msg in mbox:
        i += 1
        # XXX: Is the \r a Unixism?  I seem to recall it working in DOS
        # back in the day.  Maybe it's a line-printer-ism ;)
        sys.stdout.write("\r%6d" % i)
        sys.stdout.flush()
        bayes.learn(tokenize(str(msg)), is_spam, False)
    print
def filter(bayes, input, output):
    """Filter (judge) a message"""
    msg = email.message_from_file(input)
    prob, clues = bayes.spamprob(tokenize(str(msg)), True)
    if prob < 0.9:
        disp = "No"
    else:
        disp = "Yes"
    disp += "; %.2f" % prob
    disp += "; " + "; ".join(map(lambda x: "%s: %.2f" % (`x[0]`, x[1]), clues))
    msg.add_header("X-Spam-Disposition", disp)
    output.write(str(msg))

def usage(code, msg=''):
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)
def main():
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:')
    except getopt.error, msg:
        usage(1, msg)

    if not opts:
        usage(0, "No options given")

    pck = "hammie.db"
    good = spam = None
    do_filter = usedb = False
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-g':
            good = arg
        elif opt == '-s':
            spam = arg
        elif opt == '-p':
            pck = arg
        elif opt == "-d":
            usedb = True
        elif opt == "-f":
            do_filter = True
    if args:
        usage(1)

    save = False

    if usedb:
        bayes = PersistentGrahamBayes(pck)
    else:
        bayes = None
        try:
            fp = open(pck, 'rb')
        except IOError, e:
            if e.errno <> errno.ENOENT: raise
        else:
            bayes = pickle.load(fp)
            fp.close()
        if bayes is None:
            bayes = classifier.GrahamBayes()

    if good:
        print "Training ham:"
        train(bayes, good, False)
        save = True
    if spam:
        print "Training spam:"
        train(bayes, spam, True)
        save = True

    if save:
        bayes.update_probabilities()
        if not usedb and pck:
            fp = open(pck, 'wb')
            pickle.dump(bayes, fp, 1)
            fp.close()

    if do_filter:
        filter(bayes, sys.stdin, sys.stdout)

if __name__ == "__main__":
    main()
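[Editorial note: the DBDict class above rests on one trick: dbm-style databases store only strings, so values are pickled on the way in and unpickled on the way out. A minimal modern-Python sketch of that pattern follows; `PickleShelf` is a hypothetical name, not the project's code.]

```python
import dbm.dumb   # always available; any dbm flavor would do
import os
import pickle
import tempfile

class PickleShelf:
    """Dict-like wrapper that pickles values into a dbm database,
    mirroring DBDict's __getitem__/__setitem__ (hypothetical name)."""

    def __init__(self, dbname):
        # 'c' creates the database if it doesn't exist, as in DBDict.
        self.db = dbm.dumb.open(dbname, 'c')

    def __setitem__(self, key, val):
        # dbm stores only strings/bytes, so serialize the value first.
        self.db[key] = pickle.dumps(val)

    def __getitem__(self, key):
        return pickle.loads(self.db[key])

path = os.path.join(tempfile.mkdtemp(), "wordinfo")
shelf = PickleShelf(path)
shelf["free"] = (3, 17)        # e.g. (hamcount, spamcount) for a word
print(shelf["free"])           # -> (3, 17)
```

The standard library's shelve module packages up exactly this dbm-plus-pickle combination.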
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** classifier.py 5 Sep 2002 16:16:43 -0000 1.1
--- classifier.py 6 Sep 2002 19:29:56 -0000 1.2
***************
*** 473,477 ****
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
! for record in self.wordinfo.itervalues():
# Compute prob(msg is spam | msg contains word).
hamcount = HAMBIAS * record.hamcount
--- 473,477 ----
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
! for word,record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
hamcount = HAMBIAS * record.hamcount
***************
*** 488,493 ****
elif prob > MAX_SPAMPROB:
prob = MAX_SPAMPROB
!
! record.spamprob = prob
if self.DEBUG:
--- 488,494 ----
elif prob > MAX_SPAMPROB:
prob = MAX_SPAMPROB
! if record.spamprob != prob:
! record.spamprob = prob
! self.wordinfo[word] = record
if self.DEBUG:
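[Editorial note: the write-back in this diff matters because a database-backed wordinfo hands out fresh unpickled copies: mutating the returned record never reaches the disk. A small sketch of the failure mode and the fix, with toy class names rather than the project's code:]

```python
import pickle

class PickledDict:
    """Toy stand-in for a dbm-backed mapping: every lookup returns a
    fresh unpickled copy, as with hammie's DBDict."""
    def __init__(self):
        self.raw = {}
    def __setitem__(self, key, val):
        self.raw[key] = pickle.dumps(val)
    def __getitem__(self, key):
        return pickle.loads(self.raw[key])

class Record:
    def __init__(self):
        self.spamprob = 0.5

info = PickledDict()
info["free"] = Record()

# Mutating the returned copy does NOT persist -- the store never sees it:
info["free"].spamprob = 0.99
print(info["free"].spamprob)   # -> 0.5

# The fix from the diff above: read, mutate, then write back.
record = info["free"]
record.spamprob = 0.99
info["free"] = record
print(info["free"].spamprob)   # -> 0.99
```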
From tim.one@comcast.net Fri Sep 6 21:00:53 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 16:00:53 -0400
Subject: [Spambayes-checkins] spambayes hammie.py,NONE,1.1
classifier.py,1.1,1.2
In-Reply-To:
Message-ID:
> Added Files:
> hammie.py
Please add a short blurb about new files to README.txt.
From tim.one@comcast.net Fri Sep 6 21:01:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 16:01:20 -0400
Subject: [Spambayes-checkins] spambayes mboxtest.py,NONE,1.1
timtest.py,1.7,1.8
In-Reply-To:
Message-ID:
> Added Files:
> mboxtest.py
Please add a short blurb about new files to README.txt.
From jhylton@users.sourceforge.net Fri Sep 6 21:08:16 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Fri, 06 Sep 2002 13:08:16 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32684
Modified Files:
README.txt
Log Message:
Add entry for mboxtest.py.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** README.txt 6 Sep 2002 17:33:25 -0000 1.4
--- README.txt 6 Sep 2002 20:08:14 -0000 1.5
***************
*** 27,32 ****
of false positives and false negatives.
timtoken.py
! Am implementation of tokenize() that Tim can't seem to help but keep
working on .
--- 27,38 ----
of false positives and false negatives.
+ mboxtest.py
+ A concrete test driver like timtest.py (see below), but working
+ with a pair of mailbox files rather than the specialized timtest
+ setup. Note that the validity of results from mboxtest.py have
+ yet to be confirmed.
+
timtoken.py
! An implementation of tokenize() that Tim can't seem to help but keep
working on .
From gvanrossum@users.sourceforge.net Fri Sep 6 21:12:07 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 13:12:07 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1852
Modified Files:
hammie.py
Log Message:
Use os.path.isdir() to test for directory-ness.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** hammie.py 6 Sep 2002 19:29:56 -0000 1.1
--- hammie.py 6 Sep 2002 20:12:05 -0000 1.2
***************
*** 27,31 ****
import sys
import os
- import stat
import getopt
import mailbox
--- 27,30 ----
***************
*** 167,171 ****
return ''
! if stat.S_ISDIR(os.stat(msgs)[stat.ST_MODE]):
mbox = mailbox.MHMailbox(msgs, _factory)
else:
--- 166,170 ----
return ''
! if os.path.isdir(msgs):
mbox = mailbox.MHMailbox(msgs, _factory)
else:
From rubiconx@users.sourceforge.net Fri Sep 6 21:13:34 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Fri, 06 Sep 2002 13:13:34 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1676
Modified Files:
README.txt
Log Message:
Add short blurb about hammie.py
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** README.txt 6 Sep 2002 20:08:14 -0000 1.5
--- README.txt 6 Sep 2002 20:13:31 -0000 1.6
***************
*** 27,30 ****
--- 27,34 ----
of false positives and false negatives.
+ hammie.py
+ A spamassassin-like filter which uses timtoken (below) and
+ classifier (above). Needs to be made faster, especially for writes.
+
mboxtest.py
A concrete test driver like timtest.py (see below), but working
From gvanrossum@users.sourceforge.net Fri Sep 6 21:23:18 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 13:23:18 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5367
Modified Files:
hammie.py
Log Message:
Add a hack to train directly on a mailbox full of .txt files, like
Bruce Guenter's spam archive at http://www.em.ca/~bruceg/spam/.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** hammie.py 6 Sep 2002 20:12:05 -0000 1.2
--- hammie.py 6 Sep 2002 20:23:16 -0000 1.3
***************
*** 25,32 ****
--- 25,35 ----
"""
+ from __future__ import generators
+
import sys
import os
import getopt
import mailbox
+ import glob
import email
import classifier
***************
*** 158,161 ****
--- 161,182 ----
+ class DirOfTxtFileMailbox:
+
+ """Mailbox directory consisting of .txt files."""
+
+ def __init__(self, dirname, factory):
+ self.names = glob.glob(os.path.join(dirname, "*.txt"))
+ self.factory = factory
+
+ def __iter__(self):
+ for name in self.names:
+ try:
+ f = open(name)
+ except IOError:
+ continue
+ yield self.factory(f)
+ f.close()
+
+
def train(bayes, msgs, is_spam):
"""Train bayes with a message"""
***************
*** 167,171 ****
if os.path.isdir(msgs):
! mbox = mailbox.MHMailbox(msgs, _factory)
else:
fp = open(msgs)
--- 188,197 ----
if os.path.isdir(msgs):
! # XXX This is bogus: use an MHMailbox if the pathname contains /Mail/
! # XXX Should really use '+foo' MH folder styles. Later.
! if msgs.find("/Mail/") >= 0:
! mbox = mailbox.MHMailbox(msgs, _factory)
! else:
! mbox = DirOfTxtFileMailbox(msgs, _factory)
else:
fp = open(msgs)
From gvanrossum@users.sourceforge.net Fri Sep 6 21:42:47 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 13:42:47 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11926
Modified Files:
hammie.py
Log Message:
train(): recognize '+foo' as the name of MH folder 'foo'.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** hammie.py 6 Sep 2002 20:23:16 -0000 1.3
--- hammie.py 6 Sep 2002 20:42:44 -0000 1.4
***************
*** 187,191 ****
return ''
! if os.path.isdir(msgs):
# XXX This is bogus: use an MHMailbox if the pathname contains /Mail/
# XXX Should really use '+foo' MH folder styles. Later.
--- 187,195 ----
return ''
! if msgs.startswith("+"):
! import mhlib
! mh = mhlib.MH()
! mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]))
! elif os.path.isdir(msgs):
# XXX This is bogus: use an MHMailbox if the pathname contains /Mail/
# XXX Should really use '+foo' MH folder styles. Later.
From tim_one@users.sourceforge.net Fri Sep 6 21:42:42 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 13:42:42 -0700
Subject: [Spambayes-checkins] spambayes timtoken.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11651
Modified Files:
timtoken.py
Log Message:
Added a note about an experiment with no lower limit on the length of
words we'll look at. Didn't matter to f-p, but hurt f-n.
Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** timtoken.py 6 Sep 2002 19:12:59 -0000 1.2
--- timtoken.py 6 Sep 2002 20:42:40 -0000 1.3
***************
*** 392,395 ****
--- 392,397 ----
# XXX runs -- overall, no significant difference. It's only "common
# XXX sense" so far driving the exclusion of lengths 1 and 2.
+ # XXX Later: A test with no lower bound showed a significant increase
+ # XXX in the f-n rate. Curious!
# Make sure this range matches in tokenize().
From gvanrossum@users.sourceforge.net Fri Sep 6 21:48:32 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 13:48:32 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13867
Modified Files:
hammie.py
Log Message:
Fix comments.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** hammie.py 6 Sep 2002 20:42:44 -0000 1.4
--- hammie.py 6 Sep 2002 20:48:29 -0000 1.5
***************
*** 180,184 ****
def train(bayes, msgs, is_spam):
! """Train bayes with a message"""
def _factory(fp):
try:
--- 180,184 ----
def train(bayes, msgs, is_spam):
! """Train bayes with all messages from a mailbox."""
def _factory(fp):
try:
***************
*** 192,197 ****
mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]))
elif os.path.isdir(msgs):
! # XXX This is bogus: use an MHMailbox if the pathname contains /Mail/
! # XXX Should really use '+foo' MH folder styles. Later.
if msgs.find("/Mail/") >= 0:
mbox = mailbox.MHMailbox(msgs, _factory)
--- 192,197 ----
mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]))
elif os.path.isdir(msgs):
! # XXX Bogus: use an MHMailbox if the pathname contains /Mail/,
! # else a DirOfTxtFileMailbox.
if msgs.find("/Mail/") >= 0:
mbox = mailbox.MHMailbox(msgs, _factory)
From tim_one@users.sourceforge.net Fri Sep 6 23:47:50 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 15:47:50 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19153
Modified Files:
timtest.py
Log Message:
Moved this along toward being more pluggable. Nuked the drive() function
and renamed Jeremy's jdrive() to drive(). Factored out code for
displaying a msg. Repaired some output so that rates.py can find the
output it's looking for. Sped up the determination of the best
discriminators by using an nbest heap instead of materializing the
whole wordinfo dict into a list and sorting it.
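[Editorial note: the nbest-heap trick mentioned in this log keeps only the top N entries in a fixed-size min-heap, so picking the best discriminators no longer requires sorting a list of the whole wordinfo dict. A sketch of the same idea with hypothetical names; sentinels use empty strings rather than None so tuple comparisons stay legal:]

```python
import heapq

def nbest(scores, n):
    """Return the n highest-scoring (score, name) pairs, ascending,
    using a size-n min-heap instead of a full sort."""
    heap = [(float("-inf"), "")] * n   # seed with sentinel entries
    smallest = float("-inf")
    for name, score in scores.items():
        if score > smallest:
            # Evict the current minimum, push the better candidate.
            heapq.heapreplace(heap, (score, name))
            smallest = heap[0][0]
    return sorted(pair for pair in heap if pair[0] > float("-inf"))

words = {"viagra": 40, "python": 2, "free": 33, "meeting": 1, "click": 25}
print(nbest(words, 3))   # -> [(25, 'click'), (33, 'free'), (40, 'viagra')]
```

This is O(W log N) for W words and N winners, versus O(W log W) for building and sorting the whole list.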
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** timtest.py 6 Sep 2002 19:26:34 -0000 1.8
--- timtest.py 6 Sep 2002 22:47:48 -0000 1.9
***************
*** 9,12 ****
--- 9,13 ----
from sets import Set
import cPickle as pickle
+ from heapq import heapreplace
import Tester
***************
*** 56,59 ****
--- 57,71 ----
spam.display()
+ def printmsg(msg, prob, clues, charlimit=None):
+ print msg.tag
+ print "prob =", prob
+ for clue in clues:
+ print "prob(%r) = %g" % clue
+ print
+ guts = msg.guts
+ if charlimit is not None:
+ guts = guts[:charlimit]
+ print guts
+
class Msg(object):
def __init__(self, dir, name):
***************
*** 78,81 ****
--- 90,96 ----
self.directory = directory
+ def __str__(self):
+ return self.directory
+
def produce(self):
directory = self.directory
***************
*** 86,93 ****
return self.produce()
class Driver:
! def __init__(self):
! self.nbuckets = 40
self.falsepos = Set()
self.falseneg = Set()
--- 101,116 ----
return self.produce()
+
+ # Loop:
+ # train() # on ham and spam
+ # Loop:
+ # test() # on presumably new ham and spam
+ # finishtest() # display stats against all runs on training set
+ # alldone() # display stats against all runs
+
class Driver:
! def __init__(self, nbuckets=40):
! self.nbuckets = nbuckets
self.falsepos = Set()
self.falseneg = Set()
***************
*** 97,109 ****
def train(self, ham, spam):
self.classifier = classifier.GrahamBayes()
! self.tester = Tester.Test(self.classifier)
! print "Training on", ham, "&", spam, "..."
! self.tester.train(ham, spam)
self.trained_ham_hist = Hist(self.nbuckets)
self.trained_spam_hist = Hist(self.nbuckets)
! def finish(self):
! printhist("all in this set:",
self.trained_ham_hist, self.trained_spam_hist)
self.global_ham_hist += self.trained_ham_hist
--- 120,134 ----
def train(self, ham, spam):
self.classifier = classifier.GrahamBayes()
! t = self.tester = Tester.Test(self.classifier)
!
! print "Training on", ham, "&", spam, "...",
! t.train(ham, spam)
! print t.nham, "hams &", t.nspam, "spams"
self.trained_ham_hist = Hist(self.nbuckets)
self.trained_spam_hist = Hist(self.nbuckets)
! def finishtest(self):
! printhist("all in this training set:",
self.trained_ham_hist, self.trained_spam_hist)
self.global_ham_hist += self.trained_ham_hist
***************
*** 127,136 ****
print
print "Low prob spam!", prob
- print msg.tag
prob, clues = c.spamprob(msg, True)
! for clue in clues:
! print "prob(%r) = %g" % clue
! print
! print msg.guts
t.reset_test_results()
--- 152,157 ----
print
print "Low prob spam!", prob
prob, clues = c.spamprob(msg, True)
! printmsg(msg, prob, clues)
t.reset_test_results()
***************
*** 148,158 ****
for e in newfpos:
print '*' * 78
- print e.tag
prob, clues = c.spamprob(e, True)
! print "prob =", prob
! for clue in clues:
! print "prob(%r) = %g" % clue
! print
! print e.guts
newfneg = Set(t.false_negatives()) - self.falseneg
--- 169,174 ----
for e in newfpos:
print '*' * 78
prob, clues = c.spamprob(e, True)
! printmsg(e, prob, clues)
newfneg = Set(t.false_negatives()) - self.falseneg
***************
*** 161,188 ****
for e in []:#newfneg:
print '*' * 78
- print e.tag
prob, clues = c.spamprob(e, True)
! print "prob =", prob
! for clue in clues:
! print "prob(%r) = %g" % clue
! print
! print e.guts[:1000]
print
print " best discriminators:"
! stats = [(r.killcount, w) for w, r in c.wordinfo.iteritems()]
stats.sort()
- del stats[:-30]
for count, w in stats:
r = c.wordinfo[w]
print " %r %d %g" % (w, r.killcount, r.spamprob)
-
printhist("this pair:", local_ham_hist, local_spam_hist)
-
self.trained_ham_hist += local_ham_hist
self.trained_spam_hist += local_spam_hist
! def jdrive():
d = Driver()
--- 177,203 ----
for e in []:#newfneg:
print '*' * 78
prob, clues = c.spamprob(e, True)
! printmsg(e, prob, clues, 1000)
print
print " best discriminators:"
! stats = [(-1, None) for i in range(30)]
! smallest_killcount = -1
! for w, r in c.wordinfo.iteritems():
! if r.killcount > smallest_killcount:
! heapreplace(stats, (r.killcount, w))
! smallest_killcount = stats[0][0]
stats.sort()
for count, w in stats:
+ if count < 0:
+ continue
r = c.wordinfo[w]
print " %r %d %g" % (w, r.killcount, r.spamprob)
printhist("this pair:", local_ham_hist, local_spam_hist)
self.trained_ham_hist += local_ham_hist
self.trained_spam_hist += local_spam_hist
! def drive():
d = Driver()
***************
*** 193,296 ****
continue
d.test(MsgStream(hd2), MsgStream(sd2))
! d.finish()
d.alldone()
-
- def drive():
- nbuckets = 40
- falsepos = Set()
- falseneg = Set()
- global_ham_hist = Hist(nbuckets)
- global_spam_hist = Hist(nbuckets)
- for spamdir, hamdir in SPAMHAMDIRS:
- c = classifier.GrahamBayes()
- t = Tester.Test(c)
- print "Training on", hamdir, "&", spamdir, "...",
- t.train(MsgStream(hamdir), MsgStream(spamdir))
- print t.nham, "hams &", t.nspam, "spams"
-
- trained_ham_hist = Hist(nbuckets)
- trained_spam_hist = Hist(nbuckets)
-
- fp = file('w.pik', 'wb')
- pickle.dump(c, fp, 1)
- fp.close()
-
- for sd2, hd2 in SPAMHAMDIRS:
- if (sd2, hd2) == (spamdir, hamdir):
- continue
-
- local_ham_hist = Hist(nbuckets)
- local_spam_hist = Hist(nbuckets)
-
- def new_ham(msg, prob):
- local_ham_hist.add(prob)
-
- def new_spam(msg, prob):
- local_spam_hist.add(prob)
- if prob < 0.1:
- print
- print "Low prob spam!", prob
- print msg.path
- prob, clues = c.spamprob(msg, True)
- for clue in clues:
- print "prob(%r) = %g" % clue
- print
- print msg.guts
-
- t.reset_test_results()
- print " testing against", hd2, "&", sd2, "...",
- t.predict(MsgStream(sd2), True, new_spam)
- t.predict(MsgStream(hd2), False, new_ham)
- print t.nham_tested, "hams &", t.nspam_tested, "spams"
-
- print " false positive:", t.false_positive_rate()
- print " false negative:", t.false_negative_rate()
-
- newfpos = Set(t.false_positives()) - falsepos
- falsepos |= newfpos
- print " new false positives:", [e.path for e in newfpos]
- for e in newfpos:
- print '*' * 78
- print e.path
- prob, clues = c.spamprob(e, True)
- print "prob =", prob
- for clue in clues:
- print "prob(%r) = %g" % clue
- print
- print e.guts
-
- newfneg = Set(t.false_negatives()) - falseneg
- falseneg |= newfneg
- print " new false negatives:", [e.path for e in newfneg]
- for e in []:#newfneg:
- print '*' * 78
- print e.path
- prob, clues = c.spamprob(e, True)
- print "prob =", prob
- for clue in clues:
- print "prob(%r) = %g" % clue
- print
- print e.guts[:1000]
-
- print
- print " best discriminators:"
- stats = [(r.killcount, w) for w, r in c.wordinfo.iteritems()]
- stats.sort()
- del stats[:-30]
- for count, w in stats:
- r = c.wordinfo[w]
- print " %r %d %g" % (w, r.killcount, r.spamprob)
-
-
- printhist("this pair:", local_ham_hist, local_spam_hist)
-
- trained_ham_hist += local_ham_hist
- trained_spam_hist += local_spam_hist
-
- printhist("all in this set:", trained_ham_hist, trained_spam_hist)
- global_ham_hist += trained_ham_hist
- global_spam_hist += trained_spam_hist
-
- printhist("all runs:", global_ham_hist, global_spam_hist)
if __name__ == "__main__":
--- 208,213 ----
continue
d.test(MsgStream(hd2), MsgStream(sd2))
! d.finishtest()
d.alldone()
if __name__ == "__main__":
From rubiconx@users.sourceforge.net Fri Sep 6 23:53:51 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Fri, 06 Sep 2002 15:53:51 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22277
Modified Files:
classifier.py
Log Message:
Another hack to get classifier to work with the database back-end.
This makes hammie work with the -d option.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** classifier.py 6 Sep 2002 19:29:56 -0000 1.2
--- classifier.py 6 Sep 2002 22:53:49 -0000 1.3
***************
*** 538,541 ****
--- 538,542 ----
else:
record.hamcount += 1
+ wordinfo[word] = record
if self.DEBUG:
From tim_one@users.sourceforge.net Sat Sep 7 01:31:58 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 17:31:58 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.9,1.10
timtoken.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13865
Modified Files:
timtest.py timtoken.py
Log Message:
Added note about boosting the lower limit on word length to 4.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** timtest.py 6 Sep 2002 22:47:48 -0000 1.9
--- timtest.py 7 Sep 2002 00:31:56 -0000 1.10
***************
*** 129,132 ****
--- 129,138 ----
self.trained_spam_hist = Hist(self.nbuckets)
+ #f = file('w.pik', 'wb')
+ #pickle.dump(self.classifier, f, 1)
+ #f.close()
+ #import sys
+ #sys.exit(0)
+
def finishtest(self):
printhist("all in this training set:",
Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** timtoken.py 6 Sep 2002 20:42:40 -0000 1.3
--- timtoken.py 7 Sep 2002 00:31:56 -0000 1.4
***************
*** 394,397 ****
--- 394,399 ----
# XXX Later: A test with no lower bound showed a significant increase
# XXX in the f-n rate. Curious!
+ # XXX Later: Boosting the lower bound to 4 is a Bad Idea too: f-p and
+ # XXX f-n rates both suffered then.
# Make sure this range matches in tokenize().
From tim_one@users.sourceforge.net Sat Sep 7 02:39:57 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 18:39:57 -0700
Subject: [Spambayes-checkins] spambayes timtoken.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv27364
Modified Files:
timtoken.py
Log Message:
Comments about how long a word should be; the current values are the
best.
Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** timtoken.py 7 Sep 2002 00:31:56 -0000 1.4
--- timtoken.py 7 Sep 2002 01:39:55 -0000 1.5
***************
*** 356,359 ****
--- 356,379 ----
# XXX not to strip HTML from HTML-only msgs should be revisited.
+ ##############################################################################
+ # How big should "a word" be?
+ #
+ # As I write this, words less than 3 chars are ignored completely, and words
+ # with more than 12 are special-cased, replaced with a summary "I skipped
+ # about so-and-so many chars starting with such-and-such a letter" token.
+ # This makes sense for English if most of the info is in "regular size"
+ # words.
+ #
+ # A test run boosting to 13 had no effect on f-p rate, and did a little
+ # better or worse than 12 across runs -- overall, no significant difference.
+ # The database size is smaller at 12, so there's nothing in favor of 13.
+ # A test at 11 showed a slight but consistent bad effect on the f-n rate
+ # (lost 12 times, won once, tied 7 times).
+ #
+ # A test with no lower bound showed a significant increase in the f-n rate.
+ # Curious, but not worth digging into. Boosting the lower bound to 4 is a
+ # worse idea: f-p and f-n rates both suffered significantly then. I didn't
+ # try testing with lower bound 2.
+
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
***************
*** 386,399 ****
def tokenize_word(word, _len=len):
n = _len(word)
-
- # XXX How big should "a word" be?
- # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
- # XXX on f-p rate, and did a little better or worse than 12 across
- # XXX runs -- overall, no significant difference. It's only "common
- # XXX sense" so far driving the exclusion of lengths 1 and 2.
- # XXX Later: A test with no lower bound showed a significant increase
- # XXX in the f-n rate. Curious!
- # XXX Later: Boosting the lower bound to 4 is a Bad Idea too: f-p and
- # XXX f-n rates both suffered then.
# Make sure this range matches in tokenize().
--- 406,409 ----
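[Editorial note: the length policy discussed in the comment block above boils down to a few lines. This sketch uses illustrative token strings and bucket arithmetic, not necessarily the project's exact format:]

```python
def tokenize_word(word, lo=3, hi=12):
    """Sketch of the word-length policy: words shorter than `lo` are
    dropped as uninformative; words longer than `hi` are collapsed into
    a summary token recording the first letter and a rounded length."""
    n = len(word)
    if n < lo:
        return []                      # too short to be informative
    if n > hi:
        # Bucket the length to the nearest 10 below, so one token
        # covers a range of oversized gibberish runs.
        return ["skip:%s %d" % (word[0], n // 10 * 10)]
    return [word]

print(tokenize_word("hi"))         # -> []
print(tokenize_word("python"))     # -> ['python']
print(tokenize_word("x" * 25))     # -> ['skip:x 20']
```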
From tim_one@users.sourceforge.net Sat Sep 7 02:41:30 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 18:41:30 -0700
Subject: [Spambayes-checkins] spambayes timtoken.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv27775
Modified Files:
timtoken.py
Log Message:
Fixed typo in comment.
Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** timtoken.py 7 Sep 2002 01:39:55 -0000 1.5
--- timtoken.py 7 Sep 2002 01:41:28 -0000 1.6
***************
*** 467,471 ****
#
# A bug in this code prevented Content-Transfer-Encoding from getting
! # picked up. Fixing that bug showed that it didn't helpe, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it);
--- 467,471 ----
#
# A bug in this code prevented Content-Transfer-Encoding from getting
! # picked up. Fixing that bug showed that it didn't help, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it);
From gvanrossum@users.sourceforge.net Sat Sep 7 05:20:45 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 21:20:45 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4871
Modified Files:
hammie.py
Log Message:
Fixed a bug in the opening of a folder given with "+foo" (wasn't using
_factory).
Add a -u option similar to that of GBayes.py. For this, factored the
opening of the mbox out of train() into a separate function getmbox(),
and the formatting of the clues out of filter().
(The -u option needs work; it currently doesn't report the message
number in a very useful way.)
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** hammie.py 6 Sep 2002 20:48:29 -0000 1.5
--- hammie.py 7 Sep 2002 04:20:43 -0000 1.6
***************
*** 10,16 ****
show usage and exit
-g PATH
! mbox or directory of known good messages (non-spam)
-s PATH
! mbox or directory of known spam messages
-p FILE
use file as the persistent store. loads data from this file if it
--- 10,18 ----
show usage and exit
-g PATH
! mbox or directory of known good messages (non-spam) to train on.
-s PATH
! mbox or directory of known spam messages to train on.
! -u PATH
! mbox of unknown messages. A ham/spam decision is reported for each.
-p FILE
use file as the persistent store. loads data from this file if it
***************
*** 179,184 ****
! def train(bayes, msgs, is_spam):
! """Train bayes with all messages from a mailbox."""
def _factory(fp):
try:
--- 181,186 ----
! def getmbox(msgs):
! """Return an iterable mbox object given a file/directory/folder name."""
def _factory(fp):
try:
***************
*** 190,194 ****
import mhlib
mh = mhlib.MH()
! mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]))
elif os.path.isdir(msgs):
# XXX Bogus: use an MHMailbox if the pathname contains /Mail/,
--- 192,197 ----
import mhlib
mh = mhlib.MH()
! mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]),
! _factory)
elif os.path.isdir(msgs):
# XXX Bogus: use an MHMailbox if the pathname contains /Mail/,
***************
*** 201,205 ****
--- 204,212 ----
fp = open(msgs)
mbox = mailbox.PortableUnixMailbox(fp, _factory)
+ return mbox
+ def train(bayes, msgs, is_spam):
+ """Train bayes with all messages from a mailbox."""
+ mbox = getmbox(msgs)
i = 0
for msg in mbox:
***************
*** 212,215 ****
--- 219,227 ----
print
+ def formatclues(clues, sep="; "):
+ """Format the clues into something readable."""
+ # XXX Maybe sort by prob first?
+ return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])
+
def filter(bayes, input, output):
"""Filter (judge) a message"""
***************
*** 221,228 ****
disp = "Yes"
disp += "; %.2f" % prob
! disp += "; " + "; ".join(map(lambda x: "%s: %.2f" % (`x[0]`, x[1]), clues))
msg.add_header("X-Spam-Disposition", disp)
output.write(str(msg))
def usage(code, msg=''):
if msg:
--- 233,259 ----
disp = "Yes"
disp += "; %.2f" % prob
! disp += "; " + formatclues(clues)
msg.add_header("X-Spam-Disposition", disp)
output.write(str(msg))
+ def score(bayes, msgs):
+ """Score (judge) all messages from a mailbox."""
+ # XXX The reporting needs work!
+ mbox = getmbox(msgs)
+ i = 0
+ spams = hams = 0
+ for msg in mbox:
+ i += 1
+ prob, clues = bayes.spamprob(tokenize(str(msg)), True)
+ isspam = prob >= 0.9
+ print "%6d %4.2f %1s" % (i, prob, isspam and "S" or "."),
+ if isspam:
+ spams += 1
+ print formatclues(clues)
+ else:
+ hams += 1
+ print
+ print "Total %d spam, %d ham" % (spams, hams)
+
def usage(code, msg=''):
if msg:
***************
*** 234,238 ****
def main():
try:
! opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:')
except getopt.error, msg:
usage(1, msg)
--- 265,269 ----
def main():
try:
! opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:')
except getopt.error, msg:
usage(1, msg)
***************
*** 242,246 ****
pck = "hammie.db"
! good = spam = None
do_filter = usedb = False
for opt, arg in opts:
--- 273,277 ----
pck = "hammie.db"
! good = spam = unknown = None
do_filter = usedb = False
for opt, arg in opts:
***************
*** 257,260 ****
--- 288,293 ----
elif opt == "-f":
do_filter = True
+ elif opt == '-u':
+ unknown = arg
if args:
usage(1)
***************
*** 294,297 ****
--- 327,333 ----
if do_filter:
filter(bayes, sys.stdin, sys.stdout)
+
+ if unknown:
+ score(bayes, unknown)
if __name__ == "__main__":
From gvanrossum@users.sourceforge.net Sat Sep 7 05:23:18 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 21:23:18 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5312
Modified Files:
hammie.py
Log Message:
Sort the clues before formatting. I definitely like this better.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** hammie.py 7 Sep 2002 04:20:43 -0000 1.6
--- hammie.py 7 Sep 2002 04:23:15 -0000 1.7
***************
*** 221,226 ****
def formatclues(clues, sep="; "):
"""Format the clues into something readable."""
! # XXX Maybe sort by prob first?
! return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])
def filter(bayes, input, output):
--- 221,227 ----
def formatclues(clues, sep="; "):
"""Format the clues into something readable."""
! lst = [(prob, word) for word, prob in clues]
! lst.sort()
! return sep.join(["%r: %.2f" % (word, prob) for prob, word in lst])
def filter(bayes, input, output):
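The 2002 code above sorts by building (prob, word) tuples because Python 2.2's list.sort() took no key argument. A modern Python 3 equivalent of the sorted formatclues() would use sorted() with a key; this is a sketch mirroring the hammie.py function, not the checked-in code:

```python
def formatclues(clues, sep="; "):
    """Format (word, prob) clue pairs, lowest probability first."""
    # sorted() with a key function replaces the decorate-sort-undecorate
    # dance ((prob, word) tuples) the Python 2.2-era code needed.
    ordered = sorted(clues, key=lambda pair: pair[1])
    return sep.join("%r: %.2f" % (word, prob) for word, prob in ordered)
```

Note one small difference: the original breaks probability ties by word (the tuple sort compares words second), while this key sorts on probability alone.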
From gvanrossum@users.sourceforge.net Sat Sep 7 05:28:16 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 21:28:16 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6070
Modified Files:
README.txt
Log Message:
Add a clue about the Python version.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** README.txt 6 Sep 2002 20:13:31 -0000 1.6
--- README.txt 7 Sep 2002 04:28:13 -0000 1.7
***************
*** 16,19 ****
--- 16,22 ----
negative rate is still over 1%.
+ The code here depends in various ways on the latest Python from CVS
+ (a.k.a. Python 2.3a0 :-).
+
Primary Files
From gvanrossum@users.sourceforge.net Sat Sep 7 05:31:10 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 21:31:10 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6491
Modified Files:
hammie.py
Log Message:
Minor cleanup; standardize exit codes; add some docs/comments.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** hammie.py 7 Sep 2002 04:23:15 -0000 1.7
--- hammie.py 7 Sep 2002 04:31:08 -0000 1.8
***************
*** 1,3 ****
--- 1,4 ----
#! /usr/bin/env python
+ # At the moment, this requires Python 2.3 from CVS
# A driver for the classifier module. Currently mostly a wrapper around
***************
*** 27,32 ****
"""
- from __future__ import generators
-
import sys
import os
--- 28,31 ----
***************
*** 40,44 ****
import cPickle as pickle
! program = sys.argv[0]
# Tim's tokenizer kicks far more booty than anything I would have
--- 39,43 ----
import cPickle as pickle
! program = sys.argv[0] # For usage(); referenced by docstring above
# Tim's tokenizer kicks far more booty than anything I would have
***************
*** 258,261 ****
--- 257,261 ----
def usage(code, msg=''):
+ """Print usage message and sys.exit(code)."""
if msg:
print >> sys.stderr, msg
***************
*** 265,275 ****
def main():
try:
opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:')
except getopt.error, msg:
! usage(1, msg)
if not opts:
! usage(0, "No options given")
pck = "hammie.db"
--- 265,276 ----
def main():
+ """Main program; parse options and go."""
try:
opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:')
except getopt.error, msg:
! usage(2, msg)
if not opts:
! usage(2, "No options given")
pck = "hammie.db"
***************
*** 292,296 ****
unknown = arg
if args:
! usage(1)
save = False
--- 293,297 ----
unknown = arg
if args:
! usage(2, "Positional arguments not allowed")
save = False
From gvanrossum@users.sourceforge.net Sat Sep 7 05:50:12 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 21:50:12 -0700
Subject: [Spambayes-checkins] spambayes timtoken.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9264
Modified Files:
timtoken.py
Log Message:
Made tokenize() polymorphic. It now accepts an email.Message.Message
instance, a file-like object (something with a readline method), or a
string (anything else). This is a major speed boost for hammie.py,
which has Message objects, but had to convert them to strings before
passing to tokenize(), which parsed the string into a Message object
again!
Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** timtoken.py 7 Sep 2002 01:41:28 -0000 1.6
--- timtoken.py 7 Sep 2002 04:50:10 -0000 1.7
***************
*** 2,6 ****
import email
- from email import message_from_string
from sets import Set
--- 2,5 ----
***************
*** 555,566 ****
yield 'content-transfer-encoding:' + x.lower()
! def tokenize(string):
# Create an email Message object.
! try:
! msg = message_from_string(string)
! except email.Errors.MessageParseError:
! yield 'control: MessageParseError'
! # XXX Fall back to the raw body text?
! return
# Special tagging of header lines.
--- 554,570 ----
yield 'content-transfer-encoding:' + x.lower()
! def tokenize(obj):
# Create an email Message object.
! if isinstance(obj, email.Message.Message):
! msg = obj
! elif hasattr(obj, "readline"):
! msg = email.message_from_file(obj)
! else:
! try:
! msg = email.message_from_string(obj)
! except email.Errors.MessageParseError:
! yield 'control: MessageParseError'
! # XXX Fall back to the raw body text?
! return
# Special tagging of header lines.
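The polymorphic dispatch in the new tokenize() can be sketched in modern Python 3 as follows. This is an illustration of the same three-way dispatch (Message instance, file-like object, string), not the checked-in code; the MessageParseError handling discussed later in this thread is omitted, since the modern email package no longer raises it for ordinary malformed input:

```python
import email
import io
from email.message import Message

def as_message(obj):
    """Accept a Message instance, a file-like object (anything with a
    readline method), or a string, in the same dispatch order as the
    polymorphic tokenize() above."""
    if isinstance(obj, Message):
        return obj                            # already parsed
    if hasattr(obj, "readline"):
        return email.message_from_file(obj)   # file-like object
    return email.message_from_string(obj)     # assume a string
```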
From gvanrossum@users.sourceforge.net Sat Sep 7 05:50:47 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 21:50:47 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9362
Modified Files:
hammie.py
Log Message:
Use the new tokenize(), which accepts our Message objects.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** hammie.py 7 Sep 2002 04:31:08 -0000 1.8
--- hammie.py 7 Sep 2002 04:50:45 -0000 1.9
***************
*** 215,219 ****
sys.stdout.write("\r%6d" % i)
sys.stdout.flush()
! bayes.learn(tokenize(str(msg)), is_spam, False)
print
--- 215,219 ----
sys.stdout.write("\r%6d" % i)
sys.stdout.flush()
! bayes.learn(tokenize(msg), is_spam, False)
print
***************
*** 227,231 ****
"""Filter (judge) a message"""
msg = email.message_from_file(input)
! prob, clues = bayes.spamprob(tokenize(str(msg)), True)
if prob < 0.9:
disp = "No"
--- 227,231 ----
"""Filter (judge) a message"""
msg = email.message_from_file(input)
! prob, clues = bayes.spamprob(tokenize(msg), True)
if prob < 0.9:
disp = "No"
***************
*** 245,249 ****
for msg in mbox:
i += 1
! prob, clues = bayes.spamprob(tokenize(str(msg)), True)
isspam = prob >= 0.9
print "%6d %4.2f %1s" % (i, prob, isspam and "S" or "."),
--- 245,249 ----
for msg in mbox:
i += 1
! prob, clues = bayes.spamprob(tokenize(msg), True)
isspam = prob >= 0.9
print "%6d %4.2f %1s" % (i, prob, isspam and "S" or "."),
From gvanrossum@users.sourceforge.net Sat Sep 7 06:02:58 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 22:02:58 -0700
Subject: [Spambayes-checkins] spambayes .cvsignore,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11644
Added Files:
.cvsignore
Log Message:
Ignore certain files.
--- NEW FILE: .cvsignore ---
*.pyc
*.pyo
*.db
*.pik
*.zip
From tim_one@users.sourceforge.net Sat Sep 7 06:11:34 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 22:11:34 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.3,1.4
timtest.py,1.10,1.11
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12667
Modified Files:
classifier.py timtest.py
Log Message:
Shaking things up! MINCOUNT is history. This yields a major improvement
in the f-n rate, but may have knocked the f-p rate out of a local minimum.
I considered this carefully, and expect you'll agree it's a good change if
you read the new comments. There's surely a better way to get the tiny

bit of good that was hiding under MINCOUNT's bad effects.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** classifier.py 6 Sep 2002 22:53:49 -0000 1.3
--- classifier.py 7 Sep 2002 05:11:30 -0000 1.4
***************
*** 58,71 ****
# appropriate bias factor.)
#
! # XXX Reducing this to 1.0 (effectively not using it at all then) seemed to
! # XXX give a sharp reduction in the f-n rate in a partial test run, while
! # XXX adding a few mysterious f-ps. Then boosting it to 2.0 appeared to
! # XXX give an increase in the f-n rate in a partial test run. This needs
! # XXX deeper investigation. Might also be good to develop a more general
! # XXX concept of confidence: MINCOUNT is a gross gimmick in that direction,
! # XXX effectively saying we have no confidence in probabilities computed
! # XXX from fewer than MINCOUNT instances, but unbounded confidence in
! # XXX probabilities computed from at least MINCOUNT instances.
! MINCOUNT = 5.0
# The maximum number of words spamprob() pays attention to. Graham had 15
--- 58,145 ----
# appropriate bias factor.)
#
! # Twist: Graham used MINCOUNT=5.0 here. I got rid of it: in effect,
! # given HAMBIAS=2.0, it meant we ignored a possibly perfectly good piece
! # of spam evidence unless it appeared at least 5 times, and ditto for
! # ham evidence unless it appeared at least 3 times. That certainly does
! # bias in favor of ham, but multiple distortions in favor of ham are
! # multiple ways to get confused and trip up. Here are the test results
! # before and after, MINCOUNT=5.0 on the left, no MINCOUNT on the right;
! # ham sets had 4000 msgs (so 0.025% is one msg), and spam sets 2750:
! #
! # false positive percentages
! # 0.000 0.000 tied
! # 0.000 0.000 tied
! # 0.100 0.050 won -50.00%
! # 0.000 0.025 lost +(was 0)
! # 0.025 0.075 lost +200.00%
! # 0.025 0.000 won -100.00%
! # 0.100 0.100 tied
! # 0.025 0.050 lost +100.00%
! # 0.025 0.025 tied
! # 0.050 0.025 won -50.00%
! # 0.100 0.050 won -50.00%
! # 0.025 0.050 lost +100.00%
! # 0.025 0.050 lost +100.00%
! # 0.025 0.000 won -100.00%
! # 0.025 0.000 won -100.00%
! # 0.025 0.075 lost +200.00%
! # 0.025 0.025 tied
! # 0.000 0.000 tied
! # 0.025 0.025 tied
! # 0.100 0.050 won -50.00%
! #
! # won 7 times
! # tied 7 times
! # lost 6 times
! #
! # total unique fp went from 9 to 13
! #
! # false negative percentages
! # 0.364 0.327 won -10.16%
! # 0.400 0.400 tied
! # 0.400 0.327 won -18.25%
! # 0.909 0.691 won -23.98%
! # 0.836 0.545 won -34.81%
! # 0.618 0.291 won -52.91%
! # 0.291 0.218 won -25.09%
! # 1.018 0.654 won -35.76%
! # 0.982 0.364 won -62.93%
! # 0.727 0.291 won -59.97%
! # 0.800 0.327 won -59.13%
! # 1.163 0.691 won -40.58%
! # 0.764 0.582 won -23.82%
! # 0.473 0.291 won -38.48%
! # 0.473 0.364 won -23.04%
! # 0.727 0.436 won -40.03%
! # 0.655 0.436 won -33.44%
! # 0.509 0.218 won -57.17%
! # 0.545 0.291 won -46.61%
! # 0.509 0.254 won -50.10%
! #
! # won 19 times
! # tied 1 times
! # lost 0 times
! #
! # total unique fn went from 168 to 106
! #
! # So dropping MINCOUNT was a huge win for the f-n rate, and a mixed bag
! # for the f-p rate (but the f-p rate was so low compared to 4000 msgs that
! # even the losses were barely significant). In addition, dropping MINCOUNT
! # had a larger good effect when using random training subsets of size 500;
! # this makes intuitive sense, as with less training data it was harder to
! # exceed the MINCOUNT threshold.
! #
! # Still, MINCOUNT seemed to be a gross approximation to *something* valuable:
! # a strong clue appearing in 1,000 training msgs is certainly more trustworthy
! # than an equally strong clue appearing in only 1 msg. I'm almost certain it
! # would pay to develop a way to take that into account when scoring. In
! # particular, there was a very specific new class of false positives
! # introduced by dropping MINCOUNT: some c.l.py msgs consisting mostly of
! # Spanish or French. The "high probability" spam clues were innocuous
! # words like "puedo" and "como", that appeared in very rare Spanish and
! # French spam too. There has to be a more principled way to address this
! # than the MINCOUNT hammer, and the test results clearly showed that MINCOUNT
! # did more harm than good overall.
!
# The maximum number of words spamprob() pays attention to. Graham had 15
***************
*** 477,493 ****
hamcount = HAMBIAS * record.hamcount
spamcount = SPAMBIAS * record.spamcount
! if hamcount + spamcount < MINCOUNT:
! prob = UNKNOWN_SPAMPROB
! else:
! hamratio = min(1.0, hamcount / nham)
! spamratio = min(1.0, spamcount / nspam)
- prob = spamratio / (hamratio + spamratio)
- if prob < MIN_SPAMPROB:
- prob = MIN_SPAMPROB
- elif prob > MAX_SPAMPROB:
- prob = MAX_SPAMPROB
if record.spamprob != prob:
record.spamprob = prob
self.wordinfo[word] = record
--- 551,567 ----
hamcount = HAMBIAS * record.hamcount
spamcount = SPAMBIAS * record.spamcount
! hamratio = min(1.0, hamcount / nham)
! spamratio = min(1.0, spamcount / nspam)
!
! prob = spamratio / (hamratio + spamratio)
! if prob < MIN_SPAMPROB:
! prob = MIN_SPAMPROB
! elif prob > MAX_SPAMPROB:
! prob = MAX_SPAMPROB
if record.spamprob != prob:
record.spamprob = prob
+ # The next seemingly pointless line appears to be a hack
+ # to allow a persistent db to realize the record has changed.
self.wordinfo[word] = record
***************
*** 497,515 ****
print "P(%r) = %g" % (w, r.spamprob)
! def clearjunk(self, oldesttime, mincount=MINCOUNT):
"""Forget useless wordinfo records. This can shrink the database size.
A record for a word will be retained only if the word was accessed
! at or after oldesttime, or appeared at least mincount times in
! messages passed to learn(). mincount is optional, and defaults
! to the value an internal algorithm uses to decide that a word is so
! rare that it has no predictive value.
"""
wordinfo = self.wordinfo
mincount = float(mincount)
! tonuke = [w for w, r in wordinfo.iteritems()
! if r.atime < oldesttime and
! SPAMBIAS*r.spamcount + HAMBIAS*r.hamcount < mincount]
for w in tonuke:
if self.DEBUG:
--- 571,584 ----
print "P(%r) = %g" % (w, r.spamprob)
! def clearjunk(self, oldesttime):
"""Forget useless wordinfo records. This can shrink the database size.
A record for a word will be retained only if the word was accessed
! at or after oldesttime.
"""
wordinfo = self.wordinfo
mincount = float(mincount)
! tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
for w in tonuke:
if self.DEBUG:
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** timtest.py 7 Sep 2002 00:31:56 -0000 1.10
--- timtest.py 7 Sep 2002 05:11:31 -0000 1.11
***************
*** 98,101 ****
--- 98,110 ----
yield Msg(directory, fname)
+ def xproduce(self):
+ import random
+ directory = self.directory
+ all = os.listdir(directory)
+ random.seed(hash(directory))
+ random.shuffle(all)
+ for fname in all[-500:]:
+ yield Msg(directory, fname)
+
def __iter__(self):
return self.produce()
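The word-probability update described in the long classifier.py comment above, now without the MINCOUNT threshold, can be illustrated with a small standalone sketch. This is a toy restatement of the diffed update() logic (same HAMBIAS/SPAMBIAS biasing and MIN/MAX clamping), not the checked-in classifier:

```python
# Graham-style constants, as used in the surrounding classifier code.
HAMBIAS, SPAMBIAS = 2.0, 1.0
MIN_SPAMPROB, MAX_SPAMPROB = 0.01, 0.99

def word_spamprob(hamcount, spamcount, nham, nspam):
    """Per-word spam probability: biased count ratios, clamped to
    [MIN_SPAMPROB, MAX_SPAMPROB], with no MINCOUNT cutoff."""
    hamratio = min(1.0, HAMBIAS * hamcount / nham)
    spamratio = min(1.0, SPAMBIAS * spamcount / nspam)
    prob = spamratio / (hamratio + spamratio)
    return min(max(prob, MIN_SPAMPROB), MAX_SPAMPROB)
```

With the test-set sizes mentioned above (4000 hams, 2750 spams), a word seen once in ham and ten times in spam scores about 0.88; a word never seen in ham clamps to 0.99 no matter how often it appears in spam -- exactly the kind of rare-word evidence MINCOUNT used to suppress.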
From tim.one@comcast.net Sat Sep 7 06:20:47 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 07 Sep 2002 01:20:47 -0400
Subject: [Spambayes-checkins] spambayes timtoken.py,1.6,1.7
In-Reply-To:
Message-ID:
[Guido]
> Modified Files:
> timtoken.py
> Log Message:
> Made tokenize() polymorphic. It now accepts an email.Message.Message
> instance, a file-like object (something with a readline method), or a
> string (anything else).
Good change. One question/concern:
> ...
> --- 2,5 ----
> ***************
> *** 555,566 ****
> yield 'content-transfer-encoding:' + x.lower()
>
> ! def tokenize(string):
> # Create an email Message object.
> ! try:
> ! msg = message_from_string(string)
> ! except email.Errors.MessageParseError:
> ! yield 'control: MessageParseError'
> ! # XXX Fall back to the raw body text?
> ! return
>
> # Special tagging of header lines.
> --- 554,570 ----
> yield 'content-transfer-encoding:' + x.lower()
>
> ! def tokenize(obj):
> # Create an email Message object.
> ! if isinstance(obj, email.Message.Message):
> ! msg = obj
> ! elif hasattr(obj, "readline"):
> ! msg = email.message_from_file(obj)
> ! else:
> ! try:
> ! msg = email.message_from_string(obj)
> ! except email.Errors.MessageParseError:
> ! yield 'control: MessageParseError'
> ! # XXX Fall back to the raw body text?
> ! return
>
> # Special tagging of header lines.
It's a fact of life that some messages can't be parsed by the email package,
and the code was careful to catch that when parsing from a string. I don't
see anything here to protect the system from dying if a message can't be
parsed from file. Barry, when would MessageParseError get raised then? At
the time message_from_file() is called (in which case fixing the above is
easy), or at some later time when trying to invoke some method of the
Message object (in which case I'm not sure what to do)?
From guido@python.org Sat Sep 7 06:35:37 2002
From: guido@python.org (Guido van Rossum)
Date: Sat, 07 Sep 2002 01:35:37 -0400
Subject: [Spambayes-checkins] spambayes timtoken.py,1.6,1.7
In-Reply-To: Your message of "Sat, 07 Sep 2002 01:20:47 EDT."
References:
Message-ID: <200209070535.g875Zbm13523@pcp02138704pcs.reston01.va.comcast.net>
> > Made tokenize() polymorphic. It now accepts an email.Message.Message
> > instance, a file-like object (something with a readline method), or a
> > string (anything else).
>
> Good change. One question/concern:
>
> > ! def tokenize(obj):
> > # Create an email Message object.
> > ! if isinstance(obj, email.Message.Message):
> > ! msg = obj
> > ! elif hasattr(obj, "readline"):
> > ! msg = email.message_from_file(obj)
> > ! else:
> > ! try:
> > ! msg = email.message_from_string(obj)
> > ! except email.Errors.MessageParseError:
> > ! yield 'control: MessageParseError'
> > ! # XXX Fall back to the raw body text?
> > ! return
> >
> > # Special tagging of header lines.
>
> It's a fact of life that some messages can't be parsed by the email package,
> and the code was careful to catch that when parsing from a string. I don't
> see anything here to protect the system from dying if a message can't be
> parsed from file. Barry, when would MessageParseError get raised then? At
> the time message_from_file() is called (in which case fixing the above is
> easy), or at some later time when trying to invoke some method of the
> Message object (in which case I'm not sure what to do)?
I'm guessing at the time that message_from_file() is called;
message_from_string() is a thin layer on top of that using StringIO,
so if the above code works for message_from_string(), it should work
for message_from_file(). I'll add it.
--Guido van Rossum (home page: http://www.python.org/~guido/)
From gvanrossum@users.sourceforge.net Sat Sep 7 06:43:11 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 22:43:11 -0700
Subject: [Spambayes-checkins] spambayes timtoken.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18320
Modified Files:
timtoken.py
Log Message:
Catch MessageParseError when calling message_from_file() too.
Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** timtoken.py 7 Sep 2002 04:50:10 -0000 1.7
--- timtoken.py 7 Sep 2002 05:43:08 -0000 1.8
***************
*** 559,563 ****
msg = obj
elif hasattr(obj, "readline"):
! msg = email.message_from_file(obj)
else:
try:
--- 559,568 ----
msg = obj
elif hasattr(obj, "readline"):
! try:
! msg = email.message_from_file(obj)
! except email.Errors.MessageParseError:
! yield 'control: MessageParseError'
! # XXX Fall back to the raw body text?
! return
else:
try:
From montanaro@users.sourceforge.net Sat Sep 7 06:50:44 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Fri, 06 Sep 2002 22:50:44 -0700
Subject: [Spambayes-checkins] spambayes unheader.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19376
Added Files:
unheader.py
Log Message:
script to remove unwanted headers from mbox files
--- NEW FILE: unheader.py ---
#!/usr/bin/env python
import re
import sys
import mailbox
import email.Parser
import email.Message
import getopt
def unheader(msg, pat):
pat = re.compile(pat)
for hdr in msg.keys():
if pat.match(hdr):
del msg[hdr]
class Message(email.Message.Message):
def replace_header(self, hdr, newval):
"""replace first value for hdr with newval"""
hdr = hdr.lower()
for (i, (k, v)) in enumerate(self._headers):
if k.lower() == hdr:
self._headers[i] = (k, newval)
class Parser(email.Parser.Parser):
def __init__(self):
email.Parser.Parser.__init__(self, Message)
def deSA(msg):
if msg['X-Spam-Status']:
if msg['X-Spam-Status'].startswith('Yes'):
pct = msg['X-Spam-Prev-Content-Type']
if pct:
msg['Content-Type'] = pct
pcte = msg['X-Spam-Prev-Content-Transfer-Encoding']
if pcte:
msg['Content-Transfer-Encoding'] = pcte
subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', msg['Subject'])
if subj != msg["Subject"]:
msg.replace_header("Subject", subj)
body = msg.get_payload()
newbody = []
at_start = 1
for line in body.splitlines():
if at_start and line.startswith('SPAM: '):
continue
elif at_start:
at_start = 0
else:
newbody.append(line)
msg.set_payload("\n".join(newbody))
unheader(msg, "X-Spam-")
def process_mailbox(f, dosa=1, pats=None):
for msg in mailbox.PortableUnixMailbox(f, Parser().parse):
if pats is not None:
unheader(msg, pats)
if dosa:
deSA(msg)
print msg
def usage():
print >> sys.stderr, "usage: unheader.py [ -p pat ... ] [ -s ]"
print >> sys.stderr, "-p pat gives a regex pattern used to eliminate unwanted headers"
print >> sys.stderr, "'-p pat' may be given multiple times"
print >> sys.stderr, "-s tells not to remove SpamAssassin headers"
def main(args):
headerpats = []
dosa = 1
try:
opts, args = getopt.getopt(args, "p:sh")
except getopt.GetoptError:
usage()
sys.exit(1)
else:
for opt, arg in opts:
if opt == "-h":
usage()
sys.exit(0)
elif opt == "-p":
headerpats.append(arg)
elif opt == "-s":
dosa = 0
pats = headerpats and "|".join(headerpats) or None
if not args:
f = sys.stdin
elif len(args) == 1:
f = file(args[0])
else:
usage()
sys.exit(1)
process_mailbox(f, dosa, pats)
if __name__ == "__main__":
main(sys.argv[1:])
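The header-stripping core of unheader.py is a regex match against each header name followed by deletion. The same logic on a plain email.message.Message, as a Python 3 sketch (illustration only, not the checked-in script):

```python
import re
from email.message import Message

def unheader(msg, pat):
    """Delete every header whose name matches regex pat, as in the
    unheader() function of the script above."""
    pat = re.compile(pat)
    for hdr in list(msg.keys()):   # copy, since we mutate while iterating
        if pat.match(hdr):
            del msg[hdr]           # removes all occurrences of hdr
```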
From montanaro@users.sourceforge.net Sat Sep 7 06:51:07 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Fri, 06 Sep 2002 22:51:07 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19440
Modified Files:
README.txt
Log Message:
add blurb about unheader.py
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** README.txt 7 Sep 2002 04:28:13 -0000 1.7
--- README.txt 7 Sep 2002 05:51:05 -0000 1.8
***************
*** 50,53 ****
--- 50,57 ----
tokenize() function of your choosing.
+ unheader.py
+ A script to remove unwanted headers from an mbox file. This is mostly
+ useful for deleting headers that might incorrectly bias the results.
+
GBayes.py
A number of tokenizers and a partial test driver. This assumes
From montanaro@users.sourceforge.net Sat Sep 7 06:52:50 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Fri, 06 Sep 2002 22:52:50 -0700
Subject: [Spambayes-checkins] spambayes setup.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19640
Modified Files:
setup.py
Log Message:
* handle timtoken.py, unheader.py and hammie.py
* zap GBayes.py
* should timtoken and classifier go into a spambayes package in site-packages?
Index: setup.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/setup.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** setup.py 5 Sep 2002 16:16:43 -0000 1.1
--- setup.py 7 Sep 2002 05:52:48 -0000 1.2
***************
*** 3,8 ****
setup(
name='spambayes',
! scripts=['GBayes.py'],
! py_modules=['classifier']
)
--- 3,8 ----
setup(
name='spambayes',
! scripts=['unheader.py', 'hammie.py'],
! py_modules=['classifier', 'timtoken']
)
From montanaro@users.sourceforge.net Sat Sep 7 06:53:15 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Fri, 06 Sep 2002 22:53:15 -0700
Subject: [Spambayes-checkins] spambayes .cvsignore,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19725
Modified Files:
.cvsignore
Log Message:
ignore the distutils build dir
Index: .cvsignore
===================================================================
RCS file: /cvsroot/spambayes/spambayes/.cvsignore,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** .cvsignore 7 Sep 2002 05:02:56 -0000 1.1
--- .cvsignore 7 Sep 2002 05:53:12 -0000 1.2
***************
*** 4,5 ****
--- 4,6 ----
*.pik
*.zip
+ build
From rubiconx@users.sourceforge.net Sat Sep 7 07:11:12 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Fri, 06 Sep 2002 23:11:12 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.9,1.10
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22737
Modified Files:
hammie.py
Log Message:
Changes X-Spam-Disposition header to X-Hammie-Disposition
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** hammie.py 7 Sep 2002 04:50:45 -0000 1.9
--- hammie.py 7 Sep 2002 06:11:10 -0000 1.10
***************
*** 3,7 ****
# A driver for the classifier module. Currently mostly a wrapper around
! # existing stuff.
"""Usage: %(program)s [options]
--- 3,8 ----
# A driver for the classifier module. Currently mostly a wrapper around
! # existing stuff. Neale Pickett is the person to
! # blame for this.
"""Usage: %(program)s [options]
***************
*** 41,44 ****
--- 42,48 ----
program = sys.argv[0] # For usage(); referenced by docstring above
+ # Name of the header to add in filter mode
+ DISPHEADER = "X-Hammie-Disposition"
+
# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
***************
*** 75,79 ****
raise KeyError(key)
! def __setitem__(self, key, val):
v = pickle.dumps(val, 1)
self.hash[key] = v
--- 79,83 ----
raise KeyError(key)
! def __setitem__(self, key, val):
v = pickle.dumps(val, 1)
self.hash[key] = v
***************
*** 86,90 ****
while k != None:
key = k[0]
! val = pickle.loads(k[1])
if key not in self.iterskip:
if fn:
--- 90,94 ----
while k != None:
key = k[0]
! val = self.__getitem__(key)
if key not in self.iterskip:
if fn:
***************
*** 234,238 ****
disp += "; %.2f" % prob
disp += "; " + formatclues(clues)
! msg.add_header("X-Spam-Disposition", disp)
output.write(str(msg))
--- 238,242 ----
disp += "; %.2f" % prob
disp += "; " + formatclues(clues)
! msg.add_header(DISPHEADER, disp)
output.write(str(msg))
From gvanrossum@users.sourceforge.net Sat Sep 7 07:18:05 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 23:18:05 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.10,1.11
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24222
Modified Files:
hammie.py
Log Message:
filter(): output 'unixfrom' line only if it was present on input.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** hammie.py 7 Sep 2002 06:11:10 -0000 1.10
--- hammie.py 7 Sep 2002 06:18:03 -0000 1.11
***************
*** 239,243 ****
disp += "; " + formatclues(clues)
msg.add_header(DISPHEADER, disp)
! output.write(str(msg))
def score(bayes, msgs):
--- 239,243 ----
disp += "; " + formatclues(clues)
msg.add_header(DISPHEADER, disp)
! output.write(msg.as_string(unixfrom=(msg.get_unixfrom() is not None)))
def score(bayes, msgs):
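The rev 1.11 fix above passes a computed unixfrom flag to as_string() so the "From " envelope line appears on output only when the input message carried one. A minimal Python 3 sketch of that one-liner (illustration, not the checked-in code; Message.as_string() still accepts the unixfrom keyword):

```python
import io
from email.message import Message

def write_filtered(msg, output):
    """Write msg, emitting the 'From ' envelope line only if the
    input actually had one."""
    output.write(msg.as_string(unixfrom=msg.get_unixfrom() is not None))
```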
From jhylton@users.sourceforge.net Sat Sep 7 17:14:11 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sat, 07 Sep 2002 09:14:11 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,NONE,1.1
README.txt,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv17305
Modified Files:
README.txt
Added Files:
tokenizer.py
Log Message:
Refactor timtoken into tokenizer.
The Tokenizer class has two methods tokenize_headers() and
tokenize_body() that encapsulate most of timtoken's logic. It is a
little easier to extend this class than timtoken, because you can
override either header or body processing individually.
--- NEW FILE: tokenizer.py ---
"""Module to tokenize email messages for spam filtering."""
import email
import re
from sets import Set
# Find all the text components of the msg. There's no point decoding
# binary blobs (like images). If a multipart/alternative has both plain
# text and HTML versions of a msg, ignore the HTML part: HTML decorations
# have monster-high spam probabilities, and innocent newbies often post
# using HTML.
def textparts(msg):
text = Set()
redundant_html = Set()
for part in msg.walk():
if part.get_content_type() == 'multipart/alternative':
# Descend this part of the tree, adding any redundant HTML text
# part to redundant_html.
htmlpart = textpart = None
stack = part.get_payload()
while stack:
subpart = stack.pop()
ctype = subpart.get_content_type()
if ctype == 'text/plain':
textpart = subpart
elif ctype == 'text/html':
htmlpart = subpart
elif ctype == 'multipart/related':
stack.extend(subpart.get_payload())
if textpart is not None:
text.add(textpart)
if htmlpart is not None:
redundant_html.add(htmlpart)
elif htmlpart is not None:
text.add(htmlpart)
elif part.get_content_maintype() == 'text':
text.add(part)
return text - redundant_html
##############################################################################
# To fold case or not to fold case? I didn't want to fold case, because
# it hides information in English, and I have no idea what .lower() does
# to other languages; and, indeed, 'FREE' (all caps) turned out to be one
# of the strongest spam indicators in my content-only tests (== one with
# prob 0.99 *and* made it into spamprob's nbest list very often).
#
# Against preserving case, it makes the database size larger, and requires
# more training data to get enough "representative" mixed-case examples.
#
# Running my c.l.py tests didn't support my intuition that case was
# valuable, so it's getting folded away now. Folding or not made no
# significant difference to the false positive rate, and folding made a
# small (but statistically significant all the same) reduction in the
# false negative rate. There is one obvious difference: after folding
# case, conference announcements no longer got high spam scores. Their
# content was usually fine, but they were highly penalized for VISIT OUR
# WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING. That is
# indeed the language of advertising, and I halfway regret that folding
# away case no longer picks on them.
#
# Since the f-p rate didn't change, but conference announcements escaped
# that category, something else took their place. It seems to be highly
# off-topic messages, like debates about Microsoft's place in the world.
# Talk about "money" and "lucrative" is indistinguishable now from talk
# about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.
##############################################################################
# Character n-grams or words?
#
# With careful multiple-corpora c.l.py tests sticking to case-folded decoded
# text-only portions, and ignoring headers, and with identical special
# parsing & tagging of embedded URLs:
#
# Character 3-grams gave 5x as many false positives as split-on-whitespace
# (s-o-w). The f-n rate was also significantly worse, but within a factor
# of 2. So character 3-grams lost across the board.
#
# Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
# s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
# difference between 23 and 34 f-ps. There aren't enough there to say that's
# significantly more with killer-high confidence. There were plenty of f-ns,
# though, and the f-n rate with character 5-grams was substantially *worse*
# than with character 3-grams (which in turn was substantially worse than
# with s-o-w).
#
# Training on character 5-grams creates many more unique tokens than s-o-w:
# a typical run bloated to 150MB process size. It also ran a lot slower than
# s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
# dict. I rarely noticed disk activity when running s-o-w, so rarely bothered
# to look at process size; it was under 30MB last time I looked.
#
# Figuring out *why* a msg scored as it did proved much more mysterious when
# working with character n-grams: they often had no obvious "meaning". In
# contrast, it was always easy to figure out what s-o-w was picking up on.
# 5-grams flagged a msg from Christian Tismer as spam, where he was discussing
# the speed of tasklets under his new implementation of stackless:
#
# prob = 0.99999998959
# prob('ed sw') = 0.01
# prob('http0:pgp') = 0.01
# prob('http0:python') = 0.01
# prob('hlon ') = 0.99
# prob('http0:wwwkeys') = 0.01
# prob('http0:starship') = 0.01
# prob('http0:stackless') = 0.01
# prob('n xp ') = 0.99
# prob('on xp') = 0.99
# prob('p 150') = 0.99
# prob('lon x') = 0.99
# prob(' amd ') = 0.99
# prob(' xp 1') = 0.99
# prob(' athl') = 0.99
# prob('1500+') = 0.99
# prob('xp 15') = 0.99
#
# The spam decision was baffling until I realized that *all* the high-
# probability spam 5-grams there came out of a single phrase:
#
# AMD Athlon XP 1500+
#
# So Christian was punished for using a machine lots of spam tries to sell.
# In a classic Bayesian classifier, this probably wouldn't have
# mattered, but Graham's throws away almost all the 5-grams from a msg,
# saving only the about-a-dozen farthest from a neutral 0.5. So one bad
# phrase can kill you! This appears to happen very rarely, but happened
# more than once.
#
# The conclusion is that character n-grams have almost nothing to recommend
# them under Graham's scheme: harder to work with, slower, much larger
# database, worse results, and prone to rare mysterious disasters.
#
# There's one area they won hands-down: detecting spam in what I assume are
# Asian languages. The s-o-w scheme sometimes finds only line-ends to split
# on then, and then a "hey, this 'word' is way too big! let's ignore it"
# gimmick kicks in, and produces no tokens at all.
#
# [Later: we produce character 5-grams then under the s-o-w scheme, instead
# ignoring the blob, but only if there are high-bit characters in the blob;
# e.g., there's no point 5-gramming uuencoded lines, and doing so would
# bloat the database size.]
#
# Interesting: despite that odd example above, the *kinds* of f-p mistakes
# 5-grams made were very much like s-o-w made -- I recognized almost all of
# the 5-gram f-p messages from previous s-o-w runs. For example, both
# schemes have a particular hatred for conference announcements, although
# s-o-w stopped hating them after folding case. But 5-grams still hate them.
# Both schemes also hate msgs discussing HTML with examples, with about equal
# passion. Both schemes hate brief "please subscribe [unsubscribe] me"
# msgs, although 5-grams seems to hate them more.
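For concreteness, the character 5-gram scheme discussed above can be sketched in a few lines; applying it to the phrase that sank Christian's message (case-folded first, as in the tests) reproduces several of the 5-grams in the clue list. `char_ngrams` is a hypothetical helper, not part of the module:

```python
# Character n-gram generation, as compared against split-on-whitespace
# above.  char_ngrams() is a hypothetical stand-in for illustration.
def char_ngrams(text, n=5):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text) - n + 1)]

grams = char_ngrams("AMD Athlon XP 1500+")
print(grams[:3])                           # ['amd a', 'md at', 'd ath']
print('1500+' in grams, ' athl' in grams)  # both appear in the clue list
```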
##############################################################################
# How to tokenize?
#
# I started with string.split() merely for speed. Over time I realized it
# was making interesting context distinctions qualitatively akin to n-gram
# schemes; e.g., "free!!" is a much stronger spam indicator than "free". But
# unlike n-grams (whether word- or character- based) under Graham's scoring
# scheme, this mild context dependence never seems to go over the edge in
# giving "too much" credence to an unlucky phrase.
#
# OTOH, compared to "searching for words", it increases the size of the
# database substantially, less than but close to a factor of 2. This is very
# much less than a word bigram scheme bloats it, but as always an increase
# isn't justified unless the results are better.
#
# Following are stats comparing
#
# for token in text.split(): # left column
#
# to
#
# for token in re.findall(r"[\w$\-\x80-\xff]+", text): # right column
#
# text is case-normalized (text.lower()) in both cases, and the runs were
# identical in all other respects. The results clearly favor the split()
# gimmick, although they vaguely suggest that some sort of compromise
# may do as well with less database burden; e.g., *perhaps* folding runs of
# "punctuation" characters into a canonical representative could do that.
# But the database size is reasonable without that, and plain split() avoids
# having to worry about how to "fold punctuation" in languages other than
# English.
#
# false positive percentages
# 0.000 0.000 tied
# 0.000 0.050 lost
# 0.050 0.150 lost
# 0.000 0.025 lost
# 0.025 0.050 lost
# 0.025 0.075 lost
# 0.050 0.150 lost
# 0.025 0.000 won
# 0.025 0.075 lost
# 0.000 0.025 lost
# 0.075 0.150 lost
# 0.050 0.050 tied
# 0.025 0.050 lost
# 0.000 0.025 lost
# 0.050 0.025 won
# 0.025 0.000 won
# 0.025 0.025 tied
# 0.000 0.025 lost
# 0.025 0.075 lost
# 0.050 0.175 lost
#
# won 3 times
# tied 3 times
# lost 14 times
#
# total unique fp went from 8 to 20
#
# false negative percentages
# 0.945 1.200 lost
# 0.836 1.018 lost
# 1.200 1.200 tied
# 1.418 1.636 lost
# 1.455 1.418 won
# 1.091 1.309 lost
# 1.091 1.272 lost
# 1.236 1.563 lost
# 1.564 1.855 lost
# 1.236 1.491 lost
# 1.563 1.599 lost
# 1.563 1.781 lost
# 1.236 1.709 lost
# 0.836 0.982 lost
# 0.873 1.382 lost
# 1.236 1.527 lost
# 1.273 1.418 lost
# 1.018 1.273 lost
# 1.091 1.091 tied
# 1.490 1.454 won
#
# won 2 times
# tied 2 times
# lost 16 times
#
# total unique fn went from 292 to 302
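The two candidate tokenizers compared above can be run side by side on an invented sample; note how split() keeps the "free!!" context the comment describes, while the regexp discards the punctuation:

```python
# The two tokenization candidates from the comparison above, applied to a
# made-up sample line.  Both operate on case-normalized text.
import re

text = "Get it FREE!! Visit http://example.com now".lower()
split_tokens = text.split()                          # left column
re_tokens = re.findall(r"[\w$\-\x80-\xff]+", text)   # right column

print(split_tokens)  # ['get', 'it', 'free!!', 'visit', 'http://example.com', 'now']
print(re_tokens)     # ['get', 'it', 'free', 'visit', 'http', 'example', 'com', 'now']
```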
##############################################################################
# What about HTML?
#
# Computer geeks seem to view use of HTML in mailing lists and newsgroups as
# a mortal sin. Normal people don't, but so it goes: in a technical list/
# group, every HTML decoration has spamprob 0.99, there are lots of unique
# HTML decorations, and lots of them appear at the very start of the message
# so that Graham's scoring scheme latches on to them tight. As a result,
# any plain text message just containing an HTML example is likely to be
# judged spam (every HTML decoration is an extreme).
#
# So if a message is multipart/alternative with both text/plain and text/html
# branches, we ignore the latter, else newbies would never get a message
# through. If a message is just HTML, it has virtually no chance of getting
# through.
#
# In an effort to let normal people use mailing lists too, and to
# alleviate the woes of messages merely *discussing* HTML practice, I
# added a gimmick to strip HTML tags after case-normalization and after
# special tagging of embedded URLs. This consisted of a regexp sub pattern,
# where instances got replaced by single blanks:
#
# html_re = re.compile(r"""
# <
# [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
# [^>]{0,128} # search for the end '>', but don't chew up the world
# >
# """, re.VERBOSE)
#
# and then
#
# text = html_re.sub(' ', text)
#
# Alas, little good came of this:
#
# false positive percentages
# 0.000 0.000 tied
# 0.000 0.000 tied
# 0.050 0.075 lost
# 0.000 0.000 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.050 0.050 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.000 0.050 lost
# 0.075 0.100 lost
# 0.050 0.050 tied
# 0.025 0.025 tied
# 0.000 0.025 lost
# 0.050 0.050 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.000 0.000 tied
# 0.025 0.050 lost
# 0.050 0.050 tied
#
# won 0 times
# tied 15 times
# lost 5 times
#
# total unique fp went from 8 to 12
#
# false negative percentages
# 0.945 1.164 lost
# 0.836 1.418 lost
# 1.200 1.272 lost
# 1.418 1.272 won
# 1.455 1.273 won
# 1.091 1.382 lost
# 1.091 1.309 lost
# 1.236 1.381 lost
# 1.564 1.745 lost
# 1.236 1.564 lost
# 1.563 1.781 lost
# 1.563 1.745 lost
# 1.236 1.455 lost
# 0.836 0.982 lost
# 0.873 1.309 lost
# 1.236 1.381 lost
# 1.273 1.273 tied
# 1.018 1.273 lost
# 1.091 1.200 lost
# 1.490 1.599 lost
#
# won 2 times
# tied 1 times
# lost 17 times
#
# total unique fn went from 292 to 327
#
# The messages merely discussing HTML were no longer fps, so it did what it
# intended there. But the f-n rate nearly doubled on at least one run -- so
# strong a set of spam indicators is the mere presence of HTML. The increase
# in the number of fps despite that the HTML-discussing msgs left that
# category remains mysterious to me, but it wasn't a significant increase
# so I let it drop.
#
# Later: If I simply give up on making mailing lists friendly to my sisters
# (they're not nerds, and create wonderfully attractive HTML msgs), a
# compromise is to strip HTML tags from only text/plain msgs. That's
# principled enough so far as it goes, and eliminates the HTML-discussing
# false positives. It remains disturbing that the f-n rate on pure HTML
# msgs increases significantly when stripping tags, so the code here doesn't
# do that part. However, even after stripping tags, the rates above show that
# at least 98% of spams are still correctly identified as spam.
# XXX So, if another way is found to slash the f-n rate, the decision here
# XXX not to strip HTML from HTML-only msgs should be revisited.
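The tag-stripping regexp from the experiment above can be exercised directly on an invented sample; the `[^\s<>]` guard is what keeps `a < b` and `i << 5` intact while real tags vanish:

```python
# The html_re sub pattern described in the comment above, applied to a
# made-up line mixing real tags with tag-lookalikes.
import re

html_re = re.compile(r"""
    <
    [^\s<>]        # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
    [^>]{0,128}    # search for the end '>', but don't chew up the world
    >
""", re.VERBOSE)

sample = "if a < b: use <b>bold</b> and i << 5 stays"
stripped = html_re.sub(' ', sample)
print(stripped)   # only <b> and </b> are replaced by blanks
```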
url_re = re.compile(r"""
    (https? | ftp)  # capture the protocol
    ://             # skip the boilerplate
    # Do a reasonable attempt at detecting the end.  It may or may not
    # be in HTML, may or may not be in quotes, etc.  If it's full of %
    # escapes, cool -- that's a clue too.
    ([^\s<>'"\x7f-\xff]+)  # capture the guts
""", re.VERBOSE)

urlsep_re = re.compile(r"[;?:@&=+,$.]")

has_highbit_char = re.compile(r"[\x80-\xff]").search

# Cheap-ass gimmick to probabilistically find HTML/XML tags.
html_re = re.compile(r"""
    <
    [^\s<>]         # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
    [^>]{0,128}     # search for the end '>', but don't run wild
    >
""", re.VERBOSE)
# I'm usually just splitting on whitespace, but for subject lines I want to
# break things like "Python/Perl comparison?" up. OTOH, I don't want to
# break up the unitized numbers in spammish subject phrases like "Increase
# size 79%" or "Now only $29.95!". Then again, I do want to break up
# "Python-Dev".
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
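A quick demonstration of what that pattern keeps and splits, using the examples from the comment (the regexp is reproduced here so the snippet is self-contained):

```python
# subject_word_re in action on the comment's own examples: '/' and '?'
# split, unitized numbers survive, and '-' splits "Python-Dev".
import re

subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")

print(subject_word_re.findall("Python/Perl comparison?"))  # ['Python', 'Perl', 'comparison']
print(subject_word_re.findall("Now only $29.95!"))         # ['Now', 'only', '$29.95']
print(subject_word_re.findall("Increase size 79%"))        # ['Increase', 'size', '79%']
print(subject_word_re.findall("Python-Dev"))               # ['Python', 'Dev']
```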
def tokenize_word(word, _len=len):
    n = _len(word)

    # XXX How big should "a word" be?
    # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
    # XXX on f-p rate, and did a little better or worse than 12 across
    # XXX runs -- overall, no significant difference. It's only "common
    # XXX sense" so far driving the exclusion of lengths 1 and 2.

    # Make sure this range matches in tokenize().
    if 3 <= n <= 12:
        yield word

    elif n >= 3:
        # A long word.

        # Don't want to skip embedded email addresses.
        if n < 40 and '.' in word and word.count('@') == 1:
            p1, p2 = word.split('@')
            yield 'email name:' + p1
            for piece in p2.split('.'):
                yield 'email addr:' + piece

        # If there are any high-bit chars,
        # tokenize it as byte 5-grams.
        # XXX This really won't work for high-bit languages -- the scoring
        # XXX scheme throws almost everything away, and one bad phrase can
        # XXX generate enough bad 5-grams to dominate the final score.
        # XXX This also increases the database size substantially.
        elif has_highbit_char(word):
            for i in xrange(n-4):
                yield "5g:" + word[i : i+5]

        else:
            # It's a long string of "normal" chars. Ignore it.
            # For example, it may be an embedded URL (which we already
            # tagged), or a uuencoded line.
            # There's value in generating a token indicating roughly how
            # many chars were skipped. This has real benefit for the f-n
            # rate, but is neutral for the f-p rate. I don't know why!
            # XXX Figure out why, and/or see if some other way of summarizing
            # XXX this info has greater benefit.
            yield "skip:%c %d" % (word[0], n // 10 * 10)
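The branches above can be illustrated with a small Python 3 sketch (the original is Python 2: `xrange`; the high-bit 5-gram branch is omitted here for brevity). `demo_tokenize_word` is a hypothetical stand-in that collects the yields into a list:

```python
# Illustrates tokenize_word()'s three main branches: short words pass
# through, embedded email addresses get cracked apart, and long "normal"
# strings collapse to a skip token.  demo_tokenize_word() is a sketch,
# not the module's generator.
def demo_tokenize_word(word):
    n = len(word)
    if 3 <= n <= 12:
        return [word]
    if n >= 3 and n < 40 and '.' in word and word.count('@') == 1:
        p1, p2 = word.split('@')
        return (['email name:' + p1]
                + ['email addr:' + piece for piece in p2.split('.')])
    if n >= 3:
        # First char plus length rounded down to a multiple of 10.
        return ["skip:%c %d" % (word[0], n // 10 * 10)]
    return []

print(demo_tokenize_word("free!!"))            # short word: passes through
print(demo_tokenize_word("guido@python.org"))  # cracked into name/addr tokens
print(demo_tokenize_word("x" * 57))            # long blob: skip token
```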
# Generate tokens for:
#    Content-Type
#         and its type= param
#    Content-Disposition
#         and its filename= param
#    all the charsets
#
# This has huge benefit for the f-n rate, and virtually none on the f-p rate,
# although it does reduce the variance of the f-p rate across different
# training sets (really marginal msgs, like a brief HTML msg saying just
# "unsubscribe me", are almost always tagged as spam now; before they were
# right on the edge, and now the multipart/alternative pushes them over it
# more consistently).
#
# XXX I put all of this in as one chunk. I don't know which parts are
# XXX most effective; it could be that some parts don't help at all. But
# XXX given the nature of the c.l.py tests, it's not surprising that the
# XXX 'content-type:text/html'
# XXX token is now the single most powerful spam indicator (== makes it
# XXX into the nbest list most often). What *is* a little surprising is
# XXX that this doesn't push more mixed-type msgs into the f-p camp --
# XXX unlike looking at *all* HTML tags, this is just one spam indicator
# XXX instead of dozens, so relevant msg content can cancel it out.
#
# A bug in this code prevented Content-Transfer-Encoding from getting
# picked up. Fixing that bug showed that it didn't help, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it):
#
# false positive percentages
# 0.000 0.000 tied
# 0.000 0.000 tied
# 0.100 0.100 tied
# 0.000 0.000 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.100 0.100 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.050 0.050 tied
# 0.100 0.100 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.025 0.025 tied
# 0.000 0.025 lost +(was 0)
# 0.025 0.025 tied
# 0.100 0.100 tied
#
# won 0 times
# tied 19 times
# lost 1 times
#
# total unique fp went from 9 to 10
#
# false negative percentages
# 0.364 0.400 lost +9.89%
# 0.400 0.364 won -9.00%
# 0.400 0.436 lost +9.00%
# 0.909 0.872 won -4.07%
# 0.836 0.836 tied
# 0.618 0.618 tied
# 0.291 0.291 tied
# 1.018 0.981 won -3.63%
# 0.982 0.982 tied
# 0.727 0.727 tied
# 0.800 0.800 tied
# 1.163 1.127 won -3.10%
# 0.764 0.836 lost +9.42%
# 0.473 0.473 tied
# 0.473 0.618 lost +30.66%
# 0.727 0.763 lost +4.95%
# 0.655 0.618 won -5.65%
# 0.509 0.473 won -7.07%
# 0.545 0.582 lost +6.79%
# 0.509 0.509 tied
#
# won 6 times
# tied 8 times
# lost 6 times
#
# total unique fn went from 168 to 169
def crack_content_xyz(msg):
    x = msg.get_type()
    if x is not None:
        yield 'content-type:' + x.lower()

    x = msg.get_param('type')
    if x is not None:
        yield 'content-type/type:' + x.lower()

    for x in msg.get_charsets(None):
        if x is not None:
            yield 'charset:' + x.lower()

    x = msg.get('content-disposition')
    if x is not None:
        yield 'content-disposition:' + x.lower()
        fname = msg.get_filename()
        if fname is not None:
            for x in fname.lower().split('/'):
                for y in x.split('.'):
                    yield 'filename:' + y

    if 0:   # disabled; see comment before function
        x = msg.get('content-transfer-encoding')
        if x is not None:
            yield 'content-transfer-encoding:' + x.lower()
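A sketch of the header tokens crack_content_xyz() produces, adapted to the modern Python 3 email API (`msg.get_type()` is Python 2 only; `get_content_type()` is the closest modern spelling). `crack_content_tokens` is a hypothetical cut-down version covering just the Content-Type and Content-Disposition cases:

```python
# Cut-down, Python 3 adaptation of crack_content_xyz() for illustration.
from email.message import EmailMessage

def crack_content_tokens(msg):
    yield 'content-type:' + msg.get_content_type()
    x = msg.get_param('type')
    if x is not None:
        yield 'content-type/type:' + x.lower()
    x = msg.get('content-disposition')
    if x is not None:
        yield 'content-disposition:' + x.lower()

msg = EmailMessage()
msg.set_content("hello")   # gives the message a text/plain content type
print(list(crack_content_tokens(msg)))
```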
class Tokenizer:

    def get_message(self, obj):
        if isinstance(obj, email.Message.Message):
            return obj
        else:
            # Create an email Message object.
            try:
                if hasattr(obj, "readline"):
                    return email.message_from_file(obj)
                else:
                    return email.message_from_string(obj)
            except email.Errors.MessageParseError:
                return None

    def tokenize(self, obj):
        msg = self.get_message(obj)
        if msg is None:
            yield 'control: MessageParseError'
            # XXX Fall back to the raw body text?
            return

        for tok in self.tokenize_headers(msg):
            yield tok
        for tok in self.tokenize_body(msg):
            yield tok

    def tokenize_headers(self, msg):
        # Special tagging of header lines.
        # XXX TODO Neil Schemenauer has gotten a good start on this
        # XXX (pvt email). The headers in my spam and ham corpora are
        # XXX so different (they came from different sources) that if
        # XXX I include them the classifier's job is trivial. Only
        # XXX some "safe" header lines are included here, where "safe"
        # XXX is specific to my sorry corpora.

        # Content-{Type, Disposition} and their params, and charsets.
        t = ''
        for x in msg.walk():
            for w in crack_content_xyz(x):
                yield t + w
            t = '>'

        # Subject:
        # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
        # especially significant in this context. Experiment showed a small
        # but real benefit to keeping case intact in this specific context.
        x = msg.get('subject', '')
        for w in subject_word_re.findall(x):
            for t in tokenize_word(w):
                yield 'subject:' + t

        # Dang -- I can't use Sender:. If I do,
        #     'sender:email name:python-list-admin'
        # becomes the most powerful indicator in the whole database.
        #
        # From:
        # Reply-To:
        for field in ('from',
                      # 'reply-to',
                      ):
            prefix = field + ':'
            x = msg.get(field, 'none').lower()
            for w in x.split():
                for t in tokenize_word(w):
                    yield prefix + t

        # These headers seem to work best if they're not tokenized: just
        # normalize case and whitespace.
        # X-Mailer:   This is a pure and significant win for the f-n rate; f-p
        #             rate isn't affected.
        # User-Agent: Skipping it, as it made no difference. Very few spams
        #             had a User-Agent field, but lots of hams didn't either,
        #             and the spam probability of User-Agent was very close to
        #             0.5 (== not a valuable discriminator) across all
        #             training sets.
        for field in ('x-mailer',):
            prefix = field + ':'
            x = msg.get(field, 'none').lower()
            yield prefix + ' '.join(x.split())

        # Organization:
        # Oddly enough, tokenizing this doesn't make any difference to
        # results. However, noting its mere absence is strong enough
        # to give a tiny improvement in the f-n rate, and since
        # recording that requires only one token across the whole
        # database, the cost is also tiny.
        if msg.get('organization', None) is None:
            yield "bool:noorg"

        # XXX Following is a great idea due to Anthony Baxter. I can't use it
        # XXX on my test data because the header lines are so different between
        # XXX my ham and spam that it makes a large improvement for bogus
        # XXX reasons. So it's commented out. But it's clearly a good thing
        # XXX to do on "normal" data, and subsumes the Organization trick above
        # XXX in a much more general way, yet at comparable cost.

        # X-UIDL:
        # Anthony Baxter's idea. This has spamprob 0.99! The value
        # is clearly irrelevant, just the presence or absence matters.
        # However, it's extremely rare in my spam sets, so doesn't
        # have much value.
        #
        # As also suggested by Anthony, we can capture all such header
        # oddities just by generating tags for the count of how many
        # times each header field appears.
        ##x2n = {}
        ##for x in msg.keys():
        ##    x2n[x] = x2n.get(x, 0) + 1
        ##for x in x2n.items():
        ##    yield "header:%s:%d" % x

    def tokenize_body(self, msg):
        # Find, decode (base64, qp), and tokenize textual parts of the body.
        for part in textparts(msg):
            # Decode, or take it as-is if decoding fails.
            try:
                text = part.get_payload(decode=True)
            except:
                yield "control: couldn't decode"
                text = part.get_payload(decode=False)

            if text is None:
                yield 'control: payload is None'
                continue

            # Normalize case.
            text = text.lower()

            # Special tagging of embedded URLs.
            for proto, guts in url_re.findall(text):
                yield "proto:" + proto
                # Lose the trailing punctuation for casual embedding, like:
                #     The code is at http://mystuff.org/here?  Didn't resolve.
                # or
                #     I found it at http://mystuff.org/there/.  Thanks!
                assert guts
                while guts and guts[-1] in '.:?!/':
                    guts = guts[:-1]
                for i, piece in enumerate(guts.split('/')):
                    prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
                    for chunk in urlsep_re.split(piece):
                        yield prefix + chunk

            # Remove HTML/XML tags if it's a plain text message.
            if part.get_content_type() == "text/plain":
                text = html_re.sub(' ', text)

            # Tokenize everything in the body.
            for w in text.split():
                n = len(w)
                # Make sure this range matches in tokenize_word().
                if 3 <= n <= 12:
                    yield w
                elif n >= 3:
                    for t in tokenize_word(w):
                        yield t

tokenize = Tokenizer().tokenize
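Pulling the URL-tagging logic of tokenize_body() out into a standalone Python 3 sketch shows the positional prefix scheme at work: the first two path components get positional prefixes ("http0:", "http1:"), everything deeper collapses to "http>1:", and urlsep_re splits pieces on common separators. `url_tokens` is a hypothetical helper, and its one-line regexp is a simplified equivalent of the VERBOSE url_re above:

```python
# Standalone demo of the URL tagging done in tokenize_body().
import re

url_re = re.compile(r'''(https?|ftp)://([^\s<>'"\x7f-\xff]+)''')
urlsep_re = re.compile(r"[;?:@&=+,$.]")

def url_tokens(text):
    for proto, guts in url_re.findall(text.lower()):
        yield "proto:" + proto
        # Lose trailing punctuation from casual embedding.
        while guts and guts[-1] in '.:?!/':
            guts = guts[:-1]
        for i, piece in enumerate(guts.split('/')):
            # Positions 0 and 1 are tagged individually; deeper path
            # components all share the '>1' prefix.
            prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
            for chunk in urlsep_re.split(piece):
                yield prefix + chunk

tokens = list(url_tokens("see http://python.org/doc/current/lib/ now"))
print(tokens)
```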
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** README.txt 7 Sep 2002 05:51:05 -0000 1.8
--- README.txt 7 Sep 2002 16:14:09 -0000 1.9
***************
*** 37,44 ****
A concrete test driver like timtest.py (see below), but working
with a pair of mailbox files rather than the specialized timtest
! setup. Note that the validity of results from mboxtest.py have
! yet to be confirmed.
! timtoken.py
An implementation of tokenize() that Tim can't seem to help but keep
working on .
--- 37,43 ----
A concrete test driver like timtest.py (see below), but working
with a pair of mailbox files rather than the specialized timtest
! setup.
! tokenizer.py
An implementation of tokenize() that Tim can't seem to help but keep
working on .
From jhylton@users.sourceforge.net Sat Sep 7 17:15:47 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sat, 07 Sep 2002 09:15:47 -0700
Subject: [Spambayes-checkins]
spambayes hammie.py,1.11,1.12 setup.py,1.2,1.3 timtest.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv17725
Modified Files:
hammie.py setup.py timtest.py
Log Message:
Use tokenizer module.
XXX Watch out, Tim! I just changed timtest out from under you.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** hammie.py 7 Sep 2002 06:18:03 -0000 1.11
--- hammie.py 7 Sep 2002 16:15:45 -0000 1.12
***************
*** 47,51 ****
# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
! from timtoken import tokenize
class DBDict:
--- 47,51 ----
# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
! from tokenizer import tokenize
class DBDict:
Index: setup.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/setup.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** setup.py 7 Sep 2002 05:52:48 -0000 1.2
--- setup.py 7 Sep 2002 16:15:45 -0000 1.3
***************
*** 4,8 ****
name='spambayes',
scripts=['unheader.py', 'hammie.py'],
! py_modules=['classifier', 'timtoken']
)
--- 4,8 ----
name='spambayes',
scripts=['unheader.py', 'hammie.py'],
! py_modules=['classifier', 'tokenizer']
)
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timtest.py 7 Sep 2002 05:11:31 -0000 1.11
--- timtest.py 7 Sep 2002 16:15:45 -0000 1.12
***************
*** 13,17 ****
import Tester
import classifier
! from timtoken import tokenize
class Hist:
--- 13,17 ----
import Tester
import classifier
! from tokenizer import tokenize
class Hist:
***************
*** 63,67 ****
print "prob(%r) = %g" % clue
print
! guts = msg.guts
if charlimit is not None:
guts = guts[:charlimit]
--- 63,67 ----
print "prob(%r) = %g" % clue
print
! guts = str(msg)
if charlimit is not None:
guts = guts[:charlimit]
***************
*** 86,89 ****
--- 86,92 ----
return self.tag == other.tag
+ def __str__(self):
+ return self.guts
+
class MsgStream(object):
def __init__(self, directory):
***************
*** 153,157 ****
printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
! def test(self, ham, spam):
c = self.classifier
t = self.tester
--- 156,160 ----
printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
! def test(self, ham, spam, charlimit=None):
c = self.classifier
t = self.tester
***************
*** 168,172 ****
print "Low prob spam!", prob
prob, clues = c.spamprob(msg, True)
! printmsg(msg, prob, clues)
t.reset_test_results()
--- 171,175 ----
print "Low prob spam!", prob
prob, clues = c.spamprob(msg, True)
! printmsg(msg, prob, clues, charlimit)
t.reset_test_results()
***************
*** 185,189 ****
print '*' * 78
prob, clues = c.spamprob(e, True)
! printmsg(e, prob, clues)
newfneg = Set(t.false_negatives()) - self.falseneg
--- 188,192 ----
print '*' * 78
prob, clues = c.spamprob(e, True)
! printmsg(e, prob, clues, charlimit)
newfneg = Set(t.false_negatives()) - self.falseneg
From jhylton@users.sourceforge.net Sat Sep 7 17:17:21 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sat, 07 Sep 2002 09:17:21 -0700
Subject: [Spambayes-checkins] spambayes mboxtest.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18055
Modified Files:
mboxtest.py
Log Message:
A bunch of unrelated updates.
Add docstring.
Use tokenizer module.
Add MyTokenizer that knows less about how to deal with headers.
Add custom __str__() to MboxMsg to suppress boring headers.
Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** mboxtest.py 6 Sep 2002 19:26:34 -0000 1.1
--- mboxtest.py 7 Sep 2002 16:17:19 -0000 1.2
***************
*** 1,5 ****
#! /usr/bin/env python
! from timtoken import tokenize
from classifier import GrahamBayes
from Tester import Test
--- 1,26 ----
#! /usr/bin/env python
+ """mboxtest.py: A test driver for classifier.
! Usage: mboxtest.py [options]
!
! Options:
! -f FMT
! One of unix, mmdf, mh, or qmail. Specifies mailbox format for
! ham and spam files. Default is unix.
!
! -n NSETS
! Number of test sets to create for a single mailbox. Default is 5.
!
! -s SEED
! Seed for random number generator. Default is 101.
!
! -m MSGS
! Read no more than MSGS messages from mailbox.
!
! -l LIMIT
! Print no more than LIMIT characters of a message in test output.
! """
!
! from tokenizer import Tokenizer, subject_word_re, tokenize_word, tokenize
from classifier import GrahamBayes
from Tester import Test
***************
*** 18,21 ****
--- 39,58 ----
}
+ class MyTokenizer(Tokenizer):
+
+ skip = {'received': 1,
+ 'date': 1,
+ 'x-from_': 1,
+ }
+
+ def tokenize_headers(self, msg):
+ for k, v in msg.items():
+ k = k.lower()
+ if k in self.skip or k.startswith('x-vm'):
+ continue
+ for w in subject_word_re.findall(v):
+ for t in tokenize_word(w):
+ yield "%s:%s" % (k, t)
+
class MboxMsg(Msg):
***************
*** 24,27 ****
--- 61,86 ----
self.tag = "%s:%s %s" % (path, index, subject(self.guts))
+ def __str__(self):
+ lines = []
+ i = 0
+ for line in self.guts.split("\n"):
+ skip = False
+ for skip_prefix in 'X-', 'Received:', '\t',:
+ if line.startswith(skip_prefix):
+ skip = True
+ if skip:
+ continue
+ i += 1
+ if i > 100:
+ lines.append("... truncated")
+ break
+ lines.append(line)
+ return "\n".join(lines)
+
+ ## tokenize = MyTokenizer().tokenize
+
+ def __iter__(self):
+ return tokenize(self.guts)
+
class mbox(object):
***************
*** 77,82 ****
NSETS = 5
SEED = 101
! LIMIT = None
! opts, args = getopt.getopt(args, "f:n:s:l:")
for k, v in opts:
if k == '-f':
--- 136,142 ----
NSETS = 5
SEED = 101
! MAXMSGS = None
! CHARLIMIT = 1000
! opts, args = getopt.getopt(args, "f:n:s:l:m:")
for k, v in opts:
if k == '-f':
***************
*** 87,91 ****
SEED = int(v)
if k == '-l':
! LIMIT = int(v)
ham, spam = args
--- 147,153 ----
SEED = int(v)
if k == '-l':
! CHARLIMIT = int(v)
! if k == '-m':
! MAXMSGS = int(v)
ham, spam = args
***************
*** 96,102 ****
nspam = len(list(mbox(spam)))
! if LIMIT:
! nham = min(nham, LIMIT)
! nspam = min(nspam, LIMIT)
print "ham", ham, nham
--- 158,164 ----
nspam = len(list(mbox(spam)))
! if MAXMSGS:
! nham = min(nham, MAXMSGS)
! nspam = min(nspam, MAXMSGS)
print "ham", ham, nham
***************
*** 115,120 ****
if (iham, ispam) == (ihtest, istest):
continue
! driver.test(mbox(ham, ihtest), mbox(spam, istest))
! driver.finish()
driver.alldone()
--- 177,182 ----
if (iham, ispam) == (ihtest, istest):
continue
! driver.test(mbox(ham, ihtest), mbox(spam, istest), CHARLIMIT)
! driver.finishtest()
driver.alldone()
From jhylton@users.sourceforge.net Sat Sep 7 17:39:06 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sat, 07 Sep 2002 09:39:06 -0700
Subject: [Spambayes-checkins] spambayes rates.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22298
Modified Files:
rates.py
Log Message:
Change to work with mboxtest.py output.
Index: rates.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rates.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** rates.py 5 Sep 2002 23:34:41 -0000 1.1
--- rates.py 7 Sep 2002 16:39:04 -0000 1.2
***************
*** 27,31 ****
new false positives: ['Data/Ham/Set2/66645.txt']
"""
! pat1 = re.compile(r'\s*Training on Data/').match
pat2 = re.compile(r'\s+false (positive|negative): (.*)').match
pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match
--- 27,31 ----
new false positives: ['Data/Ham/Set2/66645.txt']
"""
! pat1 = re.compile(r'\s*Training on ').match
pat2 = re.compile(r'\s+false (positive|negative): (.*)').match
pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match
From rubiconx@users.sourceforge.net Sat Sep 7 18:12:24 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Sat, 07 Sep 2002 10:12:24 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.12,1.13
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30001
Modified Files:
hammie.py
Log Message:
New DEFAULTDB global variable, updated usage docstring.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** hammie.py 7 Sep 2002 16:15:45 -0000 1.12
--- hammie.py 7 Sep 2002 17:12:22 -0000 1.13
***************
*** 2,8 ****
# At the moment, this requires Python 2.3 from CVS
! # A driver for the classifier module. Currently mostly a wrapper around
! # existing stuff. Neale Pickett is the person to
! # blame for this.
"""Usage: %(program)s [options]
--- 2,7 ----
# At the moment, this requires Python 2.3 from CVS
! # A driver for the classifier module and Tim's tokenizer that you can
! # call from procmail.
"""Usage: %(program)s [options]
***************
*** 19,23 ****
-p FILE
use file as the persistent store. loads data from this file if it
! exists, and saves data to this file at the end. Default: hammie.db
-d
use the DBM store instead of cPickle. The file is larger and
--- 18,22 ----
-p FILE
use file as the persistent store. loads data from this file if it
! exists, and saves data to this file at the end. Default: %(DEFAULTDB)s
-d
use the DBM store instead of cPickle. The file is larger and
***************
*** 26,30 ****
-f
run as a filter: read a single message from stdin, add an
! X-Spam-Disposition header, and write it to stdout.
"""
--- 25,29 ----
-f
run as a filter: read a single message from stdin, add an
! %(DISPHEADER)s header, and write it to stdout.
"""
***************
*** 45,48 ****
--- 44,50 ----
DISPHEADER = "X-Hammie-Disposition"
+ # Default database name
+ DEFAULTDB = "hammie.db"
+
# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
***************
*** 278,282 ****
usage(2, "No options given")
! pck = "hammie.db"
good = spam = unknown = None
do_filter = usedb = False
--- 280,284 ----
usage(2, "No options given")
! pck = DEFAULTDB
good = spam = unknown = None
do_filter = usedb = False
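The substitution above works because hammie.py's usage docstring is rendered with %-interpolation against the module globals, so a default like DEFAULTDB only has to be spelled once. A minimal sketch of the pattern (USAGE and usage_text are illustrative names, not hammie.py's; only DEFAULTDB and DISPHEADER appear in the checkin):

```python
# Sketch of the module-global interpolation pattern from hammie.py:
# the usage text contains %(NAME)s placeholders that are filled from
# globals() at print time, so defaults stay in sync with the code.

DEFAULTDB = "hammie.db"
DISPHEADER = "X-Hammie-Disposition"

USAGE = """Usage: %(program)s [options]
  -p FILE   use FILE as the persistent store (default: %(DEFAULTDB)s)
  -f        filter mode: add an %(DISPHEADER)s header to a message
"""

def usage_text(program="hammie.py"):
    # Merge the module globals with the program name so every
    # %(...)s placeholder in USAGE resolves.
    values = dict(globals(), program=program)
    return USAGE % values

print(usage_text())
```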
From tim_one@users.sourceforge.net Sat Sep 7 19:22:02 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 07 Sep 2002 11:22:02 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.9,1.10
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16167
Modified Files:
README.txt
Log Message:
Some rearrangement.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** README.txt 7 Sep 2002 16:14:09 -0000 1.9
--- README.txt 7 Sep 2002 18:22:00 -0000 1.10
***************
*** 31,35 ****
hammie.py
! A spamassassin-like filter which uses timtoken (below) and
classifier (above). Needs to be made faster, especially for writes.
--- 31,35 ----
hammie.py
! A spamassassin-like filter which uses tokenizer (below) and
classifier (above). Needs to be made faster, especially for writes.
***************
*** 49,56 ****
tokenize() function of your choosing.
- unheader.py
- A script to remove unwanted headers from an mbox file. This is mostly
- useful to delete headers which incorrectly might bias the results.
-
GBayes.py
A number of tokenizers and a partial test driver. This assumes
--- 49,52 ----
***************
*** 73,84 ****
Test Data Utilities
===================
- rebal.py
- Evens out the number of messages in "standard" test data folders (see
- below).
-
cleanarch
A script to repair mbox archives by finding "From" lines that
should have been escaped, and escaping them.
mboxcount.py
Count the number of messages (both parseable and unparseable) in
--- 69,80 ----
Test Data Utilities
===================
cleanarch
A script to repair mbox archives by finding "From" lines that
should have been escaped, and escaping them.
+ unheader.py
+ A script to remove unwanted headers from an mbox file. This is mostly
+ useful to delete headers which incorrectly might bias the results.
+
mboxcount.py
Count the number of messages (both parseable and unparseable) in
***************
*** 89,92 ****
--- 85,92 ----
Split an mbox into random pieces in various ways. Tim recommends
using "the standard" test data set up instead (see below).
+
+ rebal.py
+ Evens out the number of messages in "standard" test data folders (see
+ below).
From tim_one@users.sourceforge.net Sat Sep 7 19:38:13 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 07 Sep 2002 11:38:13 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.1,1.2
timtoken.py,1.8,NONE
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19837
Modified Files:
tokenizer.py
Removed Files:
timtoken.py
Log Message:
Removed timtoken.py from the project. tokenizer.py is essentially a
copy, but of a somewhat out-of-date version of timtoken at the time
it was introduced. The differences are all in comments, and I found
those and put them back into tokenizer.py.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** tokenizer.py 7 Sep 2002 16:14:09 -0000 1.1
--- tokenizer.py 7 Sep 2002 18:38:10 -0000 1.2
***************
*** 352,355 ****
--- 352,375 ----
# XXX not to strip HTML from HTML-only msgs should be revisited.
+ ##############################################################################
+ # How big should "a word" be?
+ #
+ # As I write this, words less than 3 chars are ignored completely, and words
+ # with more than 12 are special-cased, replaced with a summary "I skipped
+ # about so-and-so many chars starting with such-and-such a letter" token.
+ # This makes sense for English if most of the info is in "regular size"
+ # words.
+ #
+ # A test run boosting to 13 had no effect on f-p rate, and did a little
+ # better or worse than 12 across runs -- overall, no significant difference.
+ # The database size is smaller at 12, so there's nothing in favor of 13.
+ # A test at 11 showed a slight but consistent bad effect on the f-n rate
+ # (lost 12 times, won once, tied 7 times).
+ #
+ # A test with no lower bound showed a significant increase in the f-n rate.
+ # Curious, but not worth digging into. Boosting the lower bound to 4 is a
+ # worse idea: f-p and f-n rates both suffered significantly then. I didn't
+ # try testing with lower bound 2.
+
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
***************
*** 383,392 ****
n = _len(word)
- # XXX How big should "a word" be?
- # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
- # XXX on f-p rate, and did a little better or worse than 12 across
- # XXX runs -- overall, no significant difference. It's only "common
- # XXX sense" so far driving the exclusion of lengths 1 and 2.
-
# Make sure this range matches in tokenize().
if 3 <= n <= 12:
--- 403,406 ----
***************
*** 449,453 ****
#
# A bug in this code prevented Content-Transfer-Encoding from getting
! # picked up. Fixing that bug showed that it didn't helpe, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it);
--- 463,467 ----
#
# A bug in this code prevented Content-Transfer-Encoding from getting
! # picked up. Fixing that bug showed that it didn't help, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it);
***************
*** 567,571 ****
def tokenize_headers(self, msg):
# Special tagging of header lines.
!
# XXX TODO Neil Schemenauer has gotten a good start on this
# XXX (pvt email). The headers in my spam and ham corpora are
--- 581,585 ----
def tokenize_headers(self, msg):
# Special tagging of header lines.
!
# XXX TODO Neil Schemenauer has gotten a good start on this
# XXX (pvt email). The headers in my spam and ham corpora are
--- timtoken.py DELETED ---
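The word-length policy described in the comment block above (keep 3..12 character words, summarize longer ones, drop shorter ones) can be sketched as a generator; the exact spelling of the summary token here is illustrative, not necessarily spambayes' actual format:

```python
# Sketch of the word-length policy: words of 3..12 characters are kept
# as-is; longer words are collapsed into a summary "skipped chars
# starting with such-and-such a letter" token; shorter ones are dropped.

def tokenize_word(word, lo=3, hi=12):
    n = len(word)
    if lo <= n <= hi:
        yield word
    elif n > hi:
        # Summarize: first letter plus a coarse length bucket.
        yield "skip:%s %d" % (word[0], n // 10 * 10)
    # Words shorter than lo produce nothing.

print(list(tokenize_word("viagra")))
print(list(tokenize_word("supercalifragilistic")))
print(list(tokenize_word("of")))
```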
From tim_one@users.sourceforge.net Sat Sep 7 20:44:34 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 07 Sep 2002 12:44:34 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2928
Modified Files:
tokenizer.py
Log Message:
Added Neil Schemenauer's IP tokenization of Received: headers,
unfortunately disabled for now.
Moved textparts() below the massive comments at the start.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** tokenizer.py 7 Sep 2002 18:38:10 -0000 1.2
--- tokenizer.py 7 Sep 2002 19:44:31 -0000 1.3
***************
*** 5,44 ****
from sets import Set
- # Find all the text components of the msg. There's no point decoding
- # binary blobs (like images). If a multipart/alternative has both plain
- # text and HTML versions of a msg, ignore the HTML part: HTML decorations
- # have monster-high spam probabilities, and innocent newbies often post
- # using HTML.
- def textparts(msg):
- text = Set()
- redundant_html = Set()
- for part in msg.walk():
- if part.get_content_type() == 'multipart/alternative':
- # Descend this part of the tree, adding any redundant HTML text
- # part to redundant_html.
- htmlpart = textpart = None
- stack = part.get_payload()
- while stack:
- subpart = stack.pop()
- ctype = subpart.get_content_type()
- if ctype == 'text/plain':
- textpart = subpart
- elif ctype == 'text/html':
- htmlpart = subpart
- elif ctype == 'multipart/related':
- stack.extend(subpart.get_payload())
-
- if textpart is not None:
- text.add(textpart)
- if htmlpart is not None:
- redundant_html.add(htmlpart)
- elif htmlpart is not None:
- text.add(htmlpart)
-
- elif part.get_content_maintype() == 'text':
- text.add(part)
-
- return text - redundant_html
-
##############################################################################
# To fold case or not to fold case? I didn't want to fold case, because
--- 5,8 ----
***************
*** 372,375 ****
--- 336,377 ----
# try testing with lower bound 2.
+
+
+ # Find all the text components of the msg. There's no point decoding
+ # binary blobs (like images). If a multipart/alternative has both plain
+ # text and HTML versions of a msg, ignore the HTML part: HTML decorations
+ # have monster-high spam probabilities, and innocent newbies often post
+ # using HTML.
+ def textparts(msg):
+ text = Set()
+ redundant_html = Set()
+ for part in msg.walk():
+ if part.get_content_type() == 'multipart/alternative':
+ # Descend this part of the tree, adding any redundant HTML text
+ # part to redundant_html.
+ htmlpart = textpart = None
+ stack = part.get_payload()
+ while stack:
+ subpart = stack.pop()
+ ctype = subpart.get_content_type()
+ if ctype == 'text/plain':
+ textpart = subpart
+ elif ctype == 'text/html':
+ htmlpart = subpart
+ elif ctype == 'multipart/related':
+ stack.extend(subpart.get_payload())
+
+ if textpart is not None:
+ text.add(textpart)
+ if htmlpart is not None:
+ redundant_html.add(htmlpart)
+ elif htmlpart is not None:
+ text.add(htmlpart)
+
+ elif part.get_content_maintype() == 'text':
+ text.add(part)
+
+ return text - redundant_html
+
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
***************
*** 393,396 ****
--- 395,400 ----
""", re.VERBOSE)
+ ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
+
# I'm usually just splitting on whitespace, but for subject lines I want to
# break things like "Python/Perl comparison?" up. OTOH, I don't want to
***************
*** 640,643 ****
--- 644,660 ----
if msg.get('organization', None) is None:
yield "bool:noorg"
+
+ # Received:
+ # Neil Schemenauer reported good results from tokenizing prefixes
+ # of the embedded IP addresses.
+ # XXX This is disabled only because it's "too good" when used on
+ # XXX Tim's mixed-source corpora.
+ if 0:
+ for header in msg.get_all("received", ()):
+ for ip in ip_re.findall(header):
+ parts = ip.split(".")
+ for n in range(1, 5):
+ yield 'received:' + '.'.join(parts[:n])
+
# XXX Following is a great idea due to Anthony Baxter. I can't use it
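The disabled Received-header logic above is small enough to run standalone; this is the diff's code with a driver added (the sample header is made up):

```python
import re

# Each embedded IPv4 address in a Received header contributes four
# tokens, one per dotted prefix, per the (disabled) code in the checkin.

ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')

def received_tokens(header):
    for ip in ip_re.findall(header):
        parts = ip.split(".")
        for n in range(1, 5):
            yield 'received:' + '.'.join(parts[:n])

header = "from mail.example.com ([192.168.0.1]) by mx.example.com"
print(list(received_tokens(header)))
```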
From gvanrossum@users.sourceforge.net Sun Sep 8 03:59:45 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Sat, 07 Sep 2002 19:59:45 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.13,1.14
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31886
Modified Files:
hammie.py
Log Message:
Make -u only print the spams.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** hammie.py 7 Sep 2002 17:12:22 -0000 1.13
--- hammie.py 8 Sep 2002 02:59:43 -0000 1.14
***************
*** 253,263 ****
prob, clues = bayes.spamprob(tokenize(msg), True)
isspam = prob >= 0.9
- print "%6d %4.2f %1s" % (i, prob, isspam and "S" or "."),
if isspam:
spams += 1
print formatclues(clues)
else:
hams += 1
- print
print "Total %d spam, %d ham" % (spams, hams)
--- 253,262 ----
prob, clues = bayes.spamprob(tokenize(msg), True)
isspam = prob >= 0.9
if isspam:
spams += 1
+ print "%6s %4.2f %1s" % (i, prob, isspam and "S" or "."),
print formatclues(clues)
else:
hams += 1
print "Total %d spam, %d ham" % (spams, hams)
From tim_one@users.sourceforge.net Sun Sep 8 04:17:33 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 07 Sep 2002 20:17:33 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2722
Modified Files:
classifier.py
Log Message:
spamprob(): If the caller asked for the clues ( pairs),
(The clues are the (word, prob) evidence pairs returned when spamprob() is called with evidence=True.)
sort them by prob.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** classifier.py 7 Sep 2002 05:11:30 -0000 1.4
--- classifier.py 8 Sep 2002 03:17:31 -0000 1.5
***************
*** 323,326 ****
--- 323,327 ----
prob = prob_product / (prob_product + inverse_prob_product)
if evidence:
+ clues.sort(lambda a, b: cmp(a[1], b[1]))
return prob, clues
else:
***************
*** 559,562 ****
--- 560,571 ----
elif prob > MAX_SPAMPROB:
prob = MAX_SPAMPROB
+
+
+ ## if prob != 0.5:
+ ## confbias = 0.01 / (record.hamcount + record.spamcount)
+ ## if prob > 0.5:
+ ## prob = max(0.5, prob - confbias)
+ ## else:
+ ## prob = min(0.5, prob + confbias)
if record.spamprob != prob:
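The Python 2 cmp-based sort above translates to a key= sort in modern Python; sorting ascending by probability puts the hammiest clues first and the spammiest last:

```python
# The evidence (word, prob) pairs are sorted by probability before being
# returned to the caller (sample clue words are made up).

clues = [("free", 0.99), ("python", 0.01), ("click", 0.93), ("wrote", 0.10)]
clues.sort(key=lambda pair: pair[1])   # ascending spamprob: hammiest first
print(clues)
```

This is what lets hammie.py drop its own sorting of the clues.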
From gvanrossum@users.sourceforge.net Sun Sep 8 04:20:20 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Sat, 07 Sep 2002 20:20:20 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3250
Modified Files:
hammie.py
Log Message:
No need to sort the clues any more (classifier.py does that now).
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** hammie.py 8 Sep 2002 02:59:43 -0000 1.14
--- hammie.py 8 Sep 2002 03:20:18 -0000 1.15
***************
*** 226,232 ****
def formatclues(clues, sep="; "):
"""Format the clues into something readable."""
! lst = [(prob, word) for word, prob in clues]
! lst.sort()
! return sep.join(["%r: %.2f" % (word, prob) for prob, word in lst])
def filter(bayes, input, output):
--- 226,230 ----
def formatclues(clues, sep="; "):
"""Format the clues into something readable."""
! return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])
def filter(bayes, input, output):
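The simplified formatclues() is self-contained; here it is with a driver (the sample clues are made up), showing the "; "-joined word/probability output:

```python
def formatclues(clues, sep="; "):
    """Format the clues into something readable."""
    return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])

print(formatclues([("python", 0.01), ("free", 0.99)]))
```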
From tim_one@users.sourceforge.net Sun Sep 8 09:08:04 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 01:08:04 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24720
Modified Files:
tokenizer.py
Log Message:
Add results from latest experiments with tokenization and HTML stripping.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** tokenizer.py 7 Sep 2002 19:44:31 -0000 1.3
--- tokenizer.py 8 Sep 2002 08:08:02 -0000 1.4
***************
*** 205,209 ****
#
# total unique fn went from 292 to 302
!
##############################################################################
--- 205,299 ----
#
# total unique fn went from 292 to 302
! #
! # Later: Here's another tokenization scheme with more promise.
! #
! # fold case, ignore punctuation, strip a trailing 's' from words (to
! # stop Guido griping about "hotel" and "hotels" getting scored as
! # distinct clues ) and save both word bigrams and word unigrams
! #
! # This was the code:
! #
! # # Tokenize everything in the body.
! # lastw = ''
! # for w in word_re.findall(text):
! # n = len(w)
! # # Make sure this range matches in tokenize_word().
! # if 3 <= n <= 12:
! # if w[-1] == 's':
! # w = w[:-1]
! # yield w
! # if lastw:
! # yield lastw + w
! # lastw = w + ' '
! #
! # elif n >= 3:
! # lastw = ''
! # for t in tokenize_word(w):
! # yield t
! #
! # where
! #
! # word_re = re.compile(r"[\w$\-\x80-\xff]+")
! #
! # This at least doubled the process size. It helped the f-n rate
! # significantly, but probably hurt the f-p rate (the f-p rate is too low
! # with only 4000 hams per run to be confident about changes of such small
! # *absolute* magnitude -- 0.025% is a single message in the f-p table):
! #
! # false positive percentages
! # 0.000 0.000 tied
! # 0.000 0.075 lost +(was 0)
! # 0.050 0.125 lost +150.00%
! # 0.025 0.000 won -100.00%
! # 0.075 0.025 won -66.67%
! # 0.000 0.050 lost +(was 0)
! # 0.100 0.175 lost +75.00%
! # 0.050 0.050 tied
! # 0.025 0.050 lost +100.00%
! # 0.025 0.000 won -100.00%
! # 0.050 0.125 lost +150.00%
! # 0.050 0.025 won -50.00%
! # 0.050 0.050 tied
! # 0.000 0.025 lost +(was 0)
! # 0.000 0.025 lost +(was 0)
! # 0.075 0.050 won -33.33%
! # 0.025 0.050 lost +100.00%
! # 0.000 0.000 tied
! # 0.025 0.100 lost +300.00%
! # 0.050 0.150 lost +200.00%
! #
! # won 5 times
! # tied 4 times
! # lost 11 times
! #
! # total unique fp went from 13 to 21
! #
! # false negative percentages
! # 0.327 0.218 won -33.33%
! # 0.400 0.218 won -45.50%
! # 0.327 0.218 won -33.33%
! # 0.691 0.691 tied
! # 0.545 0.327 won -40.00%
! # 0.291 0.218 won -25.09%
! # 0.218 0.291 lost +33.49%
! # 0.654 0.473 won -27.68%
! # 0.364 0.327 won -10.16%
! # 0.291 0.182 won -37.46%
! # 0.327 0.254 won -22.32%
! # 0.691 0.509 won -26.34%
! # 0.582 0.473 won -18.73%
! # 0.291 0.255 won -12.37%
! # 0.364 0.218 won -40.11%
! # 0.436 0.327 won -25.00%
! # 0.436 0.473 lost +8.49%
! # 0.218 0.218 tied
! # 0.291 0.255 won -12.37%
! # 0.254 0.364 lost +43.31%
! #
! # won 15 times
! # tied 2 times
! # lost 3 times
! #
! # total unique fn went from 106 to 94
##############################################################################
***************
*** 313,318 ****
# do that part. However, even after stripping tags, the rates above show that
# at least 98% of spams are still correctly identified as spam.
! # XXX So, if another way is found to slash the f-n rate, the decision here
! # XXX not to strip HTML from HTML-only msgs should be revisited.
##############################################################################
--- 403,471 ----
# do that part. However, even after stripping tags, the rates above show that
# at least 98% of spams are still correctly identified as spam.
! #
! # So, if another way is found to slash the f-n rate, the decision here not
! # to strip HTML from HTML-only msgs should be revisited.
! #
! # Later, after the f-n rate got slashed via other means:
! #
! # false positive percentages
! # 0.000 0.000 tied
! # 0.000 0.000 tied
! # 0.050 0.075 lost +50.00%
! # 0.025 0.025 tied
! # 0.075 0.025 won -66.67%
! # 0.000 0.000 tied
! # 0.100 0.100 tied
! # 0.050 0.075 lost +50.00%
! # 0.025 0.025 tied
! # 0.025 0.000 won -100.00%
! # 0.050 0.075 lost +50.00%
! # 0.050 0.050 tied
! # 0.050 0.025 won -50.00%
! # 0.000 0.000 tied
! # 0.000 0.000 tied
! # 0.075 0.075 tied
! # 0.025 0.025 tied
! # 0.000 0.000 tied
! # 0.025 0.025 tied
! # 0.050 0.050 tied
! #
! # won 3 times
! # tied 14 times
! # lost 3 times
! #
! # total unique fp went from 13 to 11
! #
! # false negative percentages
! # 0.327 0.400 lost +22.32%
! # 0.400 0.400 tied
! # 0.327 0.473 lost +44.65%
! # 0.691 0.654 won -5.35%
! # 0.545 0.473 won -13.21%
! # 0.291 0.364 lost +25.09%
! # 0.218 0.291 lost +33.49%
! # 0.654 0.654 tied
! # 0.364 0.473 lost +29.95%
! # 0.291 0.327 lost +12.37%
! # 0.327 0.291 won -11.01%
! # 0.691 0.654 won -5.35%
! # 0.582 0.655 lost +12.54%
! # 0.291 0.400 lost +37.46%
! # 0.364 0.436 lost +19.78%
! # 0.436 0.582 lost +33.49%
! # 0.436 0.364 won -16.51%
! # 0.218 0.291 lost +33.49%
! # 0.291 0.400 lost +37.46%
! # 0.254 0.327 lost +28.74%
! #
! # won 5 times
! # tied 2 times
! # lost 13 times
! #
! # total unique fn went from 106 to 122
! #
! # So HTML decorations are still a significant clue when the ham is composed
! # of c.l.py traffic. Again, this should be revisited if the f-n rate is
! # slashed again.
##############################################################################
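The experimental unigram+bigram scheme quoted in the comment above can be made runnable; this follows the quoted code, with lowercasing standing in for "fold case" and the over-long-word branch reduced to resetting the bigram chain:

```python
import re

# Strip a trailing 's' from regular-size (3..12 char) words and emit
# each unigram plus the (previous word + word) bigram, per the
# experimental scheme described in the checkin.

word_re = re.compile(r"[\w$\-\x80-\xff]+")

def bigram_tokens(text):
    lastw = ''
    for w in word_re.findall(text.lower()):
        n = len(w)
        if 3 <= n <= 12:
            if w.endswith('s'):
                w = w[:-1]
            yield w
            if lastw:
                yield lastw + w
            lastw = w + ' '
        elif n >= 3:
            # Over-long words would go through tokenize_word(); for this
            # sketch we just reset the bigram chain.
            lastw = ''

print(list(bigram_tokens("Cheap hotels near Python conferences")))
```

Note how "hotels" and "hotel" collapse to the same clue, which was the point of the trailing-'s' strip.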
From nascheme@users.sourceforge.net Sun Sep 8 13:55:36 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Sun, 08 Sep 2002 05:55:36 -0700
Subject: [Spambayes-checkins] spambayes splitndirs.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14232
Added Files:
splitndirs.py
Log Message:
Like splitn.py but puts each message in a file, suitable for timtest.py. I
don't know what the assert is trying to do and it fails on my spam box so I
left it out.
--- NEW FILE: splitndirs.py ---
#! /usr/bin/env python
"""Split an mbox into N random directories of files.

Usage: %(program)s [-h] [-s seed] [-v] -n N sourcembox outdirbase

Options:
    -h / --help
        Print this help message and exit

    -s seed
        Seed the random number generator with seed (an integer).
        By default, use system time at startup to seed.

    -v
        Verbose.  Displays a period for each 100 messages parsed.
        May display other stuff.

    -n N
        The number of output mboxes desired.  This is required.

Arguments:
    sourcembox
        The mbox to split.

    outdirbase
        The base path + name prefix for each of the N output dirs.
        Output files have names of the form
            outdirbase + ("Set%%d/%%d" %% (i, n))

Example:
    %(program)s -s 123 -n5 Data/spam.mbox Data/Spam/Set

produces 5 directories, named Data/Spam/Set1 through Data/Spam/Set5.  Each
contains a random selection of the messages in spam.mbox, and together
they contain every message in spam.mbox exactly once.  Each has
approximately the same number of messages.  spam.mbox is not altered.  In
addition, the seed for the random number generator is forced to 123, so
that while the split is random, it's reproducible.
"""

import sys
import os
import random
import mailbox
import email
import getopt

program = sys.argv[0]

def usage(code, msg=''):
    print >> sys.stderr, __doc__ % globals()
    if msg:
        print >> sys.stderr, msg
    sys.exit(code)

def _factory(fp):
    try:
        return email.message_from_file(fp)
    except email.Errors.MessageParseError:
        return ''

def main():
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hn:s:v', ['help'])
    except getopt.error, msg:
        usage(1, msg)

    n = None
    verbose = False
    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage(0)
        elif opt == '-s':
            random.seed(int(arg))
        elif opt == '-n':
            n = int(arg)
        elif opt == '-v':
            verbose = True

    if n is None or n <= 1:
        usage(1, "an -n value > 1 is required")
    if len(args) != 2:
        usage(1, "input mbox name and output base path are required")
    inputpath, outputbasepath = args

    infile = file(inputpath, 'rb')
    outdirs = [outputbasepath + ("%d" % i) for i in range(1, n+1)]
    for dir in outdirs:
        if not os.path.isdir(dir):
            os.makedirs(dir)

    mbox = mailbox.PortableUnixMailbox(infile, _factory)
    counter = 0
    for msg in mbox:
        i = random.randrange(n)
        astext = str(msg)
        #assert astext.endswith('\n')
        counter += 1
        msgfile = open('%s/%d' % (outdirs[i], counter), 'wb')
        msgfile.write(astext)
        msgfile.close()
        if verbose:
            if counter % 100 == 0:
                print '.',
    if verbose:
        print
    print counter, "messages split into", n, "directories"
    infile.close()

if __name__ == '__main__':
    main()
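The docstring's claim that "-s 123" makes the split random but reproducible rests on random.seed() making randrange() deterministic; a quick demonstration:

```python
import random

# Seeding the generator makes the per-message bucket assignments
# reproducible across runs, which is what the -s option relies on.
random.seed(123)
first = [random.randrange(5) for _ in range(10)]

random.seed(123)
second = [random.randrange(5) for _ in range(10)]

print(first == second)
```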
From nascheme@users.sourceforge.net Sun Sep 8 18:10:06 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Sun, 08 Sep 2002 10:10:06 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10946
Modified Files:
cmp.py
Log Message:
make work for NSETS != 5
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** cmp.py 6 Sep 2002 04:25:45 -0000 1.2
--- cmp.py 8 Sep 2002 17:10:03 -0000 1.3
***************
*** 10,15 ****
f1n, f2n = sys.argv[1:3]
- NSETS = 5
-
# Return
# (list of all f-p rates,
--- 10,13 ----
***************
*** 21,29 ****
fns = []
fps = []
! for block in range(NSETS):
! # Skip, e.g.,
! # Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
! f.readline()
! for inner in range(NSETS - 1):
# A line with an f-p rate and an f-n rate.
p, n = map(float, f.readline().split())
--- 19,27 ----
fns = []
fps = []
! while 1:
! line = f.readline()
! if line.startswith('total'):
! break
! if not line.startswith('Training'):
# A line with an f-p rate and an f-n rate.
p, n = map(float, f.readline().split())
***************
*** 33,37 ****
# "total false pos 8 0.04"
# "total false neg 249 1.81090909091"
! fptot = int(f.readline().split()[-2])
fntot = int(f.readline().split()[-2])
return fps, fns, fptot, fntot
--- 31,35 ----
# "total false pos 8 0.04"
# "total false neg 249 1.81090909091"
! fptot = int(line.split()[-2])
fntot = int(f.readline().split()[-2])
return fps, fns, fptot, fntot
From nascheme@users.sourceforge.net Sun Sep 8 18:18:44 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Sun, 08 Sep 2002 10:18:44 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12815
Modified Files:
tokenizer.py
Log Message:
smarter received header processing. Grab the 'from' hostname and IP and
ignore the rest.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** tokenizer.py 8 Sep 2002 08:08:02 -0000 1.4
--- tokenizer.py 8 Sep 2002 17:18:41 -0000 1.5
***************
*** 548,552 ****
""", re.VERBOSE)
! ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
# I'm usually just splitting on whitespace, but for subject lines I want to
--- 548,553 ----
""", re.VERBOSE)
! received_host_re = re.compile(r'from (\S+)\s')
! received_ip_re = re.compile(r'\s[[(]((\d{1,3}\.?){4})[\])]')
# I'm usually just splitting on whitespace, but for subject lines I want to
***************
*** 708,711 ****
--- 709,721 ----
yield 'content-transfer-encoding:' + x.lower()
+ def breakdown_host(host):
+ parts = host.split('.')
+ for i in range(1, len(parts) + 1):
+ yield '.'.join(parts[-i:])
+
+ def breakdown_ipaddr(ipaddr):
+ parts = ipaddr.split('.')
+ for i in range(1, 5):
+ yield '.'.join(parts[:i])
class Tokenizer:
***************
*** 805,813 ****
if 0:
for header in msg.get_all("received", ()):
! for ip in ip_re.findall(header):
! parts = ip.split(".")
! for n in range(1, 5):
! yield 'received:' + '.'.join(parts[:n])
!
# XXX Following is a great idea due to Anthony Baxter. I can't use it
--- 815,824 ----
if 0:
for header in msg.get_all("received", ()):
! for pat, breakdown in [(received_host_re, breakdown_host),
! (received_ip_re, breakdown_ipaddr)]:
! m = pat.search(header)
! if m:
! for tok in breakdown(m.group(1).lower()):
! yield 'received:' + tok
# XXX Following is a great idea due to Anthony Baxter. I can't use it
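The two breakdown helpers added above run as-is; note the asymmetry: hostnames yield dotted suffixes (TLD first, full name last), while IP addresses yield dotted prefixes. Here they are with a driver (sample inputs made up):

```python
def breakdown_host(host):
    # Yield dotted suffixes of a hostname, least specific first.
    parts = host.split('.')
    for i in range(1, len(parts) + 1):
        yield '.'.join(parts[-i:])

def breakdown_ipaddr(ipaddr):
    # Yield dotted prefixes of an IPv4 address, shortest first.
    parts = ipaddr.split('.')
    for i in range(1, 5):
        yield '.'.join(parts[:i])

print(list(breakdown_host('mail.python.org')))
print(list(breakdown_ipaddr('12.34.56.78')))
```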
From tim_one@users.sourceforge.net Sun Sep 8 18:41:59 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 10:41:59 -0700
Subject: [Spambayes-checkins] spambayes splitn.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv17562
Modified Files:
splitn.py
Log Message:
Removed pointless assert; it failed for Neil.
Index: splitn.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitn.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** splitn.py 5 Sep 2002 16:16:43 -0000 1.1
--- splitn.py 8 Sep 2002 17:41:56 -0000 1.2
***************
*** 94,98 ****
i = random.randrange(n)
astext = str(msg)
- assert astext.endswith('\n')
outfiles[i].write(astext)
counter += 1
--- 94,97 ----
From tim_one@users.sourceforge.net Sun Sep 8 18:46:16 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 10:46:16 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.10,1.11
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18658
Modified Files:
README.txt
Log Message:
Blurb about Neil's splitndirs.py.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** README.txt 7 Sep 2002 18:22:00 -0000 1.10
--- README.txt 8 Sep 2002 17:46:14 -0000 1.11
***************
*** 86,92 ****
using "the standard" test data set up instead (see below).
rebal.py
Evens out the number of messages in "standard" test data folders (see
! below).
--- 86,98 ----
using "the standard" test data set up instead (see below).
+ splitndirs.py
+ Like splitn.py (above), but splits an mbox into one message per file in
+ "the standard" directory structure (see below). This does an
approximate split; rebal.py (below) can be used afterwards to even out
+ the number of messages per folder.
+
rebal.py
Evens out the number of messages in "standard" test data folders (see
! below). Needs generalization (e.g., Ham and 4000 are hardcoded now).
From tim_one@users.sourceforge.net Sun Sep 8 18:50:51 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 10:50:51 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19567
Modified Files:
cmp.py
Log Message:
dump(): tiny simplification of print format.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** cmp.py 8 Sep 2002 17:10:03 -0000 1.3
--- cmp.py 8 Sep 2002 17:50:49 -0000 1.4
***************
*** 55,59 ****
print
for t in "won", "tied", "lost":
! print "%-4s %2d %s" % (t, alltags.count(t), "times")
print
--- 55,59 ----
print
for t in "won", "tied", "lost":
! print "%-4s %2d times" % (t, alltags.count(t))
print
From tim_one@users.sourceforge.net Sun Sep 8 19:21:26 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 11:21:26 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26263
Modified Files:
cmp.py
Log Message:
Someone introduced a bug that resulted in half the f-p and f-n rates
getting ignored (every 2nd line of that type got skipped). Repaired it.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** cmp.py 8 Sep 2002 17:50:49 -0000 1.4
--- cmp.py 8 Sep 2002 18:21:24 -0000 1.5
***************
*** 25,29 ****
if not line.startswith('Training'):
# A line with an f-p rate and an f-n rate.
! p, n = map(float, f.readline().split())
fps.append(p)
fns.append(n)
--- 25,29 ----
if not line.startswith('Training'):
# A line with an f-p rate and an f-n rate.
! p, n = map(float, line.split())
fps.append(p)
fns.append(n)
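The corrected loop parses rate lines from `line` itself; the bug fixed above read a second line there, silently dropping every other rate. A sketch of the corrected logic, restructured as a function over a list of lines rather than a file object:

```python
# Collect (f-p, f-n) rate pairs, skipping "Training ..." headers and
# stopping at the "total ..." summary, per the corrected cmp.py loop.

def parse_rates(lines):
    fps, fns = [], []
    for line in lines:
        if line.startswith('total'):
            break
        if not line.startswith('Training'):
            # A line with an f-p rate and an f-n rate.
            p, n = map(float, line.split())
            fps.append(p)
            fns.append(n)
    return fps, fns

sample = [
    "Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams",
    "0.025 0.327",
    "0.050 0.218",
    "total false pos 8 0.04",
]
print(parse_rates(sample))
```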
From tim_one@users.sourceforge.net Sun Sep 8 19:39:01 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 11:39:01 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30785
Modified Files:
cmp.py
Log Message:
Compute and display the %change for total unique fn and fp too.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** cmp.py 8 Sep 2002 18:21:24 -0000 1.5
--- cmp.py 8 Sep 2002 18:38:59 -0000 1.6
***************
*** 66,73 ****
print "false positive percentages"
dump(fp1, fp2)
! print "total unique fp went from", fptot1, "to", fptot2
print
print "false negative percentages"
dump(fn1, fn2)
! print "total unique fn went from", fntot1, "to", fntot2
--- 66,73 ----
print "false positive percentages"
dump(fp1, fp2)
! print "total unique fp went from", fptot1, "to", fptot2, tag(fptot1, fptot2)
print
print "false negative percentages"
dump(fn1, fn2)
! print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2)
From tim_one@users.sourceforge.net Sun Sep 8 19:54:12 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 11:54:12 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32497
Modified Files:
tokenizer.py
Log Message:
tokenize(): Stop distinguishing Content-XYZ thingies in the
headers from instances in lower-level MIME sections. In all,
doing so appears to be just another way of warping the
tokenizer to c.l.py's extreme hatred of HTML. For example,
'>content-type:text/plain' (lower-level instance) has a spamprob
of 0.85 in my data, but 'content-type:text/plain' (top-level
instance) has spamprob less than 0.25. A few examples Guido
posted suggest this distinction does more harm on his data
than it does good on mine. On mine, getting rid of the
distinction makes a tiny difference in the f-n rates; note
that an f-n boost from 0.327% to 0.364% represents a single
msg in my ~2750-msg spam sets:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.025 0.025 tied
0.075 0.075 tied
0.000 0.000 tied
0.100 0.075 won -25.00%
0.050 0.075 lost +50.00%
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.050 0.050 tied
0.050 0.050 tied
0.000 0.000 tied
0.000 0.000 tied
0.075 0.075 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 1 times
tied 18 times
lost 1 times
total unique fp went from 13 to 12 won -7.69%
false negative percentages
0.327 0.327 tied
0.400 0.400 tied
0.327 0.364 lost +11.31%
0.691 0.691 tied
0.545 0.545 tied
0.291 0.291 tied
0.218 0.291 lost +33.49%
0.654 0.618 won -5.50%
0.364 0.436 lost +19.78%
0.291 0.327 lost +12.37%
0.327 0.364 lost +11.31%
0.691 0.691 tied
0.582 0.618 lost +6.19%
0.291 0.291 tied
0.364 0.291 won -20.05%
0.436 0.436 tied
0.436 0.473 lost +8.49%
0.218 0.218 tied
0.291 0.291 tied
0.254 0.254 tied
won 2 times
tied 11 times
lost 7 times
total unique fn went from 106 to 110 lost +3.77%
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** tokenizer.py 8 Sep 2002 17:18:41 -0000 1.5
--- tokenizer.py 8 Sep 2002 18:54:09 -0000 1.6
***************
*** 757,765 ****
# Content-{Type, Disposition} and their params, and charsets.
- t = ''
for x in msg.walk():
for w in crack_content_xyz(x):
! yield t + w
! t = '>'
# Subject:
--- 757,763 ----
# Content-{Type, Disposition} and their params, and charsets.
for x in msg.walk():
for w in crack_content_xyz(x):
! yield w
# Subject:
From tim_one@users.sourceforge.net Sun Sep 8 22:08:18 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 14:08:18 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.12,1.13
tokenizer.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4417
Modified Files:
timtest.py tokenizer.py
Log Message:
tokenize_word(): Stopped splitting the y in x@y on '.'. Improved the
f-n rate. The big loser for f-p was a message consisting entirely of
"Thanks guys", posted from an x@y address where y had a 0.99 spamprob,
but where y split in pieces had two significantly lower spamprobs.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** timtest.py 7 Sep 2002 16:15:45 -0000 1.12
--- timtest.py 8 Sep 2002 21:08:16 -0000 1.13
***************
*** 107,111 ****
random.seed(hash(directory))
random.shuffle(all)
! for fname in all[-500:]:
yield Msg(directory, fname)
--- 107,111 ----
random.seed(hash(directory))
random.shuffle(all)
! for fname in all[-1500:-1000:]:
yield Msg(directory, fname)
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** tokenizer.py 8 Sep 2002 18:54:09 -0000 1.6
--- tokenizer.py 8 Sep 2002 21:08:16 -0000 1.7
***************
*** 569,577 ****
# Don't want to skip embedded email addresses.
if n < 40 and '.' in word and word.count('@') == 1:
p1, p2 = word.split('@')
yield 'email name:' + p1
! for piece in p2.split('.'):
! yield 'email addr:' + piece
# If there are any high-bit chars,
--- 569,578 ----
# Don't want to skip embedded email addresses.
+ # An earlier scheme also split up the y in x@y on '.'. Not splitting
+ # improved the f-n rate; the f-p rate didn't care either way.
if n < 40 and '.' in word and word.count('@') == 1:
p1, p2 = word.split('@')
yield 'email name:' + p1
! yield 'email addr:' + p2
# If there are any high-bit chars,
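The change above can be seen in isolation. A minimal sketch of the email-address branch of tokenize_word() after this checkin (the standalone generator is illustrative; in tokenizer.py this logic is inlined):

```python
def email_tokens(word):
    """Yield the tokens tokenize_word() emits for an embedded address.

    After rev 1.7, the y in x@y is kept whole rather than split on '.'.
    """
    if len(word) < 40 and '.' in word and word.count('@') == 1:
        p1, p2 = word.split('@')
        yield 'email name:' + p1
        yield 'email addr:' + p2

print(list(email_tokens('guido@python.org')))
# ['email name:guido', 'email addr:python.org']
```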
From tim_one@users.sourceforge.net Sun Sep 8 22:29:07 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 14:29:07 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10235
Modified Files:
tokenizer.py
Log Message:
Fixed grammar in a comment, just because I forgot to post the new rates
after the last checkin (to simplify parsing of email addresses):
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.100 lost +100.00%
0.025 0.025 tied
0.075 0.050 won -33.33%
0.000 0.000 tied
0.075 0.075 tied
0.075 0.050 won -33.33%
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.050 0.050 tied
0.050 0.050 tied
0.000 0.000 tied
0.000 0.000 tied
0.075 0.075 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.100 lost +100.00%
won 2 times
tied 16 times
lost 2 times
total unique fp went from 12 to 14 lost +16.67%
false negative percentages
0.327 0.291 won -11.01%
0.400 0.364 won -9.00%
0.364 0.254 won -30.22%
0.691 0.582 won -15.77%
0.545 0.545 tied
0.291 0.218 won -25.09%
0.291 0.218 won -25.09%
0.618 0.654 lost +5.83%
0.436 0.364 won -16.51%
0.327 0.255 won -22.02%
0.364 0.400 lost +9.89%
0.691 0.654 won -5.35%
0.618 0.618 tied
0.291 0.291 tied
0.291 0.291 tied
0.436 0.436 tied
0.473 0.436 won -7.82%
0.218 0.218 tied
0.291 0.255 won -12.37%
0.254 0.182 won -28.35%
won 12 times
tied 6 times
lost 2 times
total unique fn went from 110 to 101 won -8.18%
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** tokenizer.py 8 Sep 2002 21:08:16 -0000 1.7
--- tokenizer.py 8 Sep 2002 21:29:05 -0000 1.8
***************
*** 604,609 ****
# all the charsets
#
! # This has huge benefit for the f-n rate, and virtually none on the f-p rate,
! # although it does reduce the variance of the f-p rate across different
# training sets (really marginal msgs, like a brief HTML msg saying just
# "unsubscribe me", are almost always tagged as spam now; before they were
--- 604,609 ----
# all the charsets
#
! # This has huge benefit for the f-n rate, and virtually no effect on the f-p
! # rate, although it does reduce the variance of the f-p rate across different
# training sets (really marginal msgs, like a brief HTML msg saying just
# "unsubscribe me", are almost always tagged as spam now; before they were
From tim_one@users.sourceforge.net Mon Sep 9 00:48:52 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 16:48:52 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.13,1.14
tokenizer.py,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10431
Modified Files:
timtest.py tokenizer.py
Log Message:
Tried to treat src= params specially. It made no difference, so left
the code but commented it out. Refactored code to parse "file names"
as part of this, and left that change in.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** timtest.py 8 Sep 2002 21:08:16 -0000 1.13
--- timtest.py 8 Sep 2002 23:48:50 -0000 1.14
***************
*** 141,147 ****
self.trained_spam_hist = Hist(self.nbuckets)
! #f = file('w.pik', 'wb')
! #pickle.dump(self.classifier, f, 1)
! #f.close()
#import sys
#sys.exit(0)
--- 141,147 ----
self.trained_spam_hist = Hist(self.nbuckets)
! f = file('w.pik', 'wb')
! pickle.dump(self.classifier, f, 1)
! f.close()
#import sys
#sys.exit(0)
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** tokenizer.py 8 Sep 2002 21:29:05 -0000 1.8
--- tokenizer.py 8 Sep 2002 23:48:50 -0000 1.9
***************
*** 558,561 ****
--- 558,587 ----
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
+ # Anthony Baxter reported goodness from cracking src params.
+ # Finding a src= thingie is complicated if we insist it appear in an
+ # img or iframe tag, so this approximates reality with a fast and
+ # non-stack-blowing simple regexp.
+ src_re = re.compile(r"""
+ \s
+ src=['"]
+ (?!https?:) # we suck out http thingies via a different gimmick
+ ([^'"]{1,128}) # capture the guts, but don't go wild
+ ['"]
+ """, re.VERBOSE)
+
+ fname_sep_re = re.compile(r'[/\\:]')
+
+ def crack_filename(fname):
+ yield "fname:" + fname
+ components = fname_sep_re.split(fname)
+ morethan1 = len(components) > 1
+ for component in components:
+ if morethan1:
+ yield "fname comp:" + component
+ pieces = urlsep_re.split(component)
+ if len(pieces) > 1:
+ for piece in pieces:
+ yield "fname piece:" + piece
+
def tokenize_word(word, _len=len):
n = _len(word)
***************
*** 701,707 ****
fname = msg.get_filename()
if fname is not None:
! for x in fname.lower().split('/'):
! for y in x.split('.'):
! yield 'filename:' + y
if 0: # disabled; see comment before function
--- 727,732 ----
fname = msg.get_filename()
if fname is not None:
! for x in crack_filename(fname):
! yield 'filename:' + x
if 0: # disabled; see comment before function
***************
*** 874,877 ****
--- 899,913 ----
for chunk in urlsep_re.split(piece):
yield prefix + chunk
+
+ # Anthony Baxter reported goodness from tokenizing src= params.
+ # XXX This made no difference in my tests: both error rates
+ # XXX across 20 runs were identical before and after. I suspect
+ # XXX this is because Anthony got most good out of the http
+ # XXX thingies in , but we
+ # XXX picked those up in the last step (in src params and
+ # XXX everywhere else). So this code is commented out.
+ ## for fname in src_re.findall(text):
+ ## for x in crack_filename(fname):
+ ## yield "src:" + x
# Remove HTML/XML tags if it's a plain text message.
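The token stream crack_filename() produces is easiest to see by running it. This sketch copies the function from the diff above; the `urlsep_re` pattern shown here is an assumption (the real one is defined elsewhere in tokenizer.py), chosen only to demonstrate the fname/comp/piece layering:

```python
import re

fname_sep_re = re.compile(r'[/\\:]')
urlsep_re = re.compile(r'[;?:@&=+,$.]')  # assumed pattern, for illustration

def crack_filename(fname):
    yield "fname:" + fname
    components = fname_sep_re.split(fname)
    morethan1 = len(components) > 1
    for component in components:
        if morethan1:
            yield "fname comp:" + component
        pieces = urlsep_re.split(component)
        if len(pieces) > 1:
            for piece in pieces:
                yield "fname piece:" + piece

print(list(crack_filename('images/logo.gif')))
```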
From tim_one@users.sourceforge.net Mon Sep 9 00:49:53 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 16:49:53 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10775
Modified Files:
timtest.py
Log Message:
Oops -- checked in a private change by mistake. Backing it out.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** timtest.py 8 Sep 2002 23:48:50 -0000 1.14
--- timtest.py 8 Sep 2002 23:49:51 -0000 1.15
***************
*** 141,147 ****
self.trained_spam_hist = Hist(self.nbuckets)
! f = file('w.pik', 'wb')
! pickle.dump(self.classifier, f, 1)
! f.close()
#import sys
#sys.exit(0)
--- 141,147 ----
self.trained_spam_hist = Hist(self.nbuckets)
! #f = file('w.pik', 'wb')
! #pickle.dump(self.classifier, f, 1)
! #f.close()
#import sys
#sys.exit(0)
From gvanrossum@users.sourceforge.net Mon Sep 9 00:53:25 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Sun, 08 Sep 2002 16:53:25 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.11,1.12
GBayes.py,1.1,NONE
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11378
Modified Files:
README.txt
Removed Files:
GBayes.py
Log Message:
Get rid of GBayes.py.
It was old and the relevant pieces are now in hammie.py.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** README.txt 8 Sep 2002 17:46:14 -0000 1.11
--- README.txt 8 Sep 2002 23:53:23 -0000 1.12
***************
*** 49,57 ****
tokenize() function of your choosing.
- GBayes.py
- A number of tokenizers and a partial test driver. This assumes
- an mbox format. Could stand massive refactoring. I don't think
- it's been kept up to date.
-
Test Utilities
--- 49,52 ----
--- GBayes.py DELETED ---
From tim_one@users.sourceforge.net Mon Sep 9 05:56:14 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 21:56:14 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.9,1.10
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29522
Modified Files:
tokenizer.py
Log Message:
Pure win, from enabling Anthony's "count the mere # of various header
lines, case-sensitively" on a small subset of header lines. This
avoids all the header lines the union of Greg and Barry told me *might*
be artifacts of Mailman and/or BruceG's (the spam collector's)
email setup. It's an open question how much this may merely be
discriminating newsgroup traffic from non-newsgroup mail, but I also
left out what I thought were obvious newsgroupy headers (like References:).
The presence of X-Complaints-To happens to be a very strong discriminator
in my data, and accounts for redeeming 6 of the 14 previous false
positives.
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.100 0.050 won -50.00%
0.025 0.000 won -100.00%
0.050 0.025 won -50.00%
0.000 0.000 tied
0.075 0.075 tied
0.050 0.025 won -50.00%
0.025 0.025 tied
0.025 0.000 won -100.00%
0.050 0.050 tied
0.050 0.000 won -100.00%
0.050 0.025 won -50.00%
0.000 0.000 tied
0.000 0.000 tied
0.075 0.050 won -33.33%
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.100 0.050 won -50.00%
won 9 times
tied 11 times
lost 0 times
total unique fp went from 14 to 8 won -42.86%
false negative percentages
0.291 0.255 won -12.37%
0.364 0.364 tied
0.254 0.254 tied
0.582 0.509 won -12.54%
0.545 0.436 won -20.00%
0.218 0.218 tied
0.218 0.182 won -16.51%
0.654 0.582 won -11.01%
0.364 0.327 won -10.16%
0.255 0.255 tied
0.400 0.254 won -36.50%
0.654 0.582 won -11.01%
0.618 0.545 won -11.81%
0.291 0.255 won -12.37%
0.291 0.291 tied
0.436 0.400 won -8.26%
0.436 0.291 won -33.26%
0.218 0.218 tied
0.255 0.218 won -14.51%
0.182 0.145 won -20.33%
won 14 times
tied 6 times
lost 0 times
total unique fn went from 101 to 89 won -11.88%
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** tokenizer.py 8 Sep 2002 23:48:50 -0000 1.9
--- tokenizer.py 9 Sep 2002 04:56:12 -0000 1.10
***************
*** 745,748 ****
--- 745,770 ----
yield '.'.join(parts[:i])
+ # We're merely going to count the number of these, and case-sensitively.
+ safe_headers = Set("""
+ abuse-reports-to
+ date
+ errors-to
+ from
+ importance
+ in-reply-to
+ message-id
+ mime-version
+ organization
+ received
+ reply-to
+ return-path
+ subject
+ to
+ user-agent
+ x-abuse-info
+ x-complaints-to
+ x-face
+ """.split())
+
class Tokenizer:
***************
*** 823,835 ****
yield prefix + ' '.join(x.split())
- # Organization:
- # Oddly enough, tokenizing this doesn't make any difference to
- # results. However, noting its mere absence is strong enough
- # to give a tiny improvement in the f-n rate, and since
- # recording that requires only one token across the whole
- # database, the cost is also tiny.
- if msg.get('organization', None) is None:
- yield "bool:noorg"
-
# Received:
# Neil Schemenauer reported good results from tokenizing prefixes
--- 845,848 ----
***************
*** 867,870 ****
--- 880,891 ----
##for x in x2n.items():
## yield "header:%s:%d" % x
+
+ # Do a "safe" approximation to that for now.
+ x2n = {}
+ for x in msg.keys():
+ if x.lower() in safe_headers:
+ x2n[x] = x2n.get(x, 0) + 1
+ for x in x2n.items():
+ yield "header:%s:%d" % x
def tokenize_body(self, msg):
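The header-counting loop added above is small enough to exercise directly. A sketch with a reduced safe_headers set (the real set is in the diff; `sorted()` is added here only so the output order is deterministic, which the real generator does not guarantee):

```python
# Count occurrences of whitelisted header names, case-sensitively,
# and emit one "header:Name:count" token per distinct spelling.
safe_headers = set("from received subject to x-complaints-to".split())

def header_count_tokens(header_names):
    x2n = {}
    for x in header_names:
        if x.lower() in safe_headers:
            x2n[x] = x2n.get(x, 0) + 1
    for item in sorted(x2n.items()):
        yield "header:%s:%d" % item

print(list(header_count_tokens(['Received', 'Received', 'From', 'X-Complaints-To'])))
# ['header:From:1', 'header:Received:2', 'header:X-Complaints-To:1']
```

Because counting is case-sensitive, `From` and `FROM` would produce separate tokens, which is the point: header-case habits themselves carry evidence.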
From tim_one@users.sourceforge.net Mon Sep 9 17:19:41 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 09 Sep 2002 09:19:41 -0700
Subject: [Spambayes-checkins] spambayes Options.py,NONE,1.1
README.txt,1.12,1.13
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20464
Modified Files:
README.txt
Added Files:
Options.py
Log Message:
Options.options is intended to be shared global state, for customizing
what the classifier and tokenizer do in a controlled and reportable
way (note that options.display() produces a nice string spelling out the
options in effect). Nothing uses this yet.
--- NEW FILE: Options.py ---
from sets import Set
# Descriptions of options.
# Empty lines, and lines starting with a blank, are ignored.
# A line starting with a non-blank character is of the form:
# option_name "default" default_value
# option_name must not contain whitespace
# default_value must be eval'able.
option_descriptions = """
retain_pure_html_tags default False
By default, HTML tags are stripped from pure text/html messages.
Set retain_pure_html_tags True to retain HTML tags in this case.
"""
class OptionsClass(dict):
def __init__(self):
self.optnames = Set()
for line in option_descriptions.split('\n'):
if not line or line.startswith(' '):
continue
i = line.index(' ')
name = line[:i]
self.optnames.add(name)
i = line.index(' default ', i)
self.setopt(name, eval(line[i+9:], {}))
def _checkname(self, name):
if name not in self.optnames:
raise KeyError("there's no option named %r" % name)
def setopt(self, name, value):
self._checkname(name)
self[name] = value
def display(self):
"""Return a string showing current option values."""
result = ['Option values:\n']
width = max([len(name) for name in self.keys()])
items = self.items()
items.sort()
for name, value in items:
result.append(' %-*s: %r\n' % (width, name, value))
return ''.join(result)
options = OptionsClass()
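The description-parsing convention above (a flush-left line is "name default value"; indented lines are documentation) can be checked with a minimal standalone rerun of the same parse:

```python
# Minimal rerun of Options.py's parsing convention, outside the class.
option_descriptions = """
retain_pure_html_tags default False
  By default, HTML tags are stripped from pure text/html messages.
"""

opts = {}
for line in option_descriptions.split('\n'):
    if not line or line.startswith(' '):
        continue  # blank or documentation line
    i = line.index(' ')
    name = line[:i]
    i = line.index(' default ', i)
    opts[name] = eval(line[i + 9:], {})

print(opts)  # {'retain_pure_html_tags': False}
```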
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** README.txt 8 Sep 2002 23:53:23 -0000 1.12
--- README.txt 9 Sep 2002 16:19:39 -0000 1.13
***************
*** 22,25 ****
--- 22,32 ----
Primary Files
=============
+ Options.py
+ A start at a flexible way to control what the tokenizer and
+ classifier do. Different people are finding different ways in
+ which their test data is biased, and so fiddle the code to
+ worm around that. It's become almost impossible to know
+ exactly what someone did when they report results.
+
classifier.py
An implementation of a Graham-like classifier.
From tim_one@users.sourceforge.net Mon Sep 9 17:39:31 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 09 Sep 2002 09:39:31 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.15,1.16
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv27227
Modified Files:
timtest.py
Log Message:
There's now a required int argument (-n) giving the number of ham/spam
sets in "the standard" test directory setup.
Also attempts to import bayescustomize. If that exists, it can be used
to fiddle the settings in Options.options.
Regardless of whether bayescustomize exists, the settings in
Options.options are now displayed at the start of the run.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** timtest.py 8 Sep 2002 23:49:51 -0000 1.15
--- timtest.py 9 Sep 2002 16:39:27 -0000 1.16
***************
*** 1,10 ****
#! /usr/bin/env python
! NSETS = 5
! SPAMDIRS = ["Data/Spam/Set%d" % i for i in range(1, NSETS+1)]
! HAMDIRS = ["Data/Ham/Set%d" % i for i in range(1, NSETS+1)]
! SPAMHAMDIRS = zip(SPAMDIRS, HAMDIRS)
import os
from sets import Set
import cPickle as pickle
--- 1,23 ----
#! /usr/bin/env python
+ # At the moment, this requires Python 2.3 from CVS (heapq, Set, enumerate).
! # A test driver using "the standard" test directory structure. See also
! # rates.py and cmp.py for summarizing results.
!
! """Usage: %(program)s [options]
!
! Where:
! -h
! Show usage and exit.
! -n int
! Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
! This is required.
!
! In addition, an attempt is made to import bayescustomize. If that exists,
! it can be used to change the settings in Options.options.
! """
import os
+ import sys
from sets import Set
import cPickle as pickle
***************
*** 15,18 ****
--- 28,39 ----
from tokenizer import tokenize
+ def usage(code, msg=''):
+ """Print usage message and sys.exit(code)."""
+ if msg:
+ print >> sys.stderr, msg
+ print >> sys.stderr
+ print >> sys.stderr, __doc__ % globals()
+ sys.exit(code)
+
class Hist:
def __init__(self, nbuckets=20):
***************
*** 217,226 ****
self.trained_spam_hist += local_spam_hist
! def drive():
! d = Driver()
! for spamdir, hamdir in SPAMHAMDIRS:
d.train(MsgStream(hamdir), MsgStream(spamdir))
! for sd2, hd2 in SPAMHAMDIRS:
if (sd2, hd2) == (spamdir, hamdir):
continue
--- 238,254 ----
self.trained_spam_hist += local_spam_hist
! def drive(nsets):
! import Options
! print Options.options.display()
!
! spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)]
! hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets+1)]
! spamhamdirs = zip(spamdirs, hamdirs)
!
! d = Driver()
! for spamdir, hamdir in spamhamdirs:
d.train(MsgStream(hamdir), MsgStream(spamdir))
! for sd2, hd2 in spamhamdirs:
if (sd2, hd2) == (spamdir, hamdir):
continue
***************
*** 230,232 ****
if __name__ == "__main__":
! drive()
--- 258,282 ----
if __name__ == "__main__":
! import getopt
!
! try:
! opts, args = getopt.getopt(sys.argv[1:], 'hn:')
! except getopt.error, msg:
! usage(1, msg)
!
! nsets = None
! for opt, arg in opts:
! if opt == '-h':
! usage(0)
! elif opt == '-n':
! nsets = int(arg)
!
! if args:
! usage(1, "Positional arguments not supported")
!
! try:
! import bayescustomize
! except ImportError:
! pass
!
! drive(nsets)
From tim_one@users.sourceforge.net Mon Sep 9 17:31:54 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 09 Sep 2002 09:31:54 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.10,1.11
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25291
Modified Files:
tokenizer.py
Log Message:
Whether the tokenizer strips HTML tags from pure HTML msgs is now
controlled by the setting of Options.options['retain_pure_html_tags'].
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** tokenizer.py 9 Sep 2002 04:56:12 -0000 1.10
--- tokenizer.py 9 Sep 2002 16:31:50 -0000 1.11
***************
*** 5,8 ****
--- 5,10 ----
from sets import Set
+ from Options import options
+
##############################################################################
# To fold case or not to fold case? I didn't want to fold case, because
***************
*** 890,893 ****
--- 892,907 ----
def tokenize_body(self, msg):
+ """Generate a stream of tokens from an email Message.
+
+ If a multipart/alternative section has both text/plain and text/html
+ sections, the text/html section is ignored. This may not be a good
+ idea (e.g., the sections may have different content).
+
+ HTML tags are always stripped from text/plain sections.
+
+ Options.options['retain_pure_html_tags'] controls whether HTML tags are
+ also stripped from text/html sections.
+ """
+
# Find, decode (base64, qp), and tokenize textual parts of the body.
for part in textparts(msg):
***************
*** 932,937 ****
## yield "src:" + x
! # Remove HTML/XML tags if it's a plain text message.
! if part.get_content_type() == "text/plain":
text = html_re.sub(' ', text)
--- 946,952 ----
## yield "src:" + x
! # Remove HTML/XML tags.
! if (part.get_content_type() == "text/plain" or
! not options['retain_pure_html_tags']):
text = html_re.sub(' ', text)
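The new stripping condition is just a two-input boolean, worth spelling out as a truth table (the helper function is illustrative; in tokenizer.py the test is inline):

```python
def strip_tags(content_type, retain_pure_html_tags):
    """True when tokenize_body() should remove HTML/XML tags (rev 1.11)."""
    return content_type == "text/plain" or not retain_pure_html_tags

print(strip_tags("text/plain", True))   # True: plain text is always stripped
print(strip_tags("text/html", False))   # True: the default strips HTML too
print(strip_tags("text/html", True))    # False: the option retains the tags
```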
From tim_one@users.sourceforge.net Mon Sep 9 19:49:21 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 09 Sep 2002 11:49:21 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.1,1.2
tokenizer.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv28560
Modified Files:
Options.py tokenizer.py
Log Message:
Moved the safe_headers set into the options.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** Options.py 9 Sep 2002 16:19:38 -0000 1.1
--- Options.py 9 Sep 2002 18:49:18 -0000 1.2
***************
*** 10,15 ****
option_descriptions = """
retain_pure_html_tags default False
! By default, HTML tags are stripped from pure text/html messages.
! Set retain_pure_html_tags True to retain HTML tags in this case.
"""
--- 10,24 ----
option_descriptions = """
retain_pure_html_tags default False
! By default, tokenizer.Tokenizer.tokenize_body() strips HTML tags
! from pure text/html messages. Set to True to retain HTML tags in
! this case.
!
! safe_headers default Set("abuse-reports-to date errors-to from importance in-reply-to message-id mime-version organization received reply-to return-path subject to user-agent x-abuse-info x-complaints-to x-face".split())
! tokenizer.Tokenizer.tokenize_headers() generates tokens just counting
! the number of instances of the headers in this set, in a case-sensitive
! way. Depending on data collection, some headers aren't safe to count.
! For example, if ham is collected from a mailing list but spam from your
! regular inbox traffic, the presence of a header like List-Info will be a
! very strong ham clue, but a bogus one.
"""
***************
*** 17,20 ****
--- 26,30 ----
def __init__(self):
self.optnames = Set()
+ evaldict = {'Set': Set}
for line in option_descriptions.split('\n'):
if not line or line.startswith(' '):
***************
*** 24,28 ****
self.optnames.add(name)
i = line.index(' default ', i)
! self.setopt(name, eval(line[i+9:], {}))
def _checkname(self, name):
--- 34,38 ----
self.optnames.add(name)
i = line.index(' default ', i)
! self.setopt(name, eval(line[i+9:], evaldict))
def _checkname(self, name):
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** tokenizer.py 9 Sep 2002 16:31:50 -0000 1.11
--- tokenizer.py 9 Sep 2002 18:49:19 -0000 1.12
***************
*** 747,772 ****
yield '.'.join(parts[:i])
- # We're merely going to count the number of these, and case-sensitively.
- safe_headers = Set("""
- abuse-reports-to
- date
- errors-to
- from
- importance
- in-reply-to
- message-id
- mime-version
- organization
- received
- reply-to
- return-path
- subject
- to
- user-agent
- x-abuse-info
- x-complaints-to
- x-face
- """.split())
-
class Tokenizer:
--- 747,750 ----
***************
*** 884,887 ****
--- 862,866 ----
# Do a "safe" approximation to that for now.
+ safe_headers = options['safe_headers']
x2n = {}
for x in msg.keys():
From montanaro@users.sourceforge.net Mon Sep 9 20:23:26 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Mon, 09 Sep 2002 12:23:26 -0700
Subject: [Spambayes-checkins] spambayes loosecksum.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2588
Added Files:
loosecksum.py
Log Message:
calculate a "loose" checksum for an email message
--- NEW FILE: loosecksum.py ---
#!/usr/local/bin/python
"""
Compute a 'loose' checksum on the msg (file on cmdline or via stdin).
Attempts are made to eliminate content which tends to obscure the 'sameness'
of messages. This is aimed particularly at spam, which tends to contain
lots of small differences across messages to try to thwart spam filters, in
hopes that at least one copy reaches its destination.
Before calculating the checksum, this script does the following:
* delete the message header
* delete HTML tags which generally contain URLs
* delete anything which looks like an email address or URL
* finally, discard everything other than ascii letters and digits (note
that this will almost certainly be ineffectual for spam written in
eastern languages such as Korean)
An MD5 checksum is then computed for the resulting text and written to stdout.
"""
import getopt
import sys
import email.Parser
import md5
import re
import time
import binascii
def zaptags(data, *tags):
"""delete all tags (and /tags) from input data given as arguments"""
for pat in tags:
pat = pat.split(":")
sub = ""
if len(pat) >= 2:
sub = pat[-1]
pat = ":".join(pat[:-1])
else:
pat = pat[0]
sub = ""
if '\\' in sub:
sub = _zap_esc_map(sub)
try:
data = re.sub(r'(?i)</?(%s)(?:\s[^>]*)?>'%pat, sub, data)
except TypeError:
print (pat, sub, data)
raise
return data
def clean(data):
"""Clean the obviously variable stuff from a chunk of data.
The first (and perhaps only) use of this is to try and eliminate bits
of data that keep multiple spam email messages from looking the same.
"""
# Get rid of any HTML tags that hold URLs - tend to have varying content
# I suppose i could just get rid of all HTML tags
data = zaptags(data, 'a', 'img', 'base', 'frame')
# delete anything that looks like an email address
data = re.sub(r"(?i)[-a-z0-9_.+]+@[-a-z0-9_.]+\.([a-z]+)", "", data)
# delete anything that looks like a url (catch bare urls)
data = re.sub(r"(?i)(ftp|http|gopher)://[-a-z0-9_/?&%@=+:;#!~|.,$*]+", "", data)
# throw away everything other than alpha & digits
return re.sub(r"[^A-Za-z0-9]+", "", data)
def flatten(obj):
# I do not know how to use the email package very well - all I want here
# is the body of obj expressed as a string - there is probably a better
# way to accomplish this which I haven't discovered.
# three types are possible: string, Message (hasattr(get_payload)), list
if isinstance(obj, str):
return obj
if hasattr(obj, "get_payload"):
return flatten(obj.get_payload())
if isinstance(obj, list):
return "\n".join([flatten(b) for b in obj])
raise TypeError, ("unrecognized body type: %s" % type(obj))
def generate_checksum(f):
body = flatten(email.Parser.Parser().parse(f))
return binascii.b2a_hex(md5.new(clean(body)).digest())
def main(args):
opts, args = getopt.getopt(args, "")
for opt, arg in opts:
pass
if not args:
inf = sys.stdin
else:
inf = file(args[0])
print generate_checksum(inf)
if __name__ == "__main__":
main(sys.argv[1:])
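The same idea — normalize away the variable parts of a message, then hash what's left — can be sketched in modern Python (hashlib replacing the long-deprecated md5 module; the function name and regexes here are an illustrative condensation, not the script's exact API):

```python
import hashlib
import re

def loose_checksum(text):
    """Checksum that ignores the parts of a message that vary per copy."""
    # drop URL-bearing tags, e-mail addresses, and bare URLs
    text = re.sub(r"(?i)</?(a|img|base|frame)(?:\s[^>]*)?>", "", text)
    text = re.sub(r"(?i)[-a-z0-9_.+]+@[-a-z0-9_.]+\.[a-z]+", "", text)
    text = re.sub(r"(?i)(?:ftp|http|gopher)://\S+", "", text)
    # keep only letters and digits, so whitespace tweaks don't matter
    text = re.sub(r"[^A-Za-z0-9]+", "", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Two spam copies that differ only in their tracking URLs and whitespace then produce the same checksum.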
From montanaro@users.sourceforge.net Mon Sep 9 20:24:54 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Mon, 09 Sep 2002 12:24:54 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.13,1.14
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2873
Modified Files:
README.txt
Log Message:
add blurb about loosecksum.py
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** README.txt 9 Sep 2002 16:19:39 -0000 1.13
--- README.txt 9 Sep 2002 19:24:52 -0000 1.14
***************
*** 79,82 ****
--- 79,86 ----
useful to delete headers which incorrectly might bias the results.
+ loosecksum.py
+ A script to calculate a "loose" checksum for a message. See the text of
+ the script for an operational definition of "loose".
+
mboxcount.py
Count the number of messages (both parseable and unparseable) in
From nascheme@users.sourceforge.net Mon Sep 9 22:21:56 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Mon, 09 Sep 2002 14:21:56 -0700
Subject: [Spambayes-checkins]
spambayes cdb.py,NONE,1.1 neilfilter.py,NONE,1.1 neiltrain.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8534
Added Files:
cdb.py neilfilter.py neiltrain.py
Log Message:
Add a pure Python implementation of CDB and two scripts that use it. It
seems pretty zippy for both reading and creating.
--- NEW FILE: cdb.py ---
"""
Dan Bernstein's CDB implemented in Python
see http://cr.yp.to/cdb.html
"""
import os
import struct
import mmap
import sys

def uint32_unpack(buf):
    return struct.unpack('<L', buf)[0]

def uint32_pack(n):
    return struct.pack('<L', n)

CDB_HASHSTART = 5381

def cdb_hash(buf):
    h = CDB_HASHSTART
    for c in buf:
        h = (h + (h << 5)) & 0xffffffffL
        h ^= ord(c)
    return h

class Cdb(object):

    def __init__(self, fp):
        self.fp = fp
        fd = fp.fileno()
        self.size = os.fstat(fd).st_size
        self.map = mmap.mmap(fd, self.size, access=mmap.ACCESS_READ)
        self.loop = 0

    def read(self, n, pos):
        return self.map[pos:pos+n]

    def match(self, key, pos):
        return key == self.read(len(key), pos)

    def findstart(self):
        self.loop = 0

    def findnext(self, key):
        if not self.loop:
            u = cdb_hash(key)
            buf = self.read(8, (u << 3) & 2047)
            self.hslots = uint32_unpack(buf[4:])
            if not self.hslots:
                raise KeyError
            self.hpos = uint32_unpack(buf[:4])
            self.khash = u
            u >>= 8
            u %= self.hslots
            u <<= 3
            self.kpos = self.hpos + u
        while self.loop < self.hslots:
            buf = self.read(8, self.kpos)
            pos = uint32_unpack(buf[4:])
            if not pos:
                raise KeyError
            self.loop += 1
            self.kpos += 8
            if self.kpos == self.hpos + (self.hslots << 3):
                self.kpos = self.hpos
            u = uint32_unpack(buf[:4])
            if u == self.khash:
                buf = self.read(8, pos)
                u = uint32_unpack(buf[:4])
                if u == len(key):
                    if self.match(key, pos + 8):
                        dlen = uint32_unpack(buf[4:])
                        dpos = pos + 8 + len(key)
                        return self.read(dlen, dpos)
        raise KeyError

    def __getitem__(self, key):
        self.findstart()
        return self.findnext(key)

    def get(self, key, default=None):
        self.findstart()
        try:
            return self.findnext(key)
        except KeyError:
            return default

def cdb_make(outfile, items):
    pos = 2048
    tables = {} # { h & 255 : [(h, p)] }

    # write keys and data
    outfile.seek(pos)
    for key, value in items:
        outfile.write(uint32_pack(len(key)) + uint32_pack(len(value)))
        h = cdb_hash(key)
        outfile.write(key)
        outfile.write(value)
        tables.setdefault(h & 255, []).append((h, pos))
        pos += 8 + len(key) + len(value)

    final = ''
    # write hash tables
    for i in range(256):
        entries = tables.get(i, [])
        nslots = 2*len(entries)
        final += uint32_pack(pos) + uint32_pack(nslots)
        null = (0, 0)
        table = [null] * nslots
        for h, p in entries:
            n = (h >> 8) % nslots
            while table[n] is not null:
                n = (n + 1) % nslots
            table[n] = (h, p)
        for h, p in table:
            outfile.write(uint32_pack(h) + uint32_pack(p))
            pos += 8

    # write header (pointers to tables and their lengths)
    outfile.flush()
    outfile.seek(0)
    outfile.write(final)

def test():
    #db = Cdb(open("t"))
    #print db['one']
    #print db['two']
    #print db['foo']
    #print db['us']
    #print db.get('ec')
    #print db.get('notthere')
    db = open('test.cdb', 'wb')
    cdb_make(db,
             [('one', 'Hello'),
              ('two', 'Goodbye'),
              ('foo', 'Bar'),
              ('us', 'United States'),
              ])
    db.close()
    db = Cdb(open("test.cdb", 'rb'))
    print db['one']
    print db['two']
    print db['foo']
    print db['us']
    print db.get('ec')
    print db.get('notthere')

if __name__ == '__main__':
    test()
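The hash cdb relies on is Bernstein's h = h*33 XOR byte, kept to 32 bits; its low 8 bits pick one of 256 tables and the remaining bits pick a slot within that table. The arithmetic can be checked in isolation with this standalone Python 3 sketch (the helper names are illustrative):

```python
def cdb_hash(data: bytes) -> int:
    """Bernstein hash: h = h * 33 ^ byte, masked to 32 bits."""
    h = 5381
    for b in data:
        h = ((h + (h << 5)) & 0xFFFFFFFF) ^ b
    return h

def table_and_slot(key: bytes, hslots: int):
    """Which of the 256 tables, and which slot within it, a key lands in."""
    h = cdb_hash(key)
    return h & 255, (h >> 8) % hslots
```

Collisions within a table are resolved by linear probing, which is why cdb_make allocates twice as many slots as entries.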
--- NEW FILE: neilfilter.py ---
#! /usr/bin/env python

"""Usage: %(program)s wordprobs.cdb
"""

import sys
import os
import email
from heapq import heapreplace
from sets import Set

from classifier import MIN_SPAMPROB, MAX_SPAMPROB, UNKNOWN_SPAMPROB, \
     MAX_DISCRIMINATORS
import cdb

program = sys.argv[0] # For usage(); referenced by docstring above

from tokenizer import tokenize

def spamprob(wordprobs, wordstream, evidence=False):
    """Return best-guess probability that wordstream is spam.

    wordprobs is a CDB of word probabilities

    wordstream is an iterable object producing words.

    The return value is a float in [0.0, 1.0].

    If optional arg evidence is True, the return value is a pair
        probability, evidence
    where evidence is a list of (word, probability) pairs.
    """
    # A priority queue to remember the MAX_DISCRIMINATORS best
    # probabilities, where "best" means largest distance from 0.5.
    # The tuples are (distance, prob, word).
    nbest = [(-1.0, None, None)] * MAX_DISCRIMINATORS
    smallest_best = -1.0

    mins = [] # all words w/ prob MIN_SPAMPROB
    maxs = [] # all words w/ prob MAX_SPAMPROB
    # Counting a unique word multiple times hurts, although counting one
    # at most two times had some benefit when UNKNOWN_SPAMPROB was 0.2.
    # When that got boosted to 0.5, counting more than once became
    # counterproductive.
    for word in Set(wordstream):
        prob = float(wordprobs.get(word, UNKNOWN_SPAMPROB))
        distance = abs(prob - 0.5)
        if prob == MIN_SPAMPROB:
            mins.append((distance, prob, word))
        elif prob == MAX_SPAMPROB:
            maxs.append((distance, prob, word))
        elif distance > smallest_best:
            # Subtle: we didn't use ">" instead of ">=" just to save
            # calls to heapreplace(). The real intent is that if
            # there are many equally strong indicators throughout the
            # message, we want to favor the ones that appear earliest:
            # it's expected that spam headers will often have smoking
            # guns, and, even when not, spam has to grab your attention
            # early (& note that when spammers generate large blocks of
            # random gibberish to throw off exact-match filters, it's
            # always at the end of the msg -- if they put it at the
            # start, *nobody* would read the msg).
            heapreplace(nbest, (distance, prob, word))
            smallest_best = nbest[0][0]

    # Compute the probability. Note: This is what Graham's code did,
    # but it's dubious for reasons explained in great detail on Python-
    # Dev: it's missing P(spam) and P(not-spam) adjustments that
    # straightforward Bayesian analysis says should be here. It's
    # unclear how much it matters, though, as the omissions here seem
    # to tend in part to cancel out distortions introduced earlier by
    # HAMBIAS. Experiments will decide the issue.
    clues = []

    # First cancel out competing extreme clues (see comment block at
    # MAX_DISCRIMINATORS declaration -- this is a twist on Graham).
    if mins or maxs:
        if len(mins) < len(maxs):
            shorter, longer = mins, maxs
        else:
            shorter, longer = maxs, mins
        tokeep = min(len(longer) - len(shorter), MAX_DISCRIMINATORS)
        # They're all good clues, but we're only going to feed the tokeep
        # initial clues from the longer list into the probability
        # computation.
        for dist, prob, word in shorter + longer[tokeep:]:
            if evidence:
                clues.append((word, prob))
        for x in longer[:tokeep]:
            heapreplace(nbest, x)

    prob_product = inverse_prob_product = 1.0
    for distance, prob, word in nbest:
        if prob is None: # it's one of the dummies nbest started with
            continue
        if evidence:
            clues.append((word, prob))
        prob_product *= prob
        inverse_prob_product *= 1.0 - prob

    prob = prob_product / (prob_product + inverse_prob_product)
    if evidence:
        clues.sort(lambda a, b: cmp(a[1], b[1]))
        return prob, clues
    else:
        return prob

def formatclues(clues, sep="; "):
    """Format the clues into something readable."""
    return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])

def is_spam(wordprobs, input):
    """Filter (judge) a message"""
    msg = email.message_from_file(input)
    prob, clues = spamprob(wordprobs, tokenize(msg), True)
    #print "%.2f;" % prob, formatclues(clues)
    if prob < 0.9:
        return False
    else:
        return True

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def main():
    if len(sys.argv) != 2:
        usage(2)

    wordprobs = cdb.Cdb(open(sys.argv[1], 'rb'))
    if is_spam(wordprobs, sys.stdin):
        sys.exit(1)
    else:
        sys.exit(0)

if __name__ == "__main__":
    main()
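Two pieces of spamprob() above generalize nicely: keeping the N clues farthest from 0.5 with a bounded min-heap, and combining them with Graham's prod(p) / (prod(p) + prod(1-p)). A self-contained Python 3 sketch of just those two mechanisms (function names are illustrative; the word tracking and extreme-clue cancellation are omitted):

```python
from heapq import heapreplace

def strongest_clues(probs, n):
    """Keep the n probabilities farthest from 0.5 (min-heap on distance)."""
    nbest = [(-1.0, 0.0)] * n  # (distance, prob) dummies, distance -1
    for p in probs:
        d = abs(p - 0.5)
        if d > nbest[0][0]:          # stronger than the weakest kept clue
            heapreplace(nbest, (d, p))
    return [p for d, p in nbest if d >= 0.0]

def graham_combine(probs):
    """Graham's combining rule: prod(p) / (prod(p) + prod(1-p))."""
    num = den = 1.0
    for p in probs:
        num *= p
        den *= 1.0 - p
    return num / (num + den)
```

Because the heap's root is always the weakest retained clue, each message costs at most one heapreplace() per word, regardless of how many words it contains.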
--- NEW FILE: neiltrain.py ---
#! /usr/bin/env python

"""Usage: %(program)s spam.mbox ham.mbox wordprobs.cdb
"""

import sys
import os
import mailbox
import email
import classifier
import cdb

program = sys.argv[0] # For usage(); referenced by docstring above

from tokenizer import tokenize

def getmbox(msgs):
    """Return an iterable mbox object"""
    def _factory(fp):
        try:
            return email.message_from_file(fp)
        except email.Errors.MessageParseError:
            return ''

    if msgs.startswith("+"):
        import mhlib
        mh = mhlib.MH()
        mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]),
                                 _factory)
    elif os.path.isdir(msgs):
        # XXX Bogus: use an MHMailbox if the pathname contains /Mail/,
        # else a DirOfTxtFileMailbox.
        if msgs.find("/Mail/") >= 0:
            mbox = mailbox.MHMailbox(msgs, _factory)
        else:
            mbox = DirOfTxtFileMailbox(msgs, _factory)
    else:
        fp = open(msgs)
        mbox = mailbox.PortableUnixMailbox(fp, _factory)
    return mbox

def train(bayes, msgs, is_spam):
    """Train bayes with all messages from a mailbox."""
    mbox = getmbox(msgs)
    for msg in mbox:
        bayes.learn(tokenize(msg), is_spam, False)

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def main():
    """Main program; parse options and go."""
    if len(sys.argv) != 4:
        usage(2)

    spam_name = sys.argv[1]
    ham_name = sys.argv[2]
    db_name = sys.argv[3]

    bayes = classifier.GrahamBayes()
    print 'Training with spam...'
    train(bayes, spam_name, True)
    print 'Training with ham...'
    train(bayes, ham_name, False)
    print 'Updating probabilities...'
    bayes.update_probabilities()
    items = []
    for word, winfo in bayes.wordinfo.iteritems():
        #print `word`, str(winfo.spamprob)
        items.append((word, str(winfo.spamprob)))
    print 'Writing DB...'
    db = open(db_name, "wb")
    cdb.cdb_make(db, items)
    db.close()
    print 'done'

if __name__ == "__main__":
    main()
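The update_probabilities() step assigns each word the spam probability that ends up in the CDB. A minimal Graham-style sketch of that per-word computation (the real GrahamBayes also applies a ham bias and clamps the result to [0.01, 0.99], both omitted here):

```python
def word_spamprob(spam_count, ham_count, nspam, nham):
    """Per-word spam probability from normalized occurrence rates."""
    spamratio = spam_count / nspam   # how often the word appears per spam
    hamratio = ham_count / nham      # how often it appears per ham
    return spamratio / (spamratio + hamratio)
```

Normalizing by nspam and nham first is what keeps an imbalanced training corpus (say, far more ham than spam) from dragging every word's probability toward one end.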
From tim_one@users.sourceforge.net Tue Sep 10 01:06:39 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 09 Sep 2002 17:06:39 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.3,1.4 bayes.ini,1.1,1.2 timtest.py,1.17,1.18
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30316
Modified Files:
Options.py bayes.ini timtest.py
Log Message:
Added a bunch of options to control the test driver.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** Options.py 9 Sep 2002 20:37:14 -0000 1.3
--- Options.py 10 Sep 2002 00:06:36 -0000 1.4
***************
*** 9,18 ****
from sets import Set
! __all__ = ['buildoptions', 'options']
all_options = {
! 'Tokenizer': {'retain_pure_html_tags': ('getboolean', lambda i: bool(i)),
'safe_headers': ('get', lambda s: Set(s.split())),
},
}
--- 9,32 ----
from sets import Set
! __all__ = ['options']
!
! int_cracker = ('getint', None)
! float_cracker = ('getfloat', None)
! boolean_cracker = ('getboolean', bool)
all_options = {
! 'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
'safe_headers': ('get', lambda s: Set(s.split())),
},
+ 'TestDriver': {'nbuckets': int_cracker,
+ 'show_ham_lo': float_cracker,
+ 'show_ham_hi': float_cracker,
+ 'show_spam_lo': float_cracker,
+ 'show_spam_hi': float_cracker,
+ 'show_false_positives': boolean_cracker,
+ 'show_false_negatives': boolean_cracker,
+ 'show_histograms': boolean_cracker,
+ 'show_best_discriminators': boolean_cracker,
+ }
}
***************
*** 39,44 ****
continue
fetcher, converter = goodopts[option]
! rawvalue = getattr(c, fetcher)(section, option)
! value = converter(rawvalue)
setattr(options, option, value)
--- 53,59 ----
continue
fetcher, converter = goodopts[option]
! value = getattr(c, fetcher)(section, option)
! if converter is not None:
! value = converter(value)
setattr(options, option, value)
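The (fetcher, converter) "cracker" pairs in this diff are a small dispatch table over ConfigParser's typed getters. The same pattern against Python 3's configparser (section and option names here are illustrative):

```python
import configparser

int_cracker = ('getint', None)
float_cracker = ('getfloat', None)
boolean_cracker = ('getboolean', bool)

all_options = {
    'TestDriver': {'nbuckets': int_cracker,
                   'show_spam_lo': float_cracker,
                   'show_histograms': boolean_cracker},
}

def crack(config):
    """Pull typed option values out of a parsed config, per the table."""
    values = {}
    for section, goodopts in all_options.items():
        for option, (fetcher, converter) in goodopts.items():
            value = getattr(config, fetcher)(section, option)
            if converter is not None:
                value = converter(value)
            values[option] = value
    return values
```

A None converter means the fetcher's return value is already the right type, which is why the int and float crackers skip the second step.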
Index: bayes.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/bayes.ini,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** bayes.ini 9 Sep 2002 20:37:14 -0000 1.1
--- bayes.ini 10 Sep 2002 00:06:37 -0000 1.2
***************
*** 1,6 ****
[Tokenizer]
# By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags
! # stripped from pure text/html messages. Set to True to retain HTML tags
! # in this case.
retain_pure_html_tags: False
--- 1,6 ----
[Tokenizer]
# By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags
! # from pure text/html messages. Set to True to retain HTML tags in
! # this case.
retain_pure_html_tags: False
***************
*** 29,30 ****
--- 29,49 ----
x-complaints-to
x-face
+
+ [TestDriver]
+ # These control various displays in class Drive (timtest.py).
+
+ # Number of buckets in histograms.
+ nbuckets: 40
+ show_histograms: True
+
+ # Display spam when
+ # show_spam_lo <= spamprob <= show_spam_hi
+ # and likewise for ham. The defaults here don't show anything.
+ show_spam_lo: 1.0
+ show_spam_hi: 0.0
+ show_ham_lo: 1.0
+ show_ham_hi: 0.0
+
+ show_false_positives: True
+ show_false_negatives: False
+ show_best_discriminators: True
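The inverted defaults above (lo=1.0, hi=0.0) are a deliberate trick: the display test is lo <= prob <= hi, and an inverted interval is empty, so nothing is shown until bayescustomize.ini widens the window. A tiny sketch of the predicate:

```python
def should_show(prob, lo=1.0, hi=0.0):
    # with the inverted defaults the window is empty: show nothing
    return lo <= prob <= hi
```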
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** timtest.py 9 Sep 2002 20:37:14 -0000 1.17
--- timtest.py 10 Sep 2002 00:06:37 -0000 1.18
***************
*** 5,9 ****
# rates.py and cmp.py for summarizing results.
! """Usage: %(program)s [options]
Where:
--- 5,9 ----
# rates.py and cmp.py for summarizing results.
! """Usage: %(program)s [-h] -n nsets
Where:
***************
*** 27,31 ****
import classifier
from tokenizer import tokenize
! import Options
def usage(code, msg=''):
--- 27,33 ----
import classifier
from tokenizer import tokenize
! from Options import options
!
! program = sys.argv[0]
def usage(code, msg=''):
***************
*** 145,154 ****
class Driver:
! def __init__(self, nbuckets=40):
! self.nbuckets = nbuckets
self.falsepos = Set()
self.falseneg = Set()
! self.global_ham_hist = Hist(self.nbuckets)
! self.global_spam_hist = Hist(self.nbuckets)
def train(self, ham, spam):
--- 147,155 ----
class Driver:
! def __init__(self):
self.falsepos = Set()
self.falseneg = Set()
! self.global_ham_hist = Hist(options.nbuckets)
! self.global_spam_hist = Hist(options.nbuckets)
def train(self, ham, spam):
***************
*** 160,165 ****
print t.nham, "hams &", t.nspam, "spams"
! self.trained_ham_hist = Hist(self.nbuckets)
! self.trained_spam_hist = Hist(self.nbuckets)
#f = file('w.pik', 'wb')
--- 161,166 ----
print t.nham, "hams &", t.nspam, "spams"
! self.trained_ham_hist = Hist(options.nbuckets)
! self.trained_spam_hist = Hist(options.nbuckets)
#f = file('w.pik', 'wb')
***************
*** 170,195 ****
def finishtest(self):
! printhist("all in this training set:",
! self.trained_ham_hist, self.trained_spam_hist)
self.global_ham_hist += self.trained_ham_hist
self.global_spam_hist += self.trained_spam_hist
def alldone(self):
! printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
def test(self, ham, spam, charlimit=None):
c = self.classifier
t = self.tester
! local_ham_hist = Hist(self.nbuckets)
! local_spam_hist = Hist(self.nbuckets)
! def new_ham(msg, prob):
local_ham_hist.add(prob)
! def new_spam(msg, prob):
local_spam_hist.add(prob)
! if prob < 0.1:
print
! print "Low prob spam!", prob
prob, clues = c.spamprob(msg, True)
printmsg(msg, prob, clues, charlimit)
--- 171,205 ----
def finishtest(self):
! if options.show_histograms:
! printhist("all in this training set:",
! self.trained_ham_hist, self.trained_spam_hist)
self.global_ham_hist += self.trained_ham_hist
self.global_spam_hist += self.trained_spam_hist
def alldone(self):
! if options.show_histograms:
! printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
def test(self, ham, spam, charlimit=None):
c = self.classifier
t = self.tester
! local_ham_hist = Hist(options.nbuckets)
! local_spam_hist = Hist(options.nbuckets)
! def new_ham(msg, prob, lo=options.show_ham_lo,
! hi=options.show_ham_hi):
local_ham_hist.add(prob)
+ if lo <= prob <= hi:
+ print
+ print "Ham with prob =", prob
+ prob, clues = c.spamprob(msg, True)
+ printmsg(msg, prob, clues, charlimit)
! def new_spam(msg, prob, lo=options.show_spam_lo,
! hi=options.show_spam_hi):
local_spam_hist.add(prob)
! if lo <= prob <= hi:
print
! print "Spam with prob =", prob
prob, clues = c.spamprob(msg, True)
printmsg(msg, prob, clues, charlimit)
***************
*** 207,210 ****
--- 217,222 ----
self.falsepos |= newfpos
print " new false positives:", [e.tag for e in newfpos]
+ if not options.show_false_positives:
+ newfpos = ()
for e in newfpos:
print '*' * 78
***************
*** 215,244 ****
self.falseneg |= newfneg
print " new false negatives:", [e.tag for e in newfneg]
! for e in []:#newfneg:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues, 1000)
! print
! print " best discriminators:"
! stats = [(-1, None) for i in range(30)]
! smallest_killcount = -1
! for w, r in c.wordinfo.iteritems():
! if r.killcount > smallest_killcount:
! heapreplace(stats, (r.killcount, w))
! smallest_killcount = stats[0][0]
! stats.sort()
! for count, w in stats:
! if count < 0:
! continue
! r = c.wordinfo[w]
! print " %r %d %g" % (w, r.killcount, r.spamprob)
! printhist("this pair:", local_ham_hist, local_spam_hist)
self.trained_ham_hist += local_ham_hist
self.trained_spam_hist += local_spam_hist
def drive(nsets):
! print Options.options.display()
spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)]
--- 227,260 ----
self.falseneg |= newfneg
print " new false negatives:", [e.tag for e in newfneg]
! if not options.show_false_negatives:
! newfneg = ()
! for e in newfneg:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues, 1000)
! if options.show_best_discriminators:
! print
! print " best discriminators:"
! stats = [(-1, None) for i in range(30)]
! smallest_killcount = -1
! for w, r in c.wordinfo.iteritems():
! if r.killcount > smallest_killcount:
! heapreplace(stats, (r.killcount, w))
! smallest_killcount = stats[0][0]
! stats.sort()
! for count, w in stats:
! if count < 0:
! continue
! r = c.wordinfo[w]
! print " %r %d %g" % (w, r.killcount, r.spamprob)
! if options.show_histograms:
! printhist("this pair:", local_ham_hist, local_spam_hist)
self.trained_ham_hist += local_ham_hist
self.trained_spam_hist += local_spam_hist
def drive(nsets):
! print options.display()
spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)]
***************
*** 273,277 ****
if args:
usage(1, "Positional arguments not supported")
! Options.options.mergefiles(['bayescustomize.ini'])
drive(nsets)
--- 289,295 ----
if args:
usage(1, "Positional arguments not supported")
+ if nsets is None:
+ usage(1, "-n is required")
! options.mergefiles(['bayescustomize.ini'])
drive(nsets)
From tim_one@users.sourceforge.net Tue Sep 10 02:53:14 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 09 Sep 2002 18:53:14 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.4,1.5
timtest.py,1.18,1.19
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9100
Modified Files:
Options.py timtest.py
Log Message:
Screwed my head on straight: Options should take care of merging in
bayescustomize.ini rather than making every client muck with it.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** Options.py 10 Sep 2002 00:06:36 -0000 1.4
--- Options.py 10 Sep 2002 01:53:12 -0000 1.5
***************
*** 65,67 ****
options = OptionsClass()
! options.mergefiles(['bayes.ini'])
--- 65,67 ----
options = OptionsClass()
! options.mergefiles(['bayes.ini', 'bayescustomize.ini'])
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** timtest.py 10 Sep 2002 00:06:37 -0000 1.18
--- timtest.py 10 Sep 2002 01:53:12 -0000 1.19
***************
*** 292,295 ****
usage(1, "-n is required")
- options.mergefiles(['bayescustomize.ini'])
drive(nsets)
--- 292,294 ----
From tim_one@users.sourceforge.net Tue Sep 10 17:02:45 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 10 Sep 2002 09:02:45 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.5,1.6 bayes.ini,1.2,1.3 tokenizer.py,1.13,1.14
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30619
Modified Files:
Options.py bayes.ini tokenizer.py
Log Message:
Added option Tokenizer/count_all_header_lines. Defaults to False. You
can override by creating a bayescustomize.ini. When True, the
safe_headers option is ignored and Anthony's code to count *all* header
lines is used instead. This is almost certainly a Good Thing to do if
your ham and spam come from the same source, and almost certainly a
Bad Thing to do if they're from different sources (too many clues about
the source are likely to appear in the header-line counts).
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** Options.py 10 Sep 2002 01:53:12 -0000 1.5
--- Options.py 10 Sep 2002 16:02:40 -0000 1.6
***************
*** 18,21 ****
--- 18,22 ----
'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
'safe_headers': ('get', lambda s: Set(s.split())),
+ 'count_all_header_lines': boolean_cracker,
},
'TestDriver': {'nbuckets': int_cracker,
***************
*** 28,32 ****
'show_histograms': boolean_cracker,
'show_best_discriminators': boolean_cracker,
! }
}
--- 29,33 ----
'show_histograms': boolean_cracker,
'show_best_discriminators': boolean_cracker,
! },
}
Index: bayes.ini
===================================================================
RCS file: /cvsroot/spambayes/spambayes/bayes.ini,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** bayes.ini 10 Sep 2002 00:06:37 -0000 1.2
--- bayes.ini 10 Sep 2002 16:02:41 -0000 1.3
***************
*** 5,14 ****
retain_pure_html_tags: False
! # tokenizer.Tokenizer.tokenize_headers() generates tokens just counting
! # the number of instances of the headers in this set, in a case-sensitive
! # way. Depending on data collection, some headers aren't safe to count.
# For example, if ham is collected from a mailing list but spam from your
# regular inbox traffic, the presence of a header like List-Info will be a
! # very strong ham clue, but a bogus one.
safe_headers: abuse-reports-to
date
--- 5,22 ----
retain_pure_html_tags: False
! # Generate tokens just counting the number of instances of each kind of
! # header line, in a case-sensitive way.
! #
! # Depending on data collection, some headers aren't safe to count.
# For example, if ham is collected from a mailing list but spam from your
# regular inbox traffic, the presence of a header like List-Info will be a
! # very strong ham clue, but a bogus one. In that case, set
! # count_all_header_lines to False, and adjust safe_headers instead.
!
! count_all_header_lines: False
!
! # Like count_all_header_lines, but restricted to headers in this list.
! # safe_headers is ignored when count_all_header_lines is true.
!
safe_headers: abuse-reports-to
date
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** tokenizer.py 9 Sep 2002 20:37:14 -0000 1.13
--- tokenizer.py 10 Sep 2002 16:02:41 -0000 1.14
***************
*** 839,870 ****
yield 'received:' + tok
! # XXX Following is a great idea due to Anthony Baxter. I can't use it
! # XXX on my test data because the header lines are so different between
! # XXX my ham and spam that it makes a large improvement for bogus
! # XXX reasons. So it's commented out. But it's clearly a good thing
! # XXX to do on "normal" data, and subsumes the Organization trick above
! # XXX in a much more general way, yet at comparable cost.
!
! # X-UIDL:
! # Anthony Baxter's idea. This has spamprob 0.99! The value
! # is clearly irrelevant, just the presence or absence matters.
! # However, it's extremely rare in my spam sets, so doesn't
! # have much value.
! #
! # As also suggested by Anthony, we can capture all such header
! # oddities just by generating tags for the count of how many
! # times each header field appears.
! ##x2n = {}
! ##for x in msg.keys():
! ## x2n[x] = x2n.get(x, 0) + 1
! ##for x in x2n.items():
! ## yield "header:%s:%d" % x
!
! # Do a "safe" approximation to that for now.
! safe_headers = options.safe_headers
x2n = {}
! for x in msg.keys():
! if x.lower() in safe_headers:
x2n[x] = x2n.get(x, 0) + 1
for x in x2n.items():
yield "header:%s:%d" % x
--- 839,859 ----
yield 'received:' + tok
! # As suggested by Anthony Baxter, merely counting the number of
! # header lines, and in a case-sensitive way, has really value.
! # For example, all-caps SUBJECT is a strong spam clue, while
! # X-Complaints-To a strong ham clue.
x2n = {}
! if options.count_all_header_lines:
! for x in msg.keys():
x2n[x] = x2n.get(x, 0) + 1
+ else:
+ # Do a "safe" approximation to that. When spam and ham are
+ # collected from different sources, the count of some header
+ # lines can be too strong a discriminator for accidental
+ # reasons.
+ safe_headers = options.safe_headers
+ for x in msg.keys():
+ if x.lower() in safe_headers:
+ x2n[x] = x2n.get(x, 0) + 1
for x in x2n.items():
yield "header:%s:%d" % x
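The header-counting loop above reduces to a small function. A Python 3 sketch (taking a list of header names rather than a Message object, for illustration):

```python
def header_count_tokens(header_names, safe_headers=None):
    """Build 'header:<Name>:<count>' tokens, optionally restricted to a safe set.

    Counting is case-sensitive on the header name itself, so e.g. SUBJECT
    and Subject produce distinct tokens; only the safe-set membership test
    is case-insensitive, matching the tokenizer's behavior.
    """
    x2n = {}
    for name in header_names:
        if safe_headers is None or name.lower() in safe_headers:
            x2n[name] = x2n.get(name, 0) + 1
    return ["header:%s:%d" % item for item in sorted(x2n.items())]
```

Passing safe_headers=None corresponds to count_all_header_lines: True; passing the safe set corresponds to the default.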
From tim_one@users.sourceforge.net Tue Sep 10 19:03:39 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 10 Sep 2002 11:03:39 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.6,1.7 bayes.ini,1.3,NONE
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15128
Modified Files:
Options.py
Removed Files:
bayes.ini
Log Message:
Removed bayes.ini from the project and embedded its contents in Options.py.
This way search-path issues can't stop the correct defaults from getting
set, and people are forced to use the intended bayescustomize.ini for
customization instead of fiddling bayes.ini.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** Options.py 10 Sep 2002 16:02:40 -0000 1.6
--- Options.py 10 Sep 2002 18:03:27 -0000 1.7
***************
*** 11,14 ****
--- 11,74 ----
__all__ = ['options']
+ defaults = """
+ [Tokenizer]
+ # By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags
+ # from pure text/html messages. Set to True to retain HTML tags in
+ # this case.
+ retain_pure_html_tags: False
+
+ # Generate tokens just counting the number of instances of each kind of
+ # header line, in a case-sensitive way.
+ #
+ # Depending on data collection, some headers aren't safe to count.
+ # For example, if ham is collected from a mailing list but spam from your
+ # regular inbox traffic, the presence of a header like List-Info will be a
+ # very strong ham clue, but a bogus one. In that case, set
+ # count_all_header_lines to False, and adjust safe_headers instead.
+
+ count_all_header_lines: False
+
+ # Like count_all_header_lines, but restricted to headers in this list.
+ # safe_headers is ignored when count_all_header_lines is true.
+
+ safe_headers: abuse-reports-to
+ date
+ errors-to
+ from
+ importance
+ in-reply-to
+ message-id
+ mime-version
+ organization
+ received
+ reply-to
+ return-path
+ subject
+ to
+ user-agent
+ x-abuse-info
+ x-complaints-to
+ x-face
+
+ [TestDriver]
+ # These control various displays in class Drive (timtest.py).
+
+ # Number of buckets in histograms.
+ nbuckets: 40
+ show_histograms: True
+
+ # Display spam when
+ # show_spam_lo <= spamprob <= show_spam_hi
+ # and likewise for ham. The defaults here don't show anything.
+ show_spam_lo: 1.0
+ show_spam_hi: 0.0
+ show_ham_lo: 1.0
+ show_ham_hi: 0.0
+
+ show_false_positives: True
+ show_false_negatives: False
+ show_best_discriminators: True
+ """
+
int_cracker = ('getint', None)
float_cracker = ('getfloat', None)
***************
*** 40,49 ****
def mergefiles(self, fnamelist):
! c = self._config
! c.read(fnamelist)
for section in c.sections():
if section not in all_options:
_warn("config file has unknown section %r" % section)
continue
goodopts = all_options[section]
--- 100,117 ----
def mergefiles(self, fnamelist):
! self._config.read(fnamelist)
! self._update()
!
! def mergefilelike(self, filelike):
! self._config.readfp(filelike)
! self._update()
+ def _update(self):
+ nerrors = 0
+ c = self._config
for section in c.sections():
if section not in all_options:
_warn("config file has unknown section %r" % section)
+ nerrors += 1
continue
goodopts = all_options[section]
***************
*** 52,55 ****
--- 120,124 ----
_warn("config file has unknown option %r in "
"section %r" % (option, section))
+ nerrors += 1
continue
fetcher, converter = goodopts[option]
***************
*** 58,61 ****
--- 127,132 ----
value = converter(value)
setattr(options, option, value)
+ if nerrors:
+ raise ValueError("errors while parsing .ini file")
def display(self):
***************
*** 66,68 ****
options = OptionsClass()
! options.mergefiles(['bayes.ini', 'bayescustomize.ini'])
--- 137,144 ----
options = OptionsClass()
!
! d = StringIO.StringIO(defaults)
! options.mergefilelike(d)
! del d
!
! options.mergefiles(['bayescustomize.ini'])
--- bayes.ini DELETED ---
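Embedding the defaults in the module and layering the user's ini on top works the same way with Python 3's configparser: read_string for the built-in defaults (no search-path issues possible), then read for optional overrides. A sketch under those assumptions:

```python
import configparser

DEFAULTS = """
[TestDriver]
nbuckets = 40
show_histograms = True
"""

def load_options(custom_files=("bayescustomize.ini",)):
    """Built-in defaults first; any override files that exist win."""
    c = configparser.ConfigParser()
    c.read_string(DEFAULTS)   # always succeeds: defaults ship in the module
    c.read(custom_files)      # read() silently skips missing files
    return c
```

This is the same two-layer scheme as the checkin: deleting bayes.ini from disk can no longer break the defaults, and users are funneled to the customize file.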
From tim_one@users.sourceforge.net Tue Sep 10 19:16:42 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 10 Sep 2002 11:16:42 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.7,1.8
tokenizer.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18467
Modified Files:
Options.py tokenizer.py
Log Message:
tokenize_headers(): Updated some comments.
Added new Tokenizer/mine_received_headers bool option to enable Neil
Schemenauer's special processing of Received: headers.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** Options.py 10 Sep 2002 18:03:27 -0000 1.7
--- Options.py 10 Sep 2002 18:15:48 -0000 1.8
***************
*** 51,54 ****
--- 51,59 ----
x-face
+ # A lot of clues can be gotten from IP addresses and names in Received:
+ # headers. Again this can give spectacular results for bogus reasons
+ # if your test corpora are from different sources. Else set this to true.
+ mine_received_headers: False
+
[TestDriver]
# These control various displays in class Drive (timtest.py).
***************
*** 79,82 ****
--- 84,88 ----
'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
+ 'mine_received_headers': boolean_cracker,
},
'TestDriver': {'nbuckets': int_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** tokenizer.py 10 Sep 2002 16:02:41 -0000 1.14
--- tokenizer.py 10 Sep 2002 18:15:49 -0000 1.15
***************
*** 783,786 ****
--- 783,788 ----
# XXX some "safe" header lines are included here, where "safe"
# XXX is specific to my sorry corpora.
+ # XXX Jeremy Hylton also reported good results from the general
+ # XXX header-mining in mboxtest.MyTokenizer.tokenize_headers.
# Content-{Type, Disposition} and their params, and charsets.
***************
*** 815,823 ****
# X-Mailer: This is a pure and significant win for the f-n rate; f-p
# rate isn't affected.
- # User-Agent: Skipping it, as it made no difference. Very few spams
- # had a User-Agent field, but lots of hams didn't either,
- # and the spam probability of User-Agent was very close to
- # 0.5 (== not a valuable discriminator) across all
- # training sets.
for field in ('x-mailer',):
prefix = field + ':'
--- 817,820 ----
***************
*** 826,834 ****
# Received:
! # Neil Schemenauer reported good results from tokenizing prefixes
! # of the embedded IP addresses.
! # XXX This is disabled only because it's "too good" when used on
! # XXX Tim's mixed-source corpora.
! if 0:
for header in msg.get_all("received", ()):
for pat, breakdown in [(received_host_re, breakdown_host),
--- 823,828 ----
# Received:
! # Neil Schemenauer reports good results from this.
! if options.mine_received_headers:
for header in msg.get_all("received", ()):
for pat, breakdown in [(received_host_re, breakdown_host),
***************
*** 840,844 ****
# As suggested by Anthony Baxter, merely counting the number of
! # header lines, and in a case-sensitive way, has really value.
# For example, all-caps SUBJECT is a strong spam clue, while
# X-Complaints-To a strong ham clue.
--- 834,838 ----
# As suggested by Anthony Baxter, merely counting the number of
! # header lines, and in a case-sensitive way, has real value.
# For example, all-caps SUBJECT is a strong spam clue, while
# X-Complaints-To a strong ham clue.
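The Received:-header mining this checkin enables tokenizes prefixes of the embedded IP addresses, so the classifier can learn from whole networks as well as exact hosts. A minimal Python 3 sketch of the idea (the `received:` token prefix and the regex here are illustrative stand-ins, not the project's actual breakdown_host logic):

```python
import re

# Hypothetical sketch: emit coarse-to-fine prefixes of every IPv4
# address found in a Received: header.
_ip_re = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

def mine_received(header):
    tokens = []
    for ip in _ip_re.findall(header):
        octets = ip.split('.')
        for i in range(1, 5):
            # '192', '192.168', '192.168.10', '192.168.10.5'
            tokens.append("received:" + '.'.join(octets[:i]))
    return tokens

print(mine_received("from mail.example.com ([192.168.10.5]) by mx1"))
```

As the comment in the diff warns, this can score spectacularly well for bogus reasons when the ham and spam corpora come from different mail servers, which is why it is off by default.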
From tim_one@users.sourceforge.net Wed Sep 11 01:22:59 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 10 Sep 2002 17:22:59 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.8,1.9
timtest.py,1.19,1.20
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26264
Modified Files:
Options.py timtest.py
Log Message:
Added options
[TestDriver]
save_trained_pickles: False
pickle_basename: class
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** Options.py 10 Sep 2002 18:15:48 -0000 1.8
--- Options.py 11 Sep 2002 00:22:56 -0000 1.9
***************
*** 57,61 ****
[TestDriver]
! # These control various displays in class Drive (timtest.py).
# Number of buckets in histograms.
--- 57,61 ----
[TestDriver]
! # These control various displays in class Driver (timtest.py).
# Number of buckets in histograms.
***************
*** 74,77 ****
--- 74,88 ----
show_false_negatives: False
show_best_discriminators: True
+
+ # If save_trained_pickles is true, Driver.train() saves a binary pickle
+ # of the classifier after training. The file basename is given by
+ # pickle_basename, the extension is .pik, and increasing integers are
+ # appended to pickle_basename. By default (if save_trained_pickles is
+ # true), the filenames are class1.pik, class2.pik, ... If a file of that
+ # name already exists, it's overwritten. pickle_basename is ignored when
+ # save_trained_pickles is false.
+
+ save_trained_pickles: False
+ pickle_basename: class
"""
***************
*** 79,82 ****
--- 90,94 ----
float_cracker = ('getfloat', None)
boolean_cracker = ('getboolean', bool)
+ string_cracker = ('get', None)
all_options = {
***************
*** 95,98 ****
--- 107,112 ----
'show_histograms': boolean_cracker,
'show_best_discriminators': boolean_cracker,
+ 'save_trained_pickles': boolean_cracker,
+ 'pickle_basename': string_cracker,
},
}
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** timtest.py 10 Sep 2002 01:53:12 -0000 1.19
--- timtest.py 11 Sep 2002 00:22:56 -0000 1.20
***************
*** 152,155 ****
--- 152,156 ----
self.global_ham_hist = Hist(options.nbuckets)
self.global_spam_hist = Hist(options.nbuckets)
+ self.ntimes_train_called = 0
def train(self, ham, spam):
***************
*** 164,172 ****
self.trained_spam_hist = Hist(options.nbuckets)
! #f = file('w.pik', 'wb')
! #pickle.dump(self.classifier, f, 1)
! #f.close()
! #import sys
! #sys.exit(0)
def finishtest(self):
--- 165,176 ----
self.trained_spam_hist = Hist(options.nbuckets)
! self.ntimes_train_called += 1
! if options.save_trained_pickles:
! fname = "%s%d.pik" % (options.pickle_basename,
! self.ntimes_train_called)
! print " saving pickle to", fname
! fp = file(fname, 'wb')
! pickle.dump(self.classifier, fp, 1)
! fp.close()
def finishtest(self):
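The save_trained_pickles behavior added here is simple: each call to train() dumps a pickle of the classifier to `<pickle_basename><N>.pik`, where N counts the calls. A self-contained Python 3 sketch (the Driver class below is a stand-in for illustration, not the real timtest.Driver):

```python
import os
import pickle
import tempfile

# Illustrative sketch of save_trained_pickles: the Nth call to train()
# writes classN.pik (overwriting any existing file of that name).
class Driver:
    def __init__(self, basename):
        self.basename = basename
        self.ntimes_train_called = 0

    def train(self, classifier):
        # ... training would happen here ...
        self.ntimes_train_called += 1
        fname = "%s%d.pik" % (self.basename, self.ntimes_train_called)
        with open(fname, "wb") as fp:
            pickle.dump(classifier, fp, 1)
        return fname

tmpdir = tempfile.mkdtemp()
d = Driver(os.path.join(tmpdir, "class"))
print(d.train({"spam": 0.9}))   # ends with class1.pik
print(d.train({"spam": 0.8}))   # ends with class2.pik
```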
From rubiconx@users.sourceforge.net Wed Sep 11 07:21:25 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Tue, 10 Sep 2002 23:21:25 -0700
Subject: [Spambayes-checkins] spambayes cdb.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9505
Modified Files:
cdb.py
Log Message:
Added some more dict-like methods to the Cdb class, and a cdb_dump
function that generates output identical to djb's cdbdump program.
Index: cdb.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cdb.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** cdb.py 9 Sep 2002 21:21:54 -0000 1.1
--- cdb.py 11 Sep 2002 06:21:22 -0000 1.2
***************
*** 1,2 ****
--- 1,3 ----
+ #! /usr/bin/env python
"""
Dan Bernstein's CDB implemented in Python
***************
*** 28,34 ****
--- 29,37 ----
def __init__(self, fp):
+ self.fp = fp
fd = fp.fileno()
self.size = os.fstat(fd).st_size
self.map = mmap.mmap(fd, self.size, access=mmap.ACCESS_READ)
+ self.eod = uint32_unpack(self.map[:4])
self.findstart()
self.loop = 0 # number of hash slots searched under this key
***************
*** 44,47 ****
--- 47,92 ----
self.map.close()
+ def __iter__(self, fn=None):
+ len = 2048
+ ret = []
+ while len < self.eod:
+ klen, vlen = struct.unpack("<II", self.map[len:len+8])
+ len += 8
+ key = self.map[len:len+klen]
+ len += klen
+ val = self.map[len:len+vlen]
+ len += vlen
+ if fn:
+ yield fn(key, val)
+ else:
+ yield (key, val)
+ def iteritems(self):
+ return self.__iter__()
+ def iterkeys(self):
+ return self.__iter__(lambda k,v: k)
+ def itervalues(self):
+ return self.__iter__(lambda k,v: v)
+ def cdb_dump(infile):
+ """Dump a database in djb's cdbdump format"""
+ db = Cdb(open(infile, 'rb'))
+ for key, value in db.iteritems():
+ print "+%d,%d:%s->%s" % (len(key), len(value), key, value)
+ print
def cdb_make(outfile, items):
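The region these new methods walk is djb's cdb data area: records start at byte 2048 and are laid out as (klen, vlen, key, value) with little-endian 32-bit lengths, up to the end-of-data offset stored in the header. A self-contained Python 3 sketch of that walk (assuming `data` is the file contents and `eod` the end-of-data offset, as in the Cdb class above):

```python
import struct

# Walk cdb records: each is <klen><vlen><key bytes><value bytes>,
# lengths little-endian uint32, starting right after the 2048-byte
# hash-table header.
def iter_records(data, eod):
    pos = 2048
    while pos < eod:
        klen, vlen = struct.unpack("<II", data[pos:pos + 8])
        pos += 8
        key = data[pos:pos + klen]
        pos += klen
        val = data[pos:pos + vlen]
        pos += vlen
        yield key, val

# Build a fake data region with one record and dump it in cdbdump style:
data = b"\x00" * 2048 + struct.pack("<II", 3, 5) + b"foo" + b"hello"
for key, val in iter_records(data, len(data)):
    print("+%d,%d:%s->%s" % (len(key), len(val), key.decode(), val.decode()))
```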
From rubiconx@users.sourceforge.net Wed Sep 11 07:58:06 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Tue, 10 Sep 2002 23:58:06 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.15,1.16
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31437
Modified Files:
tokenizer.py
Log Message:
textparts() now makes a copy of payloads. This keeps the tokenizer
from fouling up the message object's payload(s).
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** tokenizer.py 10 Sep 2002 18:15:49 -0000 1.15
--- tokenizer.py 11 Sep 2002 06:58:03 -0000 1.16
***************
*** 506,510 ****
# part to redundant_html.
htmlpart = textpart = None
! stack = part.get_payload()
while stack:
subpart = stack.pop()
--- 506,510 ----
# part to redundant_html.
htmlpart = textpart = None
! stack = part.get_payload()[:]
while stack:
subpart = stack.pop()
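The fix in this checkin is just `[:]` (a shallow copy), but it matters: the part-walking loop pops entries off its stack, and without the copy it would pop them off the message object's real payload list, emptying the message as a side effect of tokenizing it. A Python 3 illustration of the difference:

```python
payload = ["text part", "html part"]

stack = payload          # no copy: both names share one list
while stack:
    stack.pop()
print(payload)           # [] -- the payload was destroyed

payload = ["text part", "html part"]
stack = payload[:]       # shallow copy: pops drain only the copy
while stack:
    stack.pop()
print(payload)           # ['text part', 'html part'] -- payload intact
```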
From tim_one@users.sourceforge.net Thu Sep 12 01:16:09 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 11 Sep 2002 17:16:09 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.16,1.17
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2055
Modified Files:
tokenizer.py
Log Message:
Added code to strip uuencoded sections. As reported on the mailing list,
this has no effect on my results, except that one spam is now judged as
ham by all the other training sets. It shrinks the database size by a
few percent, so that makes it a tiny win. If Anthony Baxter doesn't
report better results on his data, I'll be sorely tempted to throw this
out again.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** tokenizer.py 11 Sep 2002 06:58:03 -0000 1.16
--- tokenizer.py 12 Sep 2002 00:16:07 -0000 1.17
***************
*** 747,750 ****
--- 747,787 ----
yield '.'.join(parts[:i])
+ uuencode_begin_re = re.compile(r"""
+ ^begin \s+
+ (\S+) \s+ # capture mode
+ (\S+) \s* # capture filename
+ $
+ """, re.VERBOSE | re.MULTILINE)
+
+ uuencode_end_re = re.compile(r"^end\s*\n", re.MULTILINE)
+
+ # Strip out uuencoded sections and produce tokens. The return value
+ # is (new_text, sequence_of_tokens), where new_text no longer contains
+ # uuencoded stuff. Note that we're not bothering to decode it! Maybe
+ # we should.
+ def crack_uuencode(text):
+ new_text = []
+ tokens = []
+ i = 0
+ while True:
+ # Invariant: Through text[:i], all non-uuencoded text is in
+ # new_text, and tokens contains summary clues for all uuencoded
+ # portions. text[i:] hasn't been looked at yet.
+ m = uuencode_begin_re.search(text, i)
+ if not m:
+ new_text.append(text[i:])
+ break
+ start, end = m.span()
+ new_text.append(text[i : start])
+ mode, fname = m.groups()
+ tokens.append('uuencode mode:%s' % mode)
+ tokens.extend(['uuencode:%s' % x for x in crack_filename(fname)])
+ m = uuencode_end_re.search(text, end)
+ if not m:
+ break
+ i = m.end()
+
+ return ''.join(new_text), tokens
+
class Tokenizer:
***************
*** 881,884 ****
--- 918,926 ----
# Normalize case.
text = text.lower()
+
+ # Get rid of uuencoded sections.
+ text, tokens = crack_uuencode(text)
+ for t in tokens:
+ yield t
# Special tagging of embedded URLs.
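A condensed Python 3 sketch of what crack_uuencode does: cut each "begin mode filename" ... "end" section out of the text and replace it with summary tokens, without bothering to decode the payload. The regexes below are simplified from the diff above, and the `uuencode fname:` token is an illustrative stand-in for the crack_filename tokens the real code emits:

```python
import re

begin_re = re.compile(r"^begin\s+(\S+)\s+(\S+)\s*$", re.MULTILINE)
end_re = re.compile(r"^end\s*\n", re.MULTILINE)

def crack_uuencode(text):
    new_text, tokens, i = [], [], 0
    while True:
        # Invariant: all non-uuencoded text in text[:i] is in new_text.
        m = begin_re.search(text, i)
        if not m:
            new_text.append(text[i:])
            break
        new_text.append(text[i:m.start()])
        mode, fname = m.groups()
        tokens.append("uuencode mode:%s" % mode)
        tokens.append("uuencode fname:%s" % fname)
        m2 = end_re.search(text, m.end())
        if not m2:
            break        # unterminated section: drop the rest
        i = m2.end()
    return "".join(new_text), tokens

text = "hello\nbegin 644 evil.exe\nM9@``\nend\nworld\n"
print(crack_uuencode(text))
```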
From tim_one@users.sourceforge.net Thu Sep 12 03:46:17 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 11 Sep 2002 19:46:17 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.9,1.10
mboxtest.py,1.2,1.3 timtest.py,1.20,1.21 tokenizer.py,1.17,1.18
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5892
Modified Files:
Options.py mboxtest.py timtest.py tokenizer.py
Log Message:
Added option TestDriver/show_charlimit to put a bound on the length
of displayed msgs. Default is 5000. The similar cmdline option to
mboxtest has gone away.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** Options.py 11 Sep 2002 00:22:56 -0000 1.9
--- Options.py 12 Sep 2002 02:46:15 -0000 1.10
***************
*** 75,78 ****
--- 75,82 ----
show_best_discriminators: True
+ # The maximum # of characters to display for a msg displayed due to the
+ # show_xyz options above.
+ show_charlimit: 3000
+
# If save_trained_pickles is true, Driver.train() saves a binary pickle
# of the classifier after training. The file basename is given by
***************
*** 109,112 ****
--- 113,117 ----
'save_trained_pickles': boolean_cracker,
'pickle_basename': string_cracker,
+ 'show_charlimit': int_cracker,
},
}
Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** mboxtest.py 7 Sep 2002 16:17:19 -0000 1.2
--- mboxtest.py 12 Sep 2002 02:46:15 -0000 1.3
***************
*** 17,23 ****
-m MSGS
Read no more than MSGS messages from mailbox.
-
- -l LIMIT
- Print no more than LIMIT characters of a message in test output.
"""
--- 17,20 ----
***************
*** 137,142 ****
SEED = 101
MAXMSGS = None
! CHARLIMIT = 1000
! opts, args = getopt.getopt(args, "f:n:s:l:m:")
for k, v in opts:
if k == '-f':
--- 134,138 ----
SEED = 101
MAXMSGS = None
! opts, args = getopt.getopt(args, "f:n:s:m:")
for k, v in opts:
if k == '-f':
***************
*** 146,151 ****
if k == '-s':
SEED = int(v)
- if k == '-l':
- CHARLIMIT = int(v)
if k == '-m':
MAXMSGS = int(v)
--- 142,145 ----
***************
*** 177,181 ****
if (iham, ispam) == (ihtest, istest):
continue
! driver.test(mbox(ham, ihtest), mbox(spam, istest), CHARLIMIT)
driver.finishtest()
driver.alldone()
--- 171,175 ----
if (iham, ispam) == (ihtest, istest):
continue
! driver.test(mbox(ham, ihtest), mbox(spam, istest))
driver.finishtest()
driver.alldone()
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** timtest.py 11 Sep 2002 00:22:56 -0000 1.20
--- timtest.py 12 Sep 2002 02:46:15 -0000 1.21
***************
*** 81,85 ****
spam.display()
! def printmsg(msg, prob, clues, charlimit=None):
print msg.tag
print "prob =", prob
--- 81,85 ----
spam.display()
! def printmsg(msg, prob, clues):
print msg.tag
print "prob =", prob
***************
*** 88,93 ****
print
guts = str(msg)
! if charlimit is not None:
! guts = guts[:charlimit]
print guts
--- 88,93 ----
print
guts = str(msg)
! if options.show_charlimit > 0:
! guts = guts[:options.show_charlimit]
print guts
***************
*** 185,189 ****
printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
! def test(self, ham, spam, charlimit=None):
c = self.classifier
t = self.tester
--- 185,189 ----
printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
! def test(self, ham, spam):
c = self.classifier
t = self.tester
***************
*** 198,202 ****
print "Ham with prob =", prob
prob, clues = c.spamprob(msg, True)
! printmsg(msg, prob, clues, charlimit)
def new_spam(msg, prob, lo=options.show_spam_lo,
--- 198,202 ----
print "Ham with prob =", prob
prob, clues = c.spamprob(msg, True)
! printmsg(msg, prob, clues)
def new_spam(msg, prob, lo=options.show_spam_lo,
***************
*** 207,211 ****
print "Spam with prob =", prob
prob, clues = c.spamprob(msg, True)
! printmsg(msg, prob, clues, charlimit)
t.reset_test_results()
--- 207,211 ----
print "Spam with prob =", prob
prob, clues = c.spamprob(msg, True)
! printmsg(msg, prob, clues)
t.reset_test_results()
***************
*** 226,230 ****
print '*' * 78
prob, clues = c.spamprob(e, True)
! printmsg(e, prob, clues, charlimit)
newfneg = Set(t.false_negatives()) - self.falseneg
--- 226,230 ----
print '*' * 78
prob, clues = c.spamprob(e, True)
! printmsg(e, prob, clues)
newfneg = Set(t.false_negatives()) - self.falseneg
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** tokenizer.py 12 Sep 2002 00:16:07 -0000 1.17
--- tokenizer.py 12 Sep 2002 02:46:15 -0000 1.18
***************
*** 613,617 ****
for i in xrange(n-4):
yield "5g:" + word[i : i+5]
!
else:
# It's a long string of "normal" chars. Ignore it.
--- 613,634 ----
for i in xrange(n-4):
yield "5g:" + word[i : i+5]
! """
! # If there are any high-bit chars, tokenize it as byte 3-grams.
! # XXX This really won't work for high-bit languages -- the scoring
! # XXX scheme throws almost everything away, and one bad phrase can
! # XXX generate enough bad 3-grams to dominate the final score.
! # XXX This also increases the database size substantially.
! elif has_highbit_char(word):
! counthi = 0
! ch1 = ch2 = ''
! for ch in word:
! if ord(ch) >= 128:
! counthi += 1
! yield "3g:%s" % (ch1 + ch2 + ch)
! ch1 = ch2
! ch2 = ch
! ratio = round(counthi * 20.0 / len(word)) * 5
! yield "8bit%%:%d" % ratio
! """
else:
# It's a long string of "normal" chars. Ignore it.
From tim_one@users.sourceforge.net Thu Sep 12 03:58:04 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 11 Sep 2002 19:58:04 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.21,1.22
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8497
Modified Files:
timtest.py
Log Message:
Missed a call to printmsg that was still passing a charlimit
(show_charlimit is an option now).
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** timtest.py 12 Sep 2002 02:46:15 -0000 1.21
--- timtest.py 12 Sep 2002 02:58:02 -0000 1.22
***************
*** 236,240 ****
print '*' * 78
prob, clues = c.spamprob(e, True)
! printmsg(e, prob, clues, 1000)
if options.show_best_discriminators:
--- 236,240 ----
print '*' * 78
prob, clues = c.spamprob(e, True)
! printmsg(e, prob, clues)
if options.show_best_discriminators:
From tim_one@users.sourceforge.net Thu Sep 12 05:19:41 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 11 Sep 2002 21:19:41 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.18,1.19
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26685
Modified Files:
tokenizer.py
Log Message:
Two things:
1) Gave up on 5-gram'ming of long words w/ high-bit chars. This approach
didn't make sense for high-bit languages regardless, and the results
here show it wasn't doing any good that couldn't be gotten cheaper.
There may even be a slight f-n rate improvement now. This also chops
about 2MB off the database size on my runs.
2) Removed http:// etc thingies; they're already getting parsed specially.
Leaving them in the body of the text was likely to lead to redundant
"skip:< nn" and "skip:h nn" tokens, giving an artificial boost (whether
towards ham or spam doesn't matter) to msgs simply containing URLs.
I still need to fix now-out-of-date comments.
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.000 0.000 tied
0.075 0.075 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 0 times
tied 20 times
lost 0 times
total unique fp went from 8 to 8 tied
false negative percentages
0.255 0.218 won -14.51%
0.364 0.364 tied
0.291 0.291 tied
0.509 0.509 tied
0.436 0.400 won -8.26%
0.218 0.218 tied
0.218 0.218 tied
0.582 0.582 tied
0.327 0.291 won -11.01%
0.255 0.255 tied
0.291 0.291 tied
0.582 0.582 tied
0.545 0.545 tied
0.255 0.255 tied
0.291 0.255 won -12.37%
0.400 0.400 tied
0.291 0.291 tied
0.218 0.218 tied
0.218 0.182 won -16.51%
0.182 0.145 won -20.33%
won 6 times
tied 14 times
lost 0 times
total unique fn went from 90 to 86 won -4.44%
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** tokenizer.py 12 Sep 2002 02:46:15 -0000 1.18
--- tokenizer.py 12 Sep 2002 04:19:38 -0000 1.19
***************
*** 588,592 ****
def tokenize_word(word, _len=len):
n = _len(word)
-
# Make sure this range matches in tokenize().
if 3 <= n <= 12:
--- 588,591 ----
***************
*** 604,638 ****
yield 'email addr:' + p2
- # If there are any high-bit chars,
- # tokenize it as byte 5-grams.
- # XXX This really won't work for high-bit languages -- the scoring
- # XXX scheme throws almost everything away, and one bad phrase can
- # XXX generate enough bad 5-grams to dominate the final score.
- # XXX This also increases the database size substantially.
- elif has_highbit_char(word):
- for i in xrange(n-4):
- yield "5g:" + word[i : i+5]
- """
- # If there are any high-bit chars, tokenize it as byte 3-grams.
- # XXX This really won't work for high-bit languages -- the scoring
- # XXX scheme throws almost everything away, and one bad phrase can
- # XXX generate enough bad 3-grams to dominate the final score.
- # XXX This also increases the database size substantially.
- elif has_highbit_char(word):
- counthi = 0
- ch1 = ch2 = ''
- for ch in word:
- if ord(ch) >= 128:
- counthi += 1
- yield "3g:%s" % (ch1 + ch2 + ch)
- ch1 = ch2
- ch2 = ch
- ratio = round(counthi * 20.0 / len(word)) * 5
- yield "8bit%%:%d" % ratio
- """
else:
- # It's a long string of "normal" chars. Ignore it.
- # For example, it may be an embedded URL (which we already
- # tagged), or a uuencoded line.
# There's value in generating a token indicating roughly how
# many chars were skipped. This has real benefit for the f-n
--- 603,607 ----
***************
*** 641,644 ****
--- 610,619 ----
# XXX this info has greater benefit.
yield "skip:%c %d" % (word[0], n // 10 * 10)
+ if has_highbit_char(word):
+ hicount = 0
+ for i in map(ord, word):
+ if i >= 128:
+ hicount += 1
+ yield "8bit%%:%d" % round(hicount * 100.0 / len(word))
# Generate tokens for:
***************
*** 801,804 ****
--- 776,814 ----
return ''.join(new_text), tokens
+ def crack_urls(text):
+ new_text = []
+ clues = []
+ pushclue = clues.append
+ i = 0
+ while True:
+ # Invariant: Through text[:i], all non-URL text is in new_text, and
+ # clues contains clues for all URLs. text[i:] hasn't been looked at
+ # yet.
+ m = url_re.search(text, i)
+ if not m:
+ new_text.append(text[i:])
+ break
+ proto, guts = m.groups()
+ start, end = m.span()
+ new_text.append(text[i : start])
+ new_text.append(' ')
+
+ pushclue("proto:" + proto)
+ # Lose the trailing punctuation for casual embedding, like:
+ # The code is at http://mystuff.org/here? Didn't resolve.
+ # or
+ # I found it at http://mystuff.org/there/. Thanks!
+ assert guts
+ while guts and guts[-1] in '.:?!/':
+ guts = guts[:-1]
+ for i, piece in enumerate(guts.split('/')):
+ prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
+ for chunk in urlsep_re.split(piece):
+ pushclue(prefix + chunk)
+
+ i = end
+
+ return ''.join(new_text), clues
+
class Tokenizer:
***************
*** 942,958 ****
# Special tagging of embedded URLs.
! for proto, guts in url_re.findall(text):
! yield "proto:" + proto
! # Lose the trailing punctuation for casual embedding, like:
! # The code is at http://mystuff.org/here? Didn't resolve.
! # or
! # I found it at http://mystuff.org/there/. Thanks!
! assert guts
! while guts and guts[-1] in '.:?!/':
! guts = guts[:-1]
! for i, piece in enumerate(guts.split('/')):
! prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
! for chunk in urlsep_re.split(piece):
! yield prefix + chunk
# Anthony Baxter reported goodness from tokenizing src= params.
--- 952,958 ----
# Special tagging of embedded URLs.
! text, tokens = crack_urls(text)
! for t in tokens:
! yield t
# Anthony Baxter reported goodness from tokenizing src= params.
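A small Python 3 sketch of the URL tokenization scheme factored into crack_urls above: strip trailing punctuation from casual embeddings, tag the protocol, then tag the first two path components distinctly from the rest. The regexes here are simplified (the real url_re is stricter than `\S+`):

```python
import re

url_re = re.compile(r"(https?|ftp)://(\S+)")
urlsep_re = re.compile(r"[;?:@&=+,$.]")

def crack_url_tokens(text):
    clues = []
    for proto, guts in url_re.findall(text):
        clues.append("proto:" + proto)
        # Lose trailing punctuation, e.g. "http://mystuff.org/there/."
        while guts and guts[-1] in '.:?!/':
            guts = guts[:-1]
        for i, piece in enumerate(guts.split('/')):
            # First two path components get their own prefixes; the
            # rest are lumped together under ">1".
            prefix = "%s%s:" % (proto, str(i) if i < 2 else '>1')
            for chunk in urlsep_re.split(piece):
                clues.append(prefix + chunk)
    return clues

print(crack_url_tokens("see http://mystuff.org/there/. Thanks!"))
```

Because crack_urls also removes the URL text from the body, the tokenizer no longer generates the redundant "skip:" tokens the log message mentions.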
From gvanrossum@users.sourceforge.net Thu Sep 12 06:10:04 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Wed, 11 Sep 2002 22:10:04 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.15,1.16
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6989
Modified Files:
hammie.py
Log Message:
Use the _mh_msgno feature I just added to Python 2.3's
mailbox.MHMailbox class, if available, to report the correct message
number for spams in -u mode.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** hammie.py 8 Sep 2002 03:20:18 -0000 1.15
--- hammie.py 12 Sep 2002 05:10:02 -0000 1.16
***************
*** 251,257 ****
prob, clues = bayes.spamprob(tokenize(msg), True)
isspam = prob >= 0.9
if isspam:
spams += 1
! print "%6s %4.2f %1s" % (i, prob, isspam and "S" or "."),
print formatclues(clues)
else:
--- 251,261 ----
prob, clues = bayes.spamprob(tokenize(msg), True)
isspam = prob >= 0.9
+ if hasattr(msg, '_mh_msgno'):
+ msgno = msg._mh_msgno
+ else:
+ msgno = i
if isspam:
spams += 1
! print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
print formatclues(clues)
else:
From anthony@interlink.com.au Thu Sep 12 08:13:20 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Thu, 12 Sep 2002 17:13:20 +1000
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.16,1.17
In-Reply-To:
Message-ID: <200209120713.g8C7DLj24609@localhost.localdomain>
>>> "Tim Peters" wrote
> Modified Files:
> tokenizer.py
> Log Message:
> Added code to strip uuencoded sections. As reported on the mailing list,
> this has no effect on my results, except that one spam is now judged as
> ham by all the other training sets. It shrinks the database size by a
> few percent, so that makes it a tiny win. If Anthony Baxter doesn't
> report better results on his data, I'll be sorely tempted to throw this
> out again.
I'd say nuke it:
anthony_tok1.16s -> anthony_tok1.17s
false positive percentages
0.778 0.778 tied
0.834 0.778 won -6.71%
0.890 0.890 tied
0.667 0.611 won -8.40%
1.112 1.112 tied
0.834 0.834 tied
0.723 0.723 tied
0.667 0.611 won -8.40%
1.167 1.167 tied
1.001 1.001 tied
0.779 0.779 tied
0.667 0.611 won -8.40%
0.778 0.778 tied
0.778 0.778 tied
0.556 0.556 tied
0.778 0.723 won -7.07%
0.611 0.611 tied
0.778 0.778 tied
0.723 0.723 tied
0.667 0.667 tied
won 5 times
tied 15 times
lost 0 times
total unique fp went from 143 to 141 won -1.40%
false negative percentages
0.646 0.646 tied
0.904 0.904 tied
0.517 0.581 lost +12.38%
1.229 1.229 tied
0.840 0.840 tied
1.033 1.033 tied
0.711 0.775 lost +9.00%
1.164 1.164 tied
0.646 0.646 tied
0.711 0.711 tied
0.646 0.711 lost +10.06%
0.517 0.517 tied
0.776 0.776 tied
0.646 0.646 tied
0.904 0.904 tied
1.035 1.035 tied
0.582 0.582 tied
0.581 0.581 tied
0.775 0.775 tied
0.646 0.646 tied
won 0 times
tied 17 times
lost 3 times
From rubiconx@users.sourceforge.net Thu Sep 12 08:24:55 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Thu, 12 Sep 2002 00:24:55 -0700
Subject: [Spambayes-checkins] spambayes cdbhammie.py,NONE,1.1
cdbwrap.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5165
Added Files:
cdbhammie.py cdbwrap.py
Log Message:
A version of hammie to use CDB. Something may be wrong with it--the
databases it creates are *gargantuan*. But it works.
--- NEW FILE: cdbhammie.py ---
#! /usr/bin/env python
# At the moment, this requires Python 2.3 from CVS
# A driver for the classifier module and Tim's tokenizer that you can
# call from procmail. This one uses Neil's cdb module. Will it be
# faster than Berkeley DB hashes?
"""Usage: %(program)s [options]
Where:
-h
show usage and exit
-g PATH
mbox or directory of known good messages (non-spam) to train on.
-s PATH
mbox or directory of known spam messages to train on.
-u PATH
mbox of unknown messages. A ham/spam decision is reported for each.
-p FILE
use file as the persistent store. loads data from this file if it
exists, and saves data to this file at the end. Default: %(DEFAULTDB)s
-f
run as a filter: read a single message from stdin, add an
%(DISPHEADER)s header, and write it to stdout.
"""
import sys
import os
import getopt
import mailbox
import glob
import email
import classifier
import errno
import cdb
import cPickle as pickle
program = sys.argv[0] # For usage(); referenced by docstring above
# Name of the header to add in filter mode
DISPHEADER = "X-Hammie-Disposition"
# Default database name
DEFAULTDB = "hammie.db"
# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
from tokenizer import tokenize
from cdbwrap import CDBShelf
class CDBDict(CDBShelf):
"""Constant Database Dictionary
This wraps a cdb to make it look even more like a dictionary.
Call it with the name of your database file. Optionally, you can
specify a list of keys to skip when iterating. This only affects
iterators; things like .keys() still list everything. For instance:
>>> d = CDBDict('/tmp/goober.db', ('skipme', 'skipmetoo'))
>>> d['skipme'] = 'booga'
>>> d['countme'] = 'wakka'
>>> print d.keys()
['skipme', 'countme']
>>> for k in d.iterkeys():
... print k
countme
"""
def __init__(self, dbname, iterskip=()):
CDBShelf.__init__(self, dbname)
self.iterskip = iterskip
def __iter__(self, fn=lambda k,v: (k,v)):
for key in self.dict.iterkeys():
val = self.get(key)
if key not in self.iterskip:
yield fn(key, val)
def __setitem__(self, key, value):
v = pickle.dumps(value, 1)
self.dict[key] = v
def iteritems(self):
return self.__iter__()
def iterkeys(self):
return self.__iter__(lambda k,v: k)
def itervalues(self):
return self.__iter__(lambda k,v: v)
def items(self):
ret = []
for i in self.iteritems():
ret.append(i)
return ret
def keys(self):
ret = []
for i in self.iterkeys():
ret.append(i)
return ret
def values(self):
ret = []
for i in self.itervalues():
ret.append(i)
return ret
def __contains__(self, name):
return self.has_key(name)
class PersistentGrahamBayes(classifier.GrahamBayes):
"""A persistent GrahamBayes classifier
This is just like classifier.GrahamBayes, except that the dictionary
is a database. You take less disk this way, I think, and you can
pretend it's persistent. It's much slower training, but much faster
checking, and takes less memory all around.
On destruction, an instantiation of this class will write its state
to a special key. When you instantiate a new one, it will attempt
to read these values out of that key again, so you can pick up where
you left off.
"""
# XXX: Would it be even faster to remember (in a list) which keys
# had been modified, and only recalculate those keys? No sense in
# going over the entire word database if only 100 words are
# affected.
# XXX: Another idea: cache stuff in memory. But by then maybe we
# should just use ZODB.
def __init__(self, dbname):
classifier.GrahamBayes.__init__(self)
self.statekey = "saved state"
self.wordinfo = CDBDict(dbname, (self.statekey,))
self.restore_state()
def __del__(self):
#super.__del__(self)
self.save_state()
def save_state(self):
self.wordinfo[self.statekey] = (self.nham, self.nspam)
def restore_state(self):
if self.wordinfo.has_key(self.statekey):
self.nham, self.nspam = self.wordinfo[self.statekey]
class DirOfTxtFileMailbox:
"""Mailbox directory consisting of .txt files."""
def __init__(self, dirname, factory):
self.names = glob.glob(os.path.join(dirname, "*.txt"))
self.factory = factory
def __iter__(self):
for name in self.names:
try:
f = open(name)
except IOError:
continue
yield self.factory(f)
f.close()
def getmbox(msgs):
"""Return an iterable mbox object given a file/directory/folder name."""
def _factory(fp):
try:
return email.message_from_file(fp)
except email.Errors.MessageParseError:
return ''
if msgs.startswith("+"):
import mhlib
mh = mhlib.MH()
mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]),
_factory)
elif os.path.isdir(msgs):
# XXX Bogus: use an MHMailbox if the pathname contains /Mail/,
# else a DirOfTxtFileMailbox.
if msgs.find("/Mail/") >= 0:
mbox = mailbox.MHMailbox(msgs, _factory)
else:
mbox = DirOfTxtFileMailbox(msgs, _factory)
else:
fp = open(msgs)
mbox = mailbox.PortableUnixMailbox(fp, _factory)
return mbox
def train(bayes, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
mbox = getmbox(msgs)
i = 0
for msg in mbox:
i += 1
# XXX: Is the \r a Unixism? I seem to recall it working in DOS
# back in the day. Maybe it's a line-printer-ism ;)
sys.stdout.write("\r%6d" % i)
sys.stdout.flush()
bayes.learn(tokenize(msg), is_spam, False)
print
def formatclues(clues, sep="; "):
"""Format the clues into something readable."""
return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])
def filter(bayes, input, output):
"""Filter (judge) a message"""
msg = email.message_from_file(input)
prob, clues = bayes.spamprob(tokenize(msg), True)
if prob < 0.9:
disp = "No"
else:
disp = "Yes"
disp += "; %.2f" % prob
disp += "; " + formatclues(clues)
msg.add_header(DISPHEADER, disp)
output.write(msg.as_string(unixfrom=(msg.get_unixfrom() is not None)))
def score(bayes, msgs):
"""Score (judge) all messages from a mailbox."""
# XXX The reporting needs work!
mbox = getmbox(msgs)
i = 0
spams = hams = 0
for msg in mbox:
i += 1
prob, clues = bayes.spamprob(tokenize(msg), True)
isspam = prob >= 0.9
if isspam:
spams += 1
print "%6s %4.2f %1s" % (i, prob, isspam and "S" or "."),
print formatclues(clues)
else:
hams += 1
print "Total %d spam, %d ham" % (spams, hams)
def usage(code, msg=''):
"""Print usage message and sys.exit(code)."""
if msg:
print >> sys.stderr, msg
print >> sys.stderr
print >> sys.stderr, __doc__ % globals()
sys.exit(code)
def main():
"""Main program; parse options and go."""
try:
opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:')
except getopt.error, msg:
usage(2, msg)
if not opts:
usage(2, "No options given")
pck = DEFAULTDB
good = spam = unknown = None
do_filter = usedb = False
for opt, arg in opts:
if opt == '-h':
usage(0)
elif opt == '-g':
good = arg
elif opt == '-s':
spam = arg
elif opt == '-p':
pck = arg
elif opt == "-d":
usedb = True
elif opt == "-f":
do_filter = True
elif opt == '-u':
unknown = arg
if args:
usage(2, "Positional arguments not allowed")
save = False
if usedb:
bayes = PersistentGrahamBayes(pck)
else:
bayes = None
try:
fp = open(pck, 'rb')
except IOError, e:
            if e.errno != errno.ENOENT:
                raise
else:
bayes = pickle.load(fp)
fp.close()
if bayes is None:
bayes = classifier.GrahamBayes()
if good:
print "Training ham:"
train(bayes, good, False)
save = True
if spam:
print "Training spam:"
train(bayes, spam, True)
save = True
if save:
bayes.update_probabilities()
if not usedb and pck:
fp = open(pck, 'wb')
pickle.dump(bayes, fp, 1)
fp.close()
if do_filter:
filter(bayes, sys.stdin, sys.stdout)
if unknown:
score(bayes, unknown)
if __name__ == "__main__":
main()
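The load-or-create persistence dance in main() (try the pickle, fall back to a fresh GrahamBayes on ENOENT, re-pickle after training) is worth isolating. A minimal modern-Python sketch of the same pattern, with an illustrative stand-in class since spambayes itself is Python 2 era:

```python
import errno
import os
import pickle
import tempfile

class Classifier:
    """Stand-in for classifier.GrahamBayes: just counts trained spams."""
    def __init__(self):
        self.nspam = 0

def load_or_create(pck):
    """Return the classifier pickled at `pck`, or a fresh one if absent."""
    try:
        with open(pck, "rb") as fp:
            return pickle.load(fp)
    except IOError as e:
        # only swallow "no such file"; re-raise real I/O errors,
        # mirroring the errno.ENOENT check in main()
        if e.errno != errno.ENOENT:
            raise
        return Classifier()

def save(bayes, pck):
    with open(pck, "wb") as fp:
        pickle.dump(bayes, fp)

pck = os.path.join(tempfile.mkdtemp(), "hammie.db")
bayes = load_or_create(pck)       # first run: fresh classifier
bayes.nspam += 1                  # ... train ...
save(bayes, pck)
print(load_or_create(pck).nspam)  # → 1  (state survives the round trip)
```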
--- NEW FILE: cdbwrap.py ---
#! /usr/bin/env python
import cdb
import tempfile
import struct
import time
import os
import shelve
from sets import Set
class DELITEM:
# Special class to signify a deleted item
pass
class CDBDict:
def __init__(self, filename):
self.filename = filename
try:
self.fp = open(filename, "rb")
self.db = cdb.Cdb(self.fp)
except:
self.fp = None
self.db = {}
self.cache = {}
self.newkeys = []
def __delitem__(self, key):
self[key] = DELITEM
    def __getitem__(self, key):
        try:
            val = self.cache[key]
        except KeyError:
            # fall back to the read-only db; raises KeyError if absent
            val = self.db[key]
        if val is DELITEM:
            raise KeyError, key
        return val
    def __setitem__(self, key, val):
        self.cache[key] = val
        # track genuinely new keys once, so __iter__ doesn't yield duplicates
        if self.db.get(key) is None and key not in self.newkeys:
            self.newkeys.append(key)
    def __del__(self):
        if self.cache:
            # re-import locally: module globals may already be torn down
            # by the time __del__ runs at interpreter shutdown
            import cdb
if 1:
newf = "%s.txt" % self.filename
fp = open(newf, "wb")
for key,value in self.iteritems():
fp.write("+%d,%d:%s->%s\n" % (len(key), len(value), key, value))
fp.write("\n")
fp.close()
else:
# XXX: security risk, but how to do this without the symlink
# problem?
newf = "%s-%f" % (self.filename, time.time())
fp = open(newf, "wb")
cdb.cdb_make(fp, self.iteritems())
fp.close()
os.rename(newf, self.filename)
def __iter__(self, fn=lambda k,v: (k,v)):
for key in self.newkeys:
val = self.cache[key]
if val is DELITEM:
continue
else:
yield fn(key, val)
        for key,val in self.db.iteritems():
            # a cached value (even a falsy one) overrides the db value;
            # tombstoned keys are skipped
            nval = self.cache.get(key, val)
            if nval is DELITEM:
                continue
            yield fn(key, nval)
def __contains__(self, key):
return self.has_key(key)
def iteritems(self):
return self.__iter__()
def iterkeys(self):
return self.__iter__(lambda k,v: k)
def itervalues(self):
return self.__iter__(lambda k,v: v)
def items(self):
ret = []
for i in self.iteritems():
ret.append(i)
return ret
def keys(self):
ret = []
for i in self.iterkeys():
ret.append(i)
return ret
def values(self):
ret = []
for i in self.itervalues():
ret.append(i)
return ret
def get(self, key, default=None):
try:
val = self[key]
except KeyError:
val = default
return val
    def has_key(self, key):
        # can't use "self.get(key) and True": that returns False for
        # legitimately falsy values and None for missing keys
        try:
            self[key]
        except KeyError:
            return False
        return True
class CDBShelf(shelve.Shelf):
"""Shelf implementation using a Constant Database.
    This is initialized with the filename for the CDB database. See the
    shelve module's __doc__ string for an overview of the interface.
"""
def __init__(self, filename, flag='c'):
db = CDBDict(filename)
shelve.Shelf.__init__(self, db)
def test_shelf():
s = CDBShelf("shelf.cdb")
print "foo ->", s.get("foo")
s["foo"] = s.get("foo", 1.0) + .1
print "foo ->", s.get("foo")
def test_dict():
db = CDBDict("services.cdb")
one = db.get("1")
if one:
print 'db["1"] == %s; deleting' % one
del db["1"]
else:
print 'db["1"] not set; setting'
db["1"] = "One"
print "New value is", db.get("1")
if __name__ == "__main__":
test_shelf()
test_dict()
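CDBDict's central idea, overlaying a mutable write cache (with DELITEM tombstones for pending deletes) on a read-only database, is a general pattern. Here is a minimal modern-Python sketch of the same idea with no cdb dependency; OverlayDict and _DELETED are illustrative names, not spambayes API:

```python
_DELETED = object()   # tombstone sentinel, playing the role of DELITEM

class OverlayDict:
    """A mutable view over a read-only mapping; deletes become tombstones."""
    def __init__(self, base):
        self.base = base      # read-only underlying mapping
        self.cache = {}       # pending writes and tombstones

    def __getitem__(self, key):
        val = self.cache.get(key, self.base.get(key, _DELETED))
        if val is _DELETED:
            raise KeyError(key)
        return val

    def __setitem__(self, key, val):
        self.cache[key] = val

    def __delitem__(self, key):
        self[key]                    # raise KeyError if already absent
        self.cache[key] = _DELETED

    def items(self):
        # walk the union of base and cached keys, skipping tombstones
        for key in {**self.base, **self.cache}:
            try:
                yield key, self[key]
            except KeyError:
                pass

d = OverlayDict({"a": 1, "b": 2})
d["c"] = 3
del d["a"]
print(sorted(d.items()))   # → [('b', 2), ('c', 3)]
```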
From rubiconx@users.sourceforge.net Thu Sep 12 08:28:38 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Thu, 12 Sep 2002 00:28:38 -0700
Subject: [Spambayes-checkins] spambayes cdbhammie.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7172
Modified Files:
cdbhammie.py
Log Message:
You don't need to specify -d to cdbhammie anymore. That is, it now
works as advertised :)
Index: cdbhammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cdbhammie.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** cdbhammie.py 12 Sep 2002 07:24:53 -0000 1.1
--- cdbhammie.py 12 Sep 2002 07:28:36 -0000 1.2
***************
*** 278,283 ****
elif opt == '-p':
pck = arg
- elif opt == "-d":
- usedb = True
elif opt == "-f":
do_filter = True
--- 278,281 ----
***************
*** 289,305 ****
save = False
! if usedb:
! bayes = PersistentGrahamBayes(pck)
! else:
! bayes = None
! try:
! fp = open(pck, 'rb')
! except IOError, e:
! if e.errno <> errno.ENOENT: raise
! else:
! bayes = pickle.load(fp)
! fp.close()
! if bayes is None:
! bayes = classifier.GrahamBayes()
if good:
--- 287,291 ----
save = False
! bayes = PersistentGrahamBayes(pck)
if good:
***************
*** 314,321 ****
if save:
bayes.update_probabilities()
- if not usedb and pck:
- fp = open(pck, 'wb')
- pickle.dump(bayes, fp, 1)
- fp.close()
if do_filter:
--- 300,303 ----
From tim.one@comcast.net Thu Sep 12 15:47:47 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 12 Sep 2002 10:47:47 -0400
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.16,1.17
In-Reply-To: <200209120713.g8C7DLj24609@localhost.localdomain>
Message-ID:
[Tim]
>> Modified Files:
>> tokenizer.py
>> Log Message:
>> Added code to strip uuencoded sections. As reported on the mailing list,
>> this has no effect on my results, except that one spam is now judged as
>> ham by all the other training sets. It shrinks the database size by a
>> few percent, so that makes it a tiny win. If Anthony Baxter doesn't
>> report better results on his data, I'll be sorely tempted to throw this
>> out again.
[Anthony Baxter]
> I'd say nuke it:
>
> false positive percentages
> 0.778 0.778 tied
> 0.834 0.778 won -6.71%
> 0.890 0.890 tied
> 0.667 0.611 won -8.40%
> 1.112 1.112 tied
> 0.834 0.834 tied
> 0.723 0.723 tied
> 0.667 0.611 won -8.40%
> 1.167 1.167 tied
> 1.001 1.001 tied
> 0.779 0.779 tied
> 0.667 0.611 won -8.40%
> 0.778 0.778 tied
> 0.778 0.778 tied
> 0.556 0.556 tied
> 0.778 0.723 won -7.07%
> 0.611 0.611 tied
> 0.778 0.778 tied
> 0.723 0.723 tied
> 0.667 0.667 tied
>
> won 5 times
> tied 15 times
> lost 0 times
>
> total unique fp went from 143 to 141 won -1.40%
>
> false negative percentages
> 0.646 0.646 tied
> 0.904 0.904 tied
> 0.517 0.581 lost +12.38%
> 1.229 1.229 tied
> 0.840 0.840 tied
> 1.033 1.033 tied
> 0.711 0.775 lost +9.00%
> 1.164 1.164 tied
> 0.646 0.646 tied
> 0.711 0.711 tied
> 0.646 0.711 lost +10.06%
> 0.517 0.517 tied
> 0.776 0.776 tied
> 0.646 0.646 tied
> 0.904 0.904 tied
> 1.035 1.035 tied
> 0.582 0.582 tied
> 0.581 0.581 tied
> 0.775 0.775 tied
> 0.646 0.646 tied
>
> won 0 times
> tied 17 times
> lost 3 times
So there's one spam in your Set4 that gets through when scored by Sets 1, 2
or 3 now, but two hams that are no longer called spam by any training set.
That's a small win, so I'm inclined to leave it in after all (it's a cheap
transformation, and keeps a bunch of useless "skip" tokens out of the
database).
From montanaro@users.sourceforge.net Thu Sep 12 20:33:56 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Thu, 12 Sep 2002 12:33:56 -0700
Subject: [Spambayes-checkins] spambayes rebal.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1160
Modified Files:
rebal.py
Log Message:
nearly complete rewrite which attempts to achieve the following:
* allows specification of reservoir directory and prefix of set
directories
* will automatically fill any set directories which match the -s pattern
* will migrate files in either direction - in theory, no files should be
deleted
* should be a bit more efficient so varying the numbers of trained ham
and spam shouldn't be a big problem
With no args it should work like the original.
Index: rebal.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rebal.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** rebal.py 5 Sep 2002 16:16:43 -0000 1.1
--- rebal.py 12 Sep 2002 19:33:54 -0000 1.2
***************
*** 1,58 ****
! import os
! import sys
! import random
- '''
- dead = """
- Data/Ham/Set2/22467.txt
- Data/Ham/Set5/31389.txt
- Data/Ham/Set1/19642.txt
"""
! for f in dead.split():
! os.unlink(f)
! sys.exit(0)
! '''
NPERDIR = 4000
RESDIR = 'Data/Ham/reservoir'
! res = os.listdir(RESDIR)
! stuff = []
! for i in range(1, 6):
! dir = 'Data/Ham/Set%d' % i
! fs = os.listdir(dir)
! stuff.append((dir, fs))
! while stuff:
! dir, fs = stuff.pop()
! if len(fs) == NPERDIR:
! continue
! if len(fs) > NPERDIR:
! f = random.choice(fs)
! fs.remove(f)
! print "deleting", f, "from", dir
! os.unlink(dir + "/" + f)
! elif len(fs) < NPERDIR:
! print "need a new one for", dir
! f = random.choice(res)
! print "How about", f
! res.remove(f)
! fp = file(RESDIR + "/" + f, 'rb')
! guts = fp.read()
! fp.close()
! os.unlink(RESDIR + "/" + f)
! print guts
! ok = raw_input('good enough? ')
! if ok.startswith('y'):
! fp = file(dir + "/" + f, 'wb')
! fp.write(guts)
! fp.close()
! fs.append(f)
! stuff.append((dir, fs))
--- 1,166 ----
! #!/usr/bin/env python
"""
+ rebal.py - rebalance a ham or spam directory, moving files to or from
+ a reservoir directory as necessary.
! usage: rebal.py [ options ]
! options:
! -r res - specify an alternate reservoir [%(RESDIR)s]
! -s set - specify an alternate Set pfx [%(SETPFX)s]
! -n num - specify number of files per dir [%(NPERDIR)s]
! -v - tell user what's happening [%(VERBOSE)s]
! -q - be quiet about what's happening [not %(VERBOSE)s]
! -c - confirm file moves into Set directory [%(CONFIRM)s]
! -Q - be quiet and don't confirm moves
! The script will work with a variable number of Set directories, but they
! must already exist.
!
! Example:
!
! rebal.py -r reservoir -s Set -n 300
!
! This will move random files between the directory 'reservoir' and the
! various subdirectories prefixed with 'Set', making sure no more than 300
! files are left in the 'Set' directories when finished.
!
! Example:
!
! Suppose you want to shuffle your Set files around, winding up with 300 files
! in each one, you can execute:
!
! rebal.py -n 0
! rebal.py -n 300
!
! The first run will move all files from the various Data/Ham/Set directories
! to the Data/Ham/reservoir directory. The second run will randomly parcel
! out 300 files to each of the Data/Ham/Set directories.
! """
!
! import os
! import sys
! import random
! import glob
! import getopt
+ # defaults
NPERDIR = 4000
RESDIR = 'Data/Ham/reservoir'
! SETPFX = 'Data/Ham/Set'
! VERBOSE = True
! CONFIRM = True
! def usage():
! print >> sys.stderr, """\
! usage: rebal.py [ options ]
! options:
! -r res - specify an alternate reservoir [%(RESDIR)s]
! -s set - specify an alternate Set pfx [%(SETPFX)s]
! -n num - specify number of files per dir [%(NPERDIR)s]
! -v - tell user what's happening [%(VERBOSE)s]
! -q - be quiet about what's happening [not %(VERBOSE)s]
! -c - confirm file moves into Set directory [%(CONFIRM)s]
! -Q - be quiet and don't confirm moves
! """ % globals()
!
! def migrate(f, dir, verbose):
! """rename f into dir, making sure to avoid name clashes."""
! base = os.path.split(f)[-1]
! if os.path.exists(os.path.join(dir,base)):
! # this path can get slow if we have a lot of name collisions
! # but we should rarely encounter that case (so he says smugly)
! reslist = [int(n) for n in os.listdir(dir)]
! reslist.sort()
! out = os.path.join(dir, "%d"%(reslist[-1]+1))
! else:
! out = os.path.join(dir, base)
! if verbose:
! print "moving", f, "to", out
! os.rename(f, out)
!
! def main(args):
! nperdir = NPERDIR
! resdir = RESDIR
! setpfx = SETPFX
! verbose = VERBOSE
! confirm = CONFIRM
!
! try:
! opts, args = getopt.getopt(args, "r:s:n:vqcQh")
! except getopt.GetoptError:
! usage()
! return 1
! for opt, arg in opts:
! if opt == "-n":
! nperdir = int(arg)
! elif opt == "-r":
! resdir = arg
! elif opt == "-s":
! setpfx = arg
! elif opt == "-v":
! verbose = True
! elif opt == "-c":
! confirm = True
! elif opt == "-q":
! verbose = False
! elif opt == "-Q":
! verbose = confirm = False
! elif opt == "-h":
! usage()
! return 0
! res = os.listdir(resdir)
! dirs = glob.glob(setpfx+"*")
! if dirs == []:
! print >> sys.stderr, "no directories beginning with", setpfx, "exist."
! return 1
! stuff = []
! n = len(res)
! for dir in dirs:
! fs = os.listdir(dir)
! n += len(fs)
! stuff.append((dir, fs))
! if nperdir * len(dirs) > n:
! print >> sys.stderr, "not enough files to go around - use lower -n."
! return 1
! # if necessary, migrate random files to the reservoir
! for (dir, fs) in stuff:
! if nperdir >= len(fs):
! continue
!
! random.shuffle(fs)
! movethese = fs[nperdir:]
! del fs[nperdir:]
! for f in movethese:
! migrate(os.path.join(dir,f), resdir, verbose)
! res.extend(movethese)
!
! # randomize reservoir once so we can just bite chunks from the front
! random.shuffle(res)
!
! # grow Set* directories from the reservoir
! for (dir, fs) in stuff:
! if nperdir == len(fs):
! continue
!
! movethese = res[:nperdir-len(fs)]
! res = res[nperdir-len(fs):]
! for f in movethese:
! if confirm:
! print file(os.path.join(resdir,f)).read()
! ok = raw_input('good enough? ').lower()
! if not ok.startswith('y'):
! continue
! migrate(os.path.join(resdir,f), dir, verbose)
! fs.extend(movethese)
!
! return 0
!
! if __name__ == "__main__":
! sys.exit(main(sys.argv[1:]))
From montanaro@users.sourceforge.net Thu Sep 12 20:35:16 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Thu, 12 Sep 2002 12:35:16 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.6,1.7 rates.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1522
Modified Files:
cmp.py rates.py
Log Message:
add #! lines
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** cmp.py 8 Sep 2002 18:38:59 -0000 1.6
--- cmp.py 12 Sep 2002 19:35:14 -0000 1.7
***************
*** 1,2 ****
--- 1,4 ----
+ #!/usr/bin/env python
+
"""
cmp.py sbase1 sbase2
Index: rates.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rates.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** rates.py 7 Sep 2002 16:39:04 -0000 1.2
--- rates.py 12 Sep 2002 19:35:14 -0000 1.3
***************
*** 1,2 ****
--- 1,4 ----
+ #!/usr/bin/env python
+
"""
rates.py basename
From tim_one@users.sourceforge.net Fri Sep 13 00:59:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 12 Sep 2002 16:59:08 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.19,1.20
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18626
Modified Files:
tokenizer.py
Log Message:
crack_urls(): Simpler tagging of embedded http etc thingies. Test
results show that the fine distinctions being drawn were a waste of code:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.025 won -50.00%
0.000 0.000 tied
0.025 0.025 tied
0.000 0.000 tied
0.075 0.075 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.050 0.025 won -50.00%
0.000 0.000 tied
0.025 0.025 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.025 won -50.00%
won 3 times
tied 17 times
lost 0 times
total unique fp went from 8 to 8 tied
false negative percentages
0.218 0.218 tied
0.364 0.364 tied
0.291 0.327 lost +12.37%
0.509 0.545 lost +7.07%
0.400 0.400 tied
0.218 0.218 tied
0.218 0.218 tied
0.582 0.545 won -6.36%
0.291 0.291 tied
0.255 0.255 tied
0.291 0.291 tied
0.582 0.582 tied
0.545 0.545 tied
0.255 0.255 tied
0.255 0.255 tied
0.400 0.400 tied
0.291 0.291 tied
0.218 0.218 tied
0.182 0.182 tied
0.145 0.182 lost +25.52%
won 1 times
tied 16 times
lost 3 times
total unique fn went from 86 to 87 lost +1.16%
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** tokenizer.py 12 Sep 2002 04:19:38 -0000 1.19
--- tokenizer.py 12 Sep 2002 23:59:06 -0000 1.20
***************
*** 802,809 ****
while guts and guts[-1] in '.:?!/':
guts = guts[:-1]
! for i, piece in enumerate(guts.split('/')):
! prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
for chunk in urlsep_re.split(piece):
! pushclue(prefix + chunk)
i = end
--- 802,808 ----
while guts and guts[-1] in '.:?!/':
guts = guts[:-1]
! for piece in guts.split('/'):
for chunk in urlsep_re.split(piece):
! pushclue("url:" + chunk)
i = end
From tim_one@users.sourceforge.net Fri Sep 13 01:14:21 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 12 Sep 2002 17:14:21 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.10,1.11
classifier.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv21941
Modified Files:
Options.py classifier.py
Log Message:
Added new options section [Classifier], allowing you to change
HAMBIAS, SPAMBIAS, MIN_SPAMPROB, MAX_SPAMPROB, UNKNOWN_SPAMPROB
and MAX_DISCRIMINATORS. Play with them at your own risk.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** Options.py 12 Sep 2002 02:46:15 -0000 1.10
--- Options.py 13 Sep 2002 00:14:18 -0000 1.11
***************
*** 89,92 ****
--- 89,103 ----
save_trained_pickles: False
pickle_basename: class
+
+ [Classifier]
+ # Fiddling these can have extreme effects. See classifier.py for comments.
+ hambias: 2.0
+ spambias: 1.0
+
+ min_spamprob: 0.01
+ max_spamprob: 0.99
+ unknown_spamprob: 0.5
+
+ max_discriminators: 16
"""
***************
*** 115,118 ****
--- 126,136 ----
'show_charlimit': int_cracker,
},
+ 'Classifier': {'hambias': float_cracker,
+ 'spambias': float_cracker,
+ 'min_spamprob': float_cracker,
+ 'max_spamprob': float_cracker,
+ 'unknown_spamprob': float_cracker,
+ 'max_discriminators': int_cracker,
+ },
}
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** classifier.py 8 Sep 2002 03:17:31 -0000 1.5
--- classifier.py 13 Sep 2002 00:14:18 -0000 1.6
***************
*** 10,13 ****
--- 10,15 ----
from sets import Set
+ from Options import options
+
# The count of each word in ham is artificially boosted by a factor of
# HAMBIAS, and similarly for SPAMBIAS. Graham uses 2.0 and 1.0. Final
***************
*** 26,31 ****
# total unique false negatives goes up by a factor of 2.1 (337 -> 702)
! HAMBIAS = 2.0
! SPAMBIAS = 1.0
# "And then there is the question of what probability to assign to words
--- 28,33 ----
# total unique false negatives goes up by a factor of 2.1 (337 -> 702)
! HAMBIAS = options.hambias # 2.0
! SPAMBIAS = options.spambias # 1.0
# "And then there is the question of what probability to assign to words
***************
*** 35,40 ****
# of training data is good enough to justify probabilities of 0 or 1. It
# may justify probabilities outside this range, though.
! MIN_SPAMPROB = 0.01
! MAX_SPAMPROB = 0.99
# The spam probability assigned to words never seen before. Graham used
--- 37,42 ----
# of training data is good enough to justify probabilities of 0 or 1. It
# may justify probabilities outside this range, though.
! MIN_SPAMPROB = options.min_spamprob # 0.01
! MAX_SPAMPROB = options.max_spamprob # 0.99
# The spam probability assigned to words never seen before. Graham used
***************
*** 50,54 ****
# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
# on the face of it.
! UNKNOWN_SPAMPROB = 0.5
# "I only consider words that occur more than five times in total".
--- 52,56 ----
# of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
# on the face of it.
! UNKNOWN_SPAMPROB = options.unknown_spamprob # 0.5
# "I only consider words that occur more than five times in total".
***************
*** 172,176 ****
# was a pure win, lowering the false negative rate consistently, and it even
# managed to tickle a couple rare false positives into "not spam" territory.
! MAX_DISCRIMINATORS = 16
PICKLE_VERSION = 1
--- 174,178 ----
# was a pure win, lowering the false negative rate consistently, and it even
# managed to tickle a couple rare false positives into "not spam" territory.
! MAX_DISCRIMINATORS = options.max_discriminators # 16
PICKLE_VERSION = 1
From tim_one@users.sourceforge.net Fri Sep 13 01:27:58 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 12 Sep 2002 17:27:58 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.11,1.12
timtest.py,1.22,1.23
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25548
Modified Files:
Options.py timtest.py
Log Message:
Incompatible change: show_best_discriminators has changed from a bool
option to an int option, now giving the number of best discriminators
to show. Set to 0 if you don't want to see any.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** Options.py 13 Sep 2002 00:14:18 -0000 1.11
--- Options.py 13 Sep 2002 00:27:55 -0000 1.12
***************
*** 73,77 ****
show_false_positives: True
show_false_negatives: False
! show_best_discriminators: True
# The maximum # of characters to display for a msg displayed due to the
--- 73,84 ----
show_false_positives: True
show_false_negatives: False
!
! # Near the end of Driver.test(), you can get a listing of the "best
! # discriminators" in the words from the training sets. These are the
! # words whose WordInfo.killcount values are highest, meaning they most
! # often were among the most extreme clues spamprob() found. The number
! # of best discriminators to show is given by show_best_discriminators;
! # set this <= 0 to suppress showing any of the best discriminators.
! show_best_discriminators: 30
# The maximum # of characters to display for a msg displayed due to the
***************
*** 121,125 ****
'show_false_negatives': boolean_cracker,
'show_histograms': boolean_cracker,
! 'show_best_discriminators': boolean_cracker,
'save_trained_pickles': boolean_cracker,
'pickle_basename': string_cracker,
--- 128,132 ----
'show_false_negatives': boolean_cracker,
'show_histograms': boolean_cracker,
! 'show_best_discriminators': int_cracker,
'save_trained_pickles': boolean_cracker,
'pickle_basename': string_cracker,
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** timtest.py 12 Sep 2002 02:58:02 -0000 1.22
--- timtest.py 13 Sep 2002 00:27:55 -0000 1.23
***************
*** 238,245 ****
printmsg(e, prob, clues)
! if options.show_best_discriminators:
print
print " best discriminators:"
! stats = [(-1, None) for i in range(30)]
smallest_killcount = -1
for w, r in c.wordinfo.iteritems():
--- 238,245 ----
printmsg(e, prob, clues)
! if options.show_best_discriminators > 0:
print
print " best discriminators:"
! stats = [(-1, None)] * options.show_best_discriminators
smallest_killcount = -1
for w, r in c.wordinfo.iteritems():
From tim_one@users.sourceforge.net Fri Sep 13 03:40:52 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 12 Sep 2002 19:40:52 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.20,1.21
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25025
Modified Files:
tokenizer.py
Log Message:
Added comment about Reply-To (can't tell whether it's worth tokenizing;
my error rates are too low now).
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** tokenizer.py 12 Sep 2002 23:59:06 -0000 1.20
--- tokenizer.py 13 Sep 2002 02:40:50 -0000 1.21
***************
*** 867,873 ****
# becomes the most powerful indicator in the whole database.
#
! # From:
! # Reply-To:
! for field in ('from',):# 'reply-to',):
prefix = field + ':'
x = msg.get(field, 'none').lower()
--- 867,875 ----
# becomes the most powerful indicator in the whole database.
#
! # From: # this helps both rates
! # Reply-To: # my error rates are too low now to tell about this
! # # one (small wins & losses across runs, overall
! # # not significant), so leaving it out
! for field in ('from',):
prefix = field + ':'
x = msg.get(field, 'none').lower()
From tim_one@users.sourceforge.net Fri Sep 13 17:27:01 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 09:27:01 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,NONE,1.1
Options.py,1.12,1.13
README.txt,1.14,1.15 Tester.py,1.1,1.2 mboxtest.py,1.3,1.4
timtest.py,1.23,1.24
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6489
Modified Files:
Options.py README.txt Tester.py mboxtest.py timtest.py
Added Files:
TestDriver.py
Log Message:
Moved most of the reusable stuff out of timtest.py into the new
TestDriver.py. Added new methods to various things for upcoming
support of efficient N-fold cross validation. timtest.py still works
exactly the way it did before, and I *hope* mboxtest.py does too
but I'm not set up to test that one.
--- NEW FILE: TestDriver.py ---
# Loop:
# # Set up a new base classifier for testing.
# train(ham, spam)
# # Run tests against (possibly variants of) this classifier.
# Loop:
# Optional:
# # Forget training for some subset of ham and spam. This
# # works against the base classifier trained at the start.
# forget(ham, spam)
# # Predict against other data.
# Loop:
# test(ham, spam)
# # Display stats against all runs on this classifier variant.
# finishtest()
# # Display stats against all runs.
# alldone()
from sets import Set
import cPickle as pickle
from heapq import heapreplace
from Options import options
import Tester
import classifier
class Hist:
"""Simple histograms of float values in [0.0, 1.0]."""
def __init__(self, nbuckets=20):
self.buckets = [0] * nbuckets
self.nbuckets = nbuckets
def add(self, x):
n = self.nbuckets
i = int(n * x)
if i >= n:
i = n-1
self.buckets[i] += 1
def __iadd__(self, other):
if self.nbuckets != other.nbuckets:
raise ValueError('bucket size mismatch')
for i in range(self.nbuckets):
self.buckets[i] += other.buckets[i]
return self
def display(self, WIDTH=60):
biggest = max(self.buckets)
hunit, r = divmod(biggest, WIDTH)
if r:
hunit += 1
print "* =", hunit, "items"
ndigits = len(str(biggest))
format = "%6.2f %" + str(ndigits) + "d"
for i, n in enumerate(self.buckets):
print format % (100.0 * i / self.nbuckets, n),
print '*' * ((n + hunit - 1) // hunit)
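The subtle bit in Hist.add is the endpoint: 1.0 must be clamped into the last bucket rather than falling off the end. A tiny modern-Python sketch of just that bucketing arithmetic:

```python
def bucket(x, nbuckets=20):
    """Map x in [0.0, 1.0] to a bucket index; 1.0 is clamped into the
    last bucket, matching the i >= n check in Hist.add."""
    i = int(nbuckets * x)
    return min(i, nbuckets - 1)

counts = [0] * 20
for prob in (0.0, 0.049, 0.07, 0.999, 1.0):
    counts[bucket(prob)] += 1
print(counts[0], counts[1], counts[19])   # → 2 1 2
```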
def printhist(tag, ham, spam):
print
print "Ham distribution for", tag
ham.display()
print
print "Spam distribution for", tag
spam.display()
def printmsg(msg, prob, clues):
print msg.tag
print "prob =", prob
for clue in clues:
print "prob(%r) = %g" % clue
print
guts = str(msg)
if options.show_charlimit > 0:
guts = guts[:options.show_charlimit]
print guts
class Driver:
def __init__(self):
self.falsepos = Set()
self.falseneg = Set()
self.global_ham_hist = Hist(options.nbuckets)
self.global_spam_hist = Hist(options.nbuckets)
self.ntimes_train_called = 0
def train(self, ham, spam):
self.classifier = classifier.GrahamBayes()
t = self.tester = Tester.Test(self.classifier)
print "Training on", ham, "&", spam, "...",
t.train(ham, spam)
print t.nham, "hams &", t.nspam, "spams"
self.orig_nham = t.nham
self.orig_nspam = t.nspam
self.trained_ham_hist = Hist(options.nbuckets)
self.trained_spam_hist = Hist(options.nbuckets)
self.ntimes_train_called += 1
if options.save_trained_pickles:
fname = "%s%d.pik" % (options.pickle_basename,
self.ntimes_train_called)
print " saving pickle to", fname
fp = file(fname, 'wb')
pickle.dump(self.classifier, fp, 1)
fp.close()
def forget(self, ham, spam):
c = self.classifier
t = self.tester
nham, nspam = self.orig_nham, self.orig_nspam
t.set_classifier(c.copy(), nham, nspam)
print "Forgetting", ham, "&", spam, "...",
t.untrain(ham, spam)
print nham - t.nham, "hams &", nspam - t.nspam, "spams"
self.trained_ham_hist = Hist(options.nbuckets)
self.trained_spam_hist = Hist(options.nbuckets)
def finishtest(self):
if options.show_histograms:
printhist("all in this training set:",
self.trained_ham_hist, self.trained_spam_hist)
self.global_ham_hist += self.trained_ham_hist
self.global_spam_hist += self.trained_spam_hist
def alldone(self):
if options.show_histograms:
printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
def test(self, ham, spam):
c = self.classifier
t = self.tester
local_ham_hist = Hist(options.nbuckets)
local_spam_hist = Hist(options.nbuckets)
def new_ham(msg, prob, lo=options.show_ham_lo,
hi=options.show_ham_hi):
local_ham_hist.add(prob)
if lo <= prob <= hi:
print
print "Ham with prob =", prob
prob, clues = c.spamprob(msg, True)
printmsg(msg, prob, clues)
def new_spam(msg, prob, lo=options.show_spam_lo,
hi=options.show_spam_hi):
local_spam_hist.add(prob)
if lo <= prob <= hi:
print
print "Spam with prob =", prob
prob, clues = c.spamprob(msg, True)
printmsg(msg, prob, clues)
t.reset_test_results()
print " testing against", ham, "&", spam, "...",
t.predict(spam, True, new_spam)
t.predict(ham, False, new_ham)
print t.nham_tested, "hams &", t.nspam_tested, "spams"
print " false positive:", t.false_positive_rate()
print " false negative:", t.false_negative_rate()
newfpos = Set(t.false_positives()) - self.falsepos
self.falsepos |= newfpos
print " new false positives:", [e.tag for e in newfpos]
if not options.show_false_positives:
newfpos = ()
for e in newfpos:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues)
newfneg = Set(t.false_negatives()) - self.falseneg
self.falseneg |= newfneg
print " new false negatives:", [e.tag for e in newfneg]
if not options.show_false_negatives:
newfneg = ()
for e in newfneg:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues)
if options.show_best_discriminators > 0:
print
print " best discriminators:"
stats = [(-1, None)] * options.show_best_discriminators
smallest_killcount = -1
for w, r in c.wordinfo.iteritems():
if r.killcount > smallest_killcount:
heapreplace(stats, (r.killcount, w))
smallest_killcount = stats[0][0]
stats.sort()
for count, w in stats:
if count < 0:
continue
r = c.wordinfo[w]
print " %r %d %g" % (w, r.killcount, r.spamprob)
if options.show_histograms:
printhist("this pair:", local_ham_hist, local_spam_hist)
self.trained_ham_hist += local_ham_hist
self.trained_spam_hist += local_spam_hist
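The best-discriminators block in Driver.test keeps the N highest killcounts in a fixed-size heap via heapq.heapreplace. The same pattern as a standalone modern-Python sketch; the word/killcount data is invented, and the sentinel is "" rather than None because Python 3 cannot order None against strings:

```python
from heapq import heapreplace

def top_n(items, n):
    """Return the n (count, word) pairs with the largest counts, ascending."""
    stats = [(-1, "")] * n        # sentinels, as Driver.test seeds its heap
    smallest = -1
    for word, count in items:
        if count > smallest:
            heapreplace(stats, (count, word))   # pop smallest, push new
            smallest = stats[0][0]
    stats.sort()
    return [(c, w) for c, w in stats if c >= 0]  # drop leftover sentinels

killcounts = [("free", 42), ("click", 7), ("python", 1),
              ("viagra", 99), ("meeting", 3)]
print(top_n(killcounts, 3))   # → [(7, 'click'), (42, 'free'), (99, 'viagra')]
```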
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** Options.py 13 Sep 2002 00:27:55 -0000 1.12
--- Options.py 13 Sep 2002 16:26:58 -0000 1.13
***************
*** 57,61 ****
[TestDriver]
! # These control various displays in class Driver (timtest.py).
# Number of buckets in histograms.
--- 57,61 ----
[TestDriver]
! # These control various displays in class TestDriver.Driver.
# Number of buckets in histograms.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** README.txt 9 Sep 2002 19:24:52 -0000 1.14
--- README.txt 13 Sep 2002 16:26:58 -0000 1.15
***************
*** 20,35 ****
! Primary Files
! =============
Options.py
! A start at a flexible way to control what the tokenizer and
! classifier do. Different people are finding different ways in
! which their test data is biased, and so fiddle the code to
! worm around that. It's become almost impossible to know
! exactly what someone did when they report results.
classifier.py
An implementation of a Graham-like classifier.
Tester.py
A test-driver class that feeds streams of msgs to a classifier
--- 20,44 ----
! Primary Core Files
! ==================
Options.py
! Uses ConfigParser to allow fiddling various aspects of the classifier,
! tokenizer, and test drivers. Create a file named bayescustomize.ini to
! alter the defaults; all options and their default values can be found
! in the string "defaults" near the top of Options.py, which is really
! an .ini file embedded in the module. Modules wishing to control
! aspects of their operation merely do
!
! from Options import options
!
! near the start, and consult attributes of options.
classifier.py
An implementation of a Graham-like classifier.
+ tokenizer.py
+ An implementation of tokenize() that Tim can't seem to help but keep
+ working on.
+
Tester.py
A test-driver class that feeds streams of msgs to a classifier
***************
*** 37,58 ****
of false positives and false negatives.
hammie.py
! A spamassassin-like filter which uses tokenizer (below) and
! classifier (above). Needs to be made faster, especially for writes.
- mboxtest.py
- A concrete test driver like timtest.py (see below), but working
- with a pair of mailbox files rather than the specialized timtest
- setup.
! tokenizer.py
! An implementation of tokenize() that Tim can't seem to help but keep
! working on.
timtest.py
! A concrete test driver that uses Tester and classifier (above). This
! assumes "a standard" test data setup (see below). Could stand massive
! refactoring. You need to fiddle a line near the top to import a
! tokenize() function of your choosing.
--- 46,75 ----
of false positives and false negatives.
+ TestDriver.py
+ A higher layer of test helpers, building on Tester above. It's
+ quite usable as-is for building simple test drivers, and more
+ complicated ones up to NxN test grids. It's in the process of being
+ extended to allow easy building of N-way cross validation drivers
+ (the trick to that is doing so efficiently). See also rates.py
+ and cmp.py below.
+
+
+ Apps
+ ====
hammie.py
! A spamassassin-like filter which uses tokenizer and classifier (above).
! Needs to be made faster, especially for writes.
! Concrete Test Drivers
! =====================
! mboxtest.py
! A concrete test driver like timtest.py, but working with a pair of
! mailbox files rather than the specialized timtest setup.
timtest.py
! A concrete test driver like mboxtest.py, but working with "a
! standard" test data setup (see below) rather than the specialized
! mboxtest setup.
***************
*** 105,108 ****
--- 122,131 ----
Standard Test Data Setup
========================
+ [Caution: I'm going to switch this to support N-way cross validation,
+ instead of an NxN test grid. The only effect on the directory structure
+ here is that you'll want more directories with fewer msgs in each
+ (splitting the data at random into 10 pairs should work very well).
+ ]
+
Barry gave me mboxes, but the spam corpus I got off the web had one spam
per file, and it only took two days of extreme pain to realize that one msg
Index: Tester.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Tester.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** Tester.py 5 Sep 2002 16:16:43 -0000 1.1
--- Tester.py 13 Sep 2002 16:26:58 -0000 1.2
***************
*** 2,10 ****
# Pass a classifier instance (an instance of GrahamBayes).
# Loop:
! # Optional:
! # Train it, via train().
! # reset_test_results()
# Loop:
! # invoke predict() with (probably new) examples
# Optional:
# suck out the results, via instance vrbls and
--- 2,17 ----
# Pass a classifier instance (an instance of GrahamBayes).
# Loop:
! # # Train the classifier with new ham and spam.
! # train(ham, spam) # this implies reset_test_results
# Loop:
! # Optional:
! # # Possibly fiddle the classifier.
! # set_classifier()
! # # Forget messages the classifier was trained on.
! # untrain(ham, spam) # this implies reset_test_results
! # Optional:
! # reset_test_results()
! # # Predict against (presumably new) examples.
! # predict(ham, spam)
# Optional:
# suck out the results, via instance vrbls and
***************
*** 13,20 ****
def __init__(self, classifier):
self.classifier = classifier
# The number of ham and spam instances in the training data.
! self.nham = self.nspam = 0
! self.reset_test_results()
def reset_test_results(self):
--- 20,33 ----
def __init__(self, classifier):
+ self.set_classifier(classifier, 0, 0)
+ self.reset_test_results()
+
+ # Tell the tester which classifier to use, and how many ham and spam it's
+ # been trained on.
+ def set_classifier(self, classifier, nham, nspam):
self.classifier = classifier
# The number of ham and spam instances in the training data.
! self.nham = nham
! self.nspam = nspam
def reset_test_results(self):
***************
*** 33,38 ****
# Train the classifier on streams of ham and spam. Updates probabilities
! # before returning.
def train(self, hamstream=None, spamstream=None):
learn = self.classifier.learn
if hamstream is not None:
--- 46,52 ----
# Train the classifier on streams of ham and spam. Updates probabilities
! # before returning, and resets test results.
def train(self, hamstream=None, spamstream=None):
+ self.reset_test_results()
learn = self.classifier.learn
if hamstream is not None:
***************
*** 46,49 ****
--- 60,78 ----
self.classifier.update_probabilities()
+ # Untrain the classifier on streams of ham and spam. Updates
+ # probabilities before returning, and resets test results.
+ def untrain(self, hamstream=None, spamstream=None):
+ self.reset_test_results()
+ unlearn = self.classifier.unlearn
+ if hamstream is not None:
+ for example in hamstream:
+ unlearn(example, False, False)
+ self.nham -= 1
+ if spamstream is not None:
+ for example in spamstream:
+ unlearn(example, True, False)
+ self.nspam -= 1
+ self.classifier.update_probabilities()
+
# Run prediction on each sample in stream. You're swearing that stream
# is entirely composed of spam (is_spam True), or of ham (is_spam False).
***************
*** 113,117 ****
>>> t = Test(GrahamBayes())
>>> t.train([good1, good2], [bad1])
- >>> t.reset_test_results()
>>> t.predict([_Example('goodham', ['a', 'b']),
... _Example('badham', ['d'])
--- 142,145 ----
Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** mboxtest.py 12 Sep 2002 02:46:15 -0000 1.3
--- mboxtest.py 13 Sep 2002 16:26:58 -0000 1.4
***************
*** 8,12 ****
One of unix, mmdf, mh, or qmail. Specifies mailbox format for
ham and spam files. Default is unix.
!
-n NSETS
Number of test sets to create for a single mailbox. Default is 5.
--- 8,12 ----
One of unix, mmdf, mh, or qmail. Specifies mailbox format for
ham and spam files. Default is unix.
!
-n NSETS
Number of test sets to create for a single mailbox. Default is 5.
***************
*** 19,27 ****
"""
- from tokenizer import Tokenizer, subject_word_re, tokenize_word, tokenize
- from classifier import GrahamBayes
- from Tester import Test
- from timtest import Driver, Msg
-
import getopt
import mailbox
--- 19,22 ----
***************
*** 30,33 ****
--- 25,32 ----
import sys
+ from tokenizer import Tokenizer, subject_word_re, tokenize_word, tokenize
+ from TestDriver import Driver
+ from timtest import Msg
+
mbox_fmts = {"unix": mailbox.PortableUnixMailbox,
"mmdf": mailbox.MmdfMailbox,
***************
*** 129,133 ****
def main(args):
global FMT
!
FMT = "unix"
NSETS = 5
--- 128,132 ----
def main(args):
global FMT
!
FMT = "unix"
NSETS = 5
***************
*** 163,167 ****
for ispam in randindices(nspam, NSETS):
testsets.append((sort(iham), sort(ispam)))
!
driver = Driver()
--- 162,166 ----
for ispam in randindices(nspam, NSETS):
testsets.append((sort(iham), sort(ispam)))
!
driver = Driver()
***************
*** 177,179 ****
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
-
--- 176,177 ----
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** timtest.py 13 Sep 2002 00:27:55 -0000 1.23
--- timtest.py 13 Sep 2002 16:26:58 -0000 1.24
***************
*** 20,31 ****
import os
import sys
- from sets import Set
- import cPickle as pickle
- from heapq import heapreplace
- import Tester
- import classifier
- from tokenizer import tokenize
from Options import options
program = sys.argv[0]
--- 20,27 ----
import os
import sys
from Options import options
+ from tokenizer import tokenize
+ from TestDriver import Driver
program = sys.argv[0]
***************
*** 39,95 ****
sys.exit(code)
- class Hist:
- def __init__(self, nbuckets=20):
- self.buckets = [0] * nbuckets
- self.nbuckets = nbuckets
-
- def add(self, x):
- n = self.nbuckets
- i = int(n * x)
- if i >= n:
- i = n-1
- self.buckets[i] += 1
-
- def __iadd__(self, other):
- if self.nbuckets != other.nbuckets:
- raise ValueError('bucket size mismatch')
- for i in range(self.nbuckets):
- self.buckets[i] += other.buckets[i]
- return self
-
- def display(self, WIDTH=60):
- biggest = max(self.buckets)
- hunit, r = divmod(biggest, WIDTH)
- if r:
- hunit += 1
- print "* =", hunit, "items"
-
- ndigits = len(str(biggest))
- format = "%6.2f %" + str(ndigits) + "d"
-
- for i, n in enumerate(self.buckets):
- print format % (100.0 * i / self.nbuckets, n),
- print '*' * ((n + hunit - 1) // hunit)
-
- def printhist(tag, ham, spam):
- print
- print "Ham distribution for", tag
- ham.display()
-
- print
- print "Spam distribution for", tag
- spam.display()
-
- def printmsg(msg, prob, clues):
- print msg.tag
- print "prob =", prob
- for clue in clues:
- print "prob(%r) = %g" % clue
- print
- guts = str(msg)
- if options.show_charlimit > 0:
- guts = guts[:options.show_charlimit]
- print guts
-
class Msg(object):
def __init__(self, dir, name):
--- 35,38 ----
***************
*** 125,129 ****
yield Msg(directory, fname)
! def xproduce(self):
import random
directory = self.directory
--- 68,72 ----
yield Msg(directory, fname)
! def produce(self):
import random
directory = self.directory
***************
*** 136,261 ****
def __iter__(self):
return self.produce()
-
-
- # Loop:
- # train() # on ham and spam
- # Loop:
- # test() # on presumably new ham and spam
- # finishtest() # display stats against all runs on training set
- # alldone() # display stats against all runs
-
- class Driver:
-
- def __init__(self):
- self.falsepos = Set()
- self.falseneg = Set()
- self.global_ham_hist = Hist(options.nbuckets)
- self.global_spam_hist = Hist(options.nbuckets)
- self.ntimes_train_called = 0
-
- def train(self, ham, spam):
- self.classifier = classifier.GrahamBayes()
- t = self.tester = Tester.Test(self.classifier)
-
- print "Training on", ham, "&", spam, "...",
- t.train(ham, spam)
- print t.nham, "hams &", t.nspam, "spams"
-
- self.trained_ham_hist = Hist(options.nbuckets)
- self.trained_spam_hist = Hist(options.nbuckets)
-
- self.ntimes_train_called += 1
- if options.save_trained_pickles:
- fname = "%s%d.pik" % (options.pickle_basename,
- self.ntimes_train_called)
- print " saving pickle to", fname
- fp = file(fname, 'wb')
- pickle.dump(self.classifier, fp, 1)
- fp.close()
-
- def finishtest(self):
- if options.show_histograms:
- printhist("all in this training set:",
- self.trained_ham_hist, self.trained_spam_hist)
- self.global_ham_hist += self.trained_ham_hist
- self.global_spam_hist += self.trained_spam_hist
-
- def alldone(self):
- if options.show_histograms:
- printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
-
- def test(self, ham, spam):
- c = self.classifier
- t = self.tester
- local_ham_hist = Hist(options.nbuckets)
- local_spam_hist = Hist(options.nbuckets)
-
- def new_ham(msg, prob, lo=options.show_ham_lo,
- hi=options.show_ham_hi):
- local_ham_hist.add(prob)
- if lo <= prob <= hi:
- print
- print "Ham with prob =", prob
- prob, clues = c.spamprob(msg, True)
- printmsg(msg, prob, clues)
-
- def new_spam(msg, prob, lo=options.show_spam_lo,
- hi=options.show_spam_hi):
- local_spam_hist.add(prob)
- if lo <= prob <= hi:
- print
- print "Spam with prob =", prob
- prob, clues = c.spamprob(msg, True)
- printmsg(msg, prob, clues)
-
- t.reset_test_results()
- print " testing against", ham, "&", spam, "...",
- t.predict(spam, True, new_spam)
- t.predict(ham, False, new_ham)
- print t.nham_tested, "hams &", t.nspam_tested, "spams"
-
- print " false positive:", t.false_positive_rate()
- print " false negative:", t.false_negative_rate()
-
- newfpos = Set(t.false_positives()) - self.falsepos
- self.falsepos |= newfpos
- print " new false positives:", [e.tag for e in newfpos]
- if not options.show_false_positives:
- newfpos = ()
- for e in newfpos:
- print '*' * 78
- prob, clues = c.spamprob(e, True)
- printmsg(e, prob, clues)
-
- newfneg = Set(t.false_negatives()) - self.falseneg
- self.falseneg |= newfneg
- print " new false negatives:", [e.tag for e in newfneg]
- if not options.show_false_negatives:
- newfneg = ()
- for e in newfneg:
- print '*' * 78
- prob, clues = c.spamprob(e, True)
- printmsg(e, prob, clues)
-
- if options.show_best_discriminators > 0:
- print
- print " best discriminators:"
- stats = [(-1, None)] * options.show_best_discriminators
- smallest_killcount = -1
- for w, r in c.wordinfo.iteritems():
- if r.killcount > smallest_killcount:
- heapreplace(stats, (r.killcount, w))
- smallest_killcount = stats[0][0]
- stats.sort()
- for count, w in stats:
- if count < 0:
- continue
- r = c.wordinfo[w]
- print " %r %d %g" % (w, r.killcount, r.spamprob)
-
- if options.show_histograms:
- printhist("this pair:", local_ham_hist, local_spam_hist)
- self.trained_ham_hist += local_ham_hist
- self.trained_spam_hist += local_spam_hist
def drive(nsets):
--- 79,82 ----
From tim_one@users.sourceforge.net Fri Sep 13 17:55:20 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 09:55:20 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16545
Modified Files:
classifier.py
Log Message:
Removed GrahamBayes.DEBUG. It slows things down and I've never had a
use for it (the options support printing lots of stuff from the
test drivers, and that's always been plenty to resolve suspected bugs).
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** classifier.py 13 Sep 2002 00:14:18 -0000 1.6
--- classifier.py 13 Sep 2002 16:55:17 -0000 1.7
***************
*** 219,224 ****
)
- DEBUG = False
-
def __init__(self):
self.wordinfo = {}
--- 219,222 ----
***************
*** 436,442 ****
"""
- if self.DEBUG:
- print "spamprob(%r)" % wordstream
-
# A priority queue to remember the MAX_DISCRIMINATORS best
# probabilities, where "best" means largest distance from 0.5.
--- 434,437 ----
***************
*** 495,500 ****
if evidence:
clues.append((word, prob))
- if self.DEBUG:
- print 'nbest P(%r) = %g' % (word, prob)
prob_product *= prob / sp
inverse_prob_product *= (1.0 - prob) / hp
--- 490,493 ----
***************
*** 577,585 ****
self.wordinfo[word] = record
- if self.DEBUG:
- print 'New probabilities:'
- for w, r in self.wordinfo.iteritems():
- print "P(%r) = %g" % (w, r.spamprob)
-
def clearjunk(self, oldesttime):
"""Forget useless wordinfo records. This can shrink the database size.
--- 570,573 ----
***************
*** 593,604 ****
tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
for w in tonuke:
- if self.DEBUG:
- print "clearjunk removing word %r: %r" % (w, r)
del wordinfo[w]
def _add_msg(self, wordstream, is_spam):
- if self.DEBUG:
- print "_add_msg(%r, %r)" % (wordstream, is_spam)
-
if is_spam:
self.nspam += 1
--- 581,587 ----
***************
*** 620,631 ****
wordinfo[word] = record
- if self.DEBUG:
- print "new count for %r = %d" % (word,
- is_spam and record.spamcount or record.hamcount)
-
def _remove_msg(self, wordstream, is_spam):
- if self.DEBUG:
- print "_remove_msg(%r, %r)" % (wordstream, is_spam)
-
if is_spam:
if self.nspam <= 0:
--- 603,607 ----
From tim_one@users.sourceforge.net Fri Sep 13 18:49:06 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 10:49:06 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.1,1.2
Tester.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30444
Modified Files:
TestDriver.py Tester.py
Log Message:
A little closer to N-fold cross validation.
Removed the Tester nham and nspam attributes. If used properly, they
should have exactly the same values as the classifier's attributes of
the same names. Duplicating the info just created more chances to
screw up.
Changed when classifier pickles are saved, from immediately after
training to Driver.finishtest(). This way meaningful killcounts
are pickled. Since WordInfo.spamprob is almost never 0.5 anymore,
it would be nice to have another gimmick for pruning junk from the
database that doesn't rely on months going by to see which records
remain unused. It *may* work well to prune away WordInfo records
that never survived into spamprob()'s nbest list during testing. That's
speculation and needs to be verified via testing; I don't expect to
get to that in the near future, though; note that testing this would
require splitting the data in a different way, since, by construction,
a word with killcount=0 had no effect whatsoever on any outcome during
predictions.
A very quick check suggested that about half the words in a database
do have killcount 0; I'm surprised it's not a lot more than that, so
maybe I did something wrong; or maybe that's really how things are.
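The pruning gimmick speculated about above could be sketched roughly as follows. This is a hypothetical illustration, not code from the project: `WordInfo` here is a minimal stand-in for the real classifier record, and the criterion (killcount still 0 after testing) is exactly the unverified heuristic the log message describes.

```python
# Hypothetical sketch of killcount-based pruning: drop word records that
# never survived into spamprob()'s nbest list during testing (killcount 0).
# WordInfo is a minimal stand-in for the real classifier.WordInfo record.

class WordInfo:
    def __init__(self, killcount=0):
        self.killcount = killcount

def prune_unused(wordinfo):
    """Remove records whose killcount is still 0 after a test run.

    Returns the number of records pruned.
    """
    # Materialize the list first so we can delete while iterating safely.
    tonuke = [w for w, r in wordinfo.items() if r.killcount == 0]
    for w in tonuke:
        del wordinfo[w]
    return len(tonuke)

wordinfo = {'free': WordInfo(3), 'the': WordInfo(0), 'viagra': WordInfo(7)}
npruned = prune_unused(wordinfo)
```

As the log message cautions, whether this helps or hurts accuracy would need to be verified by testing on held-out data.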
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** TestDriver.py 13 Sep 2002 16:26:58 -0000 1.1
--- TestDriver.py 13 Sep 2002 17:49:02 -0000 1.2
***************
*** 12,15 ****
--- 12,17 ----
# test(ham, spam)
# # Display stats against all runs on this classifier variant.
+ # # This also saves the trained classifier, if desired (option
+ # # save_trained_pickles).
# finishtest()
# # Display stats against all runs.
***************
*** 86,123 ****
self.global_ham_hist = Hist(options.nbuckets)
self.global_spam_hist = Hist(options.nbuckets)
! self.ntimes_train_called = 0
def train(self, ham, spam):
! self.classifier = classifier.GrahamBayes()
! t = self.tester = Tester.Test(self.classifier)
print "Training on", ham, "&", spam, "...",
t.train(ham, spam)
! print t.nham, "hams &", t.nspam, "spams"
! self.orig_nham = t.nham
! self.orig_nspam = t.nspam
self.trained_ham_hist = Hist(options.nbuckets)
self.trained_spam_hist = Hist(options.nbuckets)
- self.ntimes_train_called += 1
- if options.save_trained_pickles:
- fname = "%s%d.pik" % (options.pickle_basename,
- self.ntimes_train_called)
- print " saving pickle to", fname
- fp = file(fname, 'wb')
- pickle.dump(self.classifier, fp, 1)
- fp.close()
-
def forget(self, ham, spam):
! c = self.classifier
! t = self.tester
! nham, nspam = self.orig_nham, self.orig_nspam
! t.set_classifier(c.copy(), nham, nspam)
print "Forgetting", ham, "&", spam, "...",
! t.untrain(ham, spam)
! print nham - t.nham, "hams &", nspam - t.nspam, "spams"
self.trained_ham_hist = Hist(options.nbuckets)
self.trained_spam_hist = Hist(options.nbuckets)
--- 88,118 ----
self.global_ham_hist = Hist(options.nbuckets)
self.global_spam_hist = Hist(options.nbuckets)
! self.ntimes_finishtest_called = 0
def train(self, ham, spam):
! c = self.classifier = classifier.GrahamBayes()
! t = self.tester = Tester.Test(c)
print "Training on", ham, "&", spam, "...",
t.train(ham, spam)
! print c.nham, "hams &", c.nspam, "spams"
self.trained_ham_hist = Hist(options.nbuckets)
self.trained_spam_hist = Hist(options.nbuckets)
def forget(self, ham, spam):
! import copy
print "Forgetting", ham, "&", spam, "...",
! c = self.classifier
! nham, nspam = c.nham, c.nspam
! c = copy.deepcopy(c)
! t.set_classifier(c)
!
! self.tester.untrain(ham, spam)
! print nham - c.nham, "hams &", nspam - c.nspam, "spams"
+ self.global_ham_hist += self.trained_ham_hist
+ self.global_spam_hist += self.trained_spam_hist
self.trained_ham_hist = Hist(options.nbuckets)
self.trained_spam_hist = Hist(options.nbuckets)
***************
*** 129,132 ****
--- 124,136 ----
self.global_ham_hist += self.trained_ham_hist
self.global_spam_hist += self.trained_spam_hist
+
+ self.ntimes_finishtest_called += 1
+ if options.save_trained_pickles:
+ fname = "%s%d.pik" % (options.pickle_basename,
+ self.ntimes_finishtest_called)
+ print " saving pickle to", fname
+ fp = file(fname, 'wb')
+ pickle.dump(self.classifier, fp, 1)
+ fp.close()
def alldone(self):
Index: Tester.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Tester.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** Tester.py 13 Sep 2002 16:26:58 -0000 1.2
--- Tester.py 13 Sep 2002 17:49:02 -0000 1.3
***************
*** 20,33 ****
def __init__(self, classifier):
! self.set_classifier(classifier, 0, 0)
self.reset_test_results()
! # Tell the tester which classifier to use, and how many ham and spam it's
! # been trained on.
! def set_classifier(self, classifier, nham, nspam):
self.classifier = classifier
- # The number of ham and spam instances in the training data.
- self.nham = nham
- self.nspam = nspam
def reset_test_results(self):
--- 20,29 ----
def __init__(self, classifier):
! self.set_classifier(classifier)
self.reset_test_results()
! # Tell the tester which classifier to use.
! def set_classifier(self, classifier):
self.classifier = classifier
def reset_test_results(self):
***************
*** 53,61 ****
for example in hamstream:
learn(example, False, False)
- self.nham += 1
if spamstream is not None:
for example in spamstream:
learn(example, True, False)
- self.nspam += 1
self.classifier.update_probabilities()
--- 49,55 ----
***************
*** 68,76 ****
for example in hamstream:
unlearn(example, False, False)
- self.nham -= 1
if spamstream is not None:
for example in spamstream:
unlearn(example, True, False)
- self.nspam -= 1
self.classifier.update_probabilities()
--- 62,68 ----
From tim_one@users.sourceforge.net Fri Sep 13 19:48:44 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 11:48:44 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.24,1.25
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv27680
Modified Files:
timtest.py
Log Message:
Checked in a temp change by mistake.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** timtest.py 13 Sep 2002 16:26:58 -0000 1.24
--- timtest.py 13 Sep 2002 18:48:42 -0000 1.25
***************
*** 68,72 ****
yield Msg(directory, fname)
! def produce(self):
import random
directory = self.directory
--- 68,72 ----
yield Msg(directory, fname)
! def xproduce(self):
import random
directory = self.directory
From tim_one@users.sourceforge.net Fri Sep 13 20:33:06 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 12:33:06 -0700
Subject: [Spambayes-checkins] spambayes timcv.py,NONE,1.1 README.txt,1.15,1.16
TestDriver.py,1.2,1.3 classifier.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8875
Modified Files:
README.txt TestDriver.py classifier.py
Added Files:
timcv.py
Log Message:
timcv may or may not be a working N-fold cross validating test driver.
If it's not, it's getting close. This turned up a few bugs in
other places, primarily that GrahamBayes._remove_msg() didn't delete
a word record if the spam and ham counts both fell to 0. It's a
subtle invariant of the scheme that at least one of those counts is
non-zero.
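The cross-validation control flow timcv.py is driving toward can be sketched like so. `DriverStub` is a hypothetical stand-in for TestDriver.Driver that merely records the call sequence; the point is the train-once-then-forget-one-fold-per-run structure, not the real training work.

```python
# Hedged sketch of the N-fold cross-validation flow: train once on all the
# data, then for each fold untrain ("forget") that fold and predict against
# it.  DriverStub is a hypothetical stand-in for TestDriver.Driver.

class DriverStub:
    def __init__(self):
        self.calls = []
    def train(self, ham, spam):   self.calls.append(('train', ham, spam))
    def forget(self, ham, spam):  self.calls.append(('forget', ham, spam))
    def test(self, ham, spam):    self.calls.append(('test', ham, spam))
    def finishtest(self):         self.calls.append(('finishtest',))
    def alldone(self):            self.calls.append(('alldone',))

def drive(nsets, d):
    hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets + 1)]
    spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets + 1)]
    d.train(hamdirs, spamdirs)            # train on all the data once
    for i in range(nsets):                # then leave one pair out per run
        d.forget([hamdirs[i]], [spamdirs[i]])
        d.test([hamdirs[i]], [spamdirs[i]])
        d.finishtest()
    d.alldone()

d = DriverStub()
drive(3, d)
```

The efficiency trick is in forget(): untraining just one fold's messages is much cheaper than retraining from scratch on the other N-1 folds each run.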
--- NEW FILE: timcv.py ---
#! /usr/bin/env python
# At the moment, this requires Python 2.3 from CVS (heapq, Set, enumerate).
# A driver for N-fold cross validation.
"""Usage: %(program)s [-h] -n nsets
Where:
-h
Show usage and exit.
-n int
Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
This is required.
In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""
import os
import sys
from Options import options
from tokenizer import tokenize
from TestDriver import Driver
program = sys.argv[0]
def usage(code, msg=''):
"""Print usage message and sys.exit(code)."""
if msg:
print >> sys.stderr, msg
print >> sys.stderr
print >> sys.stderr, __doc__ % globals()
sys.exit(code)
class Msg(object):
def __init__(self, dir, name):
path = dir + "/" + name
self.tag = path
f = open(path, 'rb')
guts = f.read()
f.close()
self.guts = guts
def __iter__(self):
return tokenize(self.guts)
def __hash__(self):
return hash(self.tag)
def __eq__(self, other):
return self.tag == other.tag
def __str__(self):
return self.guts
class MsgStream(object):
def __init__(self, tag, directories):
self.tag = tag
self.directories = directories
def __str__(self):
return self.tag
def produce(self):
for directory in self.directories:
for fname in os.listdir(directory):
yield Msg(directory, fname)
def __iter__(self):
return self.produce()
def drive(nsets):
print options.display()
hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets+1)]
spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)]
d = Driver()
# Train it on all the data.
d.train(MsgStream("%s-%d" % (hamdirs[0], nsets), hamdirs),
MsgStream("%s-%d" % (spamdirs[0], nsets), spamdirs))
# Now run nsets times, removing one pair per run.
for i in range(nsets):
h = hamdirs[:]
s = spamdirs[:]
hexclude = h.pop(i)
sexclude = s.pop(i)
d.forget(MsgStream(hexclude, [hexclude]),
MsgStream(sexclude, [sexclude]))
d.test(MsgStream("Data/Ham/*-Set%d" % (i+1), h),
MsgStream("Data/Spam/*-Set%d" % (i+1), s))
d.finishtest()
d.alldone()
if __name__ == "__main__":
import getopt
try:
opts, args = getopt.getopt(sys.argv[1:], 'hn:')
except getopt.error, msg:
usage(1, msg)
nsets = None
for opt, arg in opts:
if opt == '-h':
usage(0)
elif opt == '-n':
nsets = int(arg)
if args:
usage(1, "Positional arguments not supported")
if nsets is None:
usage(1, "-n is required")
drive(nsets)
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** README.txt 13 Sep 2002 16:26:58 -0000 1.15
--- README.txt 13 Sep 2002 19:33:04 -0000 1.16
***************
*** 73,76 ****
--- 73,81 ----
mboxtest setup.
+ timcv.py
+ A first stab at an N-fold cross-validating test driver. Assumes
+ "a standard" data directory setup (see below).
+ Subject to arbitrary change.
+
Test Utilities
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** TestDriver.py 13 Sep 2002 17:49:02 -0000 1.2
--- TestDriver.py 13 Sep 2002 19:33:04 -0000 1.3
***************
*** 104,112 ****
import copy
! print "Forgetting", ham, "&", spam, "...",
c = self.classifier
nham, nspam = c.nham, c.nspam
c = copy.deepcopy(c)
! t.set_classifier(c)
self.tester.untrain(ham, spam)
--- 104,112 ----
import copy
! print " forgetting", ham, "&", spam, "...",
c = self.classifier
nham, nspam = c.nham, c.nspam
c = copy.deepcopy(c)
! self.tester.set_classifier(c)
self.tester.untrain(ham, spam)
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** classifier.py 13 Sep 2002 16:55:17 -0000 1.7
--- classifier.py 13 Sep 2002 19:33:04 -0000 1.8
***************
*** 623,624 ****
--- 623,626 ----
if record.hamcount > 0:
record.hamcount -= 1
+ if record.hamcount == 0 == record.spamcount:
+ del self.wordinfo[word]
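The invariant the fix above enforces (a word record must have at least one non-zero count, so untraining that drives both counts to zero must delete the record) can be illustrated in isolation. `Record` is a minimal hypothetical stand-in, not the real WordInfo:

```python
# Minimal illustration of the _remove_msg() invariant: when untraining
# drops both the ham and spam counts of a word to zero, the record must
# be deleted.  Record is a hypothetical stand-in for classifier.WordInfo.

class Record:
    def __init__(self):
        self.hamcount = self.spamcount = 0

def remove_word(wordinfo, word, is_spam):
    record = wordinfo[word]
    if is_spam:
        if record.spamcount > 0:
            record.spamcount -= 1
    else:
        if record.hamcount > 0:
            record.hamcount -= 1
    # The subtle invariant: no record may survive with both counts zero.
    if record.hamcount == 0 == record.spamcount:
        del wordinfo[word]

wordinfo = {'foo': Record()}
wordinfo['foo'].hamcount = 1
remove_word(wordinfo, 'foo', is_spam=False)
```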
From tim_one@users.sourceforge.net Fri Sep 13 20:46:43 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 12:46:43 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13430
Modified Files:
classifier.py
Log Message:
Class WordInfo: Noted a subtle invariant in a comment.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** classifier.py 13 Sep 2002 19:33:04 -0000 1.8
--- classifier.py 13 Sep 2002 19:46:41 -0000 1.9
***************
*** 185,188 ****
--- 185,192 ----
'spamprob', # prob(spam | msg contains this word)
)
+
+ # Invariant: For use in a classifier database, at least one of
+ # spamcount and hamcount must be non-zero.
+ #
# (*)atime is the last access time, a UTC time.time() value. It's the
# most recent time this word was used by scoring (i.e., by spamprob(),
From tim_one@users.sourceforge.net Fri Sep 13 20:59:37 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 12:59:37 -0700
Subject: [Spambayes-checkins] spambayes timcv.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv17794
Modified Files:
timcv.py
Log Message:
Msg.__init__: tiny simplification.
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** timcv.py 13 Sep 2002 19:33:04 -0000 1.1
--- timcv.py 13 Sep 2002 19:59:35 -0000 1.2
***************
*** 39,45 ****
self.tag = path
f = open(path, 'rb')
! guts = f.read()
f.close()
- self.guts = guts
def __iter__(self):
--- 39,44 ----
self.tag = path
f = open(path, 'rb')
! self.guts = f.read()
f.close()
def __iter__(self):
From tim_one@users.sourceforge.net Fri Sep 13 21:35:40 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 13:35:40 -0700
Subject: [Spambayes-checkins] spambayes timcv.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29041
Modified Files:
timcv.py
Log Message:
Fixed some major brainos, but this is still hosed. Worse, thanks at
least to giant pickle memos and giant deepcopy memos, running just a
3-fold c-v on 3 of my test directory pairs takes more than 3X the memory
of running the 5x5 test grid over all 5 directory pairs. So this isn't
at all usable yet. Luckily, it's not working right anyway.
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** timcv.py 13 Sep 2002 19:59:35 -0000 1.2
--- timcv.py 13 Sep 2002 20:35:37 -0000 1.3
***************
*** 83,94 ****
# Now run nsets times, removing one pair per run.
for i in range(nsets):
! h = hamdirs[:]
! s = spamdirs[:]
! hexclude = h.pop(i)
! sexclude = s.pop(i)
! d.forget(MsgStream(hexclude, [hexclude]),
! MsgStream(sexclude, [sexclude]))
! d.test(MsgStream("Data/Ham/*-Set%d" % (i+1), h),
! MsgStream("Data/Spam/*-Set%d" % (i+1), s))
d.finishtest()
d.alldone()
--- 83,92 ----
# Now run nsets times, removing one pair per run.
for i in range(nsets):
! h = hamdirs[i]
! s = spamdirs[i]
! hamstream = MsgStream(h, [h])
! spamstream = MsgStream(s, [s])
! d.forget(hamstream, spamstream)
! d.test(hamstream, spamstream)
d.finishtest()
d.alldone()
From rubiconx@users.sourceforge.net Fri Sep 13 22:27:28 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Fri, 13 Sep 2002 14:27:28 -0700
Subject: [Spambayes-checkins] spambayes cdbhammie.py,1.2,NONE
cdbwrap.py,1.1,NONE
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15963
Removed Files:
cdbhammie.py cdbwrap.py
Log Message:
Taking out the cdb stuff, as I'm not going to pursue it further.
It's in the attic now if anyone wants to mess with it later.
--- cdbhammie.py DELETED ---
--- cdbwrap.py DELETED ---
From tim_one@users.sourceforge.net Sat Sep 14 01:03:53 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 17:03:53 -0700
Subject: [Spambayes-checkins]
spambayes README.txt,1.16,1.17 TestDriver.py,1.3,1.4 cmp.py,1.7,1.8
mboxtest.py,1.4,1.5 rates.py,1.3,1.4 timcv.py,1.3,1.4 timtest.py,1.25,1.26
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv23741
Modified Files:
README.txt TestDriver.py cmp.py mboxtest.py rates.py timcv.py
timtest.py
Log Message:
Lots of small changes to support N-fold cross validation properly.
timcv.py now does this. The pragmatic problem with giant pickle memos
and giant deepcopy memos is gone -- instead the test driver has to
take more care to train and untrain appropriate pieces explicitly.
This is actually easy (see timcv).
TestDriver.Driver now prints statistics with a recognizable pattern
at the start of the line, so that rates.py doesn't feel so arbitrary
anymore. rates.py and cmp.py were changed accordingly. rates.py now
puts a lot more stuff in the summary, including accounts of how many ham
and spam were trained on, and predicted against, in each test run.
Driver() clients have to explicitly tell Driver when they want a new
classifier now; I changed timtest and mboxtest to do that, but am
not set up to exercise mboxtest.
Driver, rates and cmp no longer make assumptions about the *kind* of
test being run, and work equally well for, e.g., NxN grids or N-fold
c-v.
rates.py also computes the average f-p and f-n rates now, and cmp.py
displays before-and-after values for those too. Average rates are
intended to be used when doing N-fold c-v; they make less sense
for an NxN test grid.
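The train/untrain bookkeeping this checkin describes can be sketched as follows. This is a toy illustration, not the real TestDriver/classifier API: `Classifier` here just counts messages, standing in for the word-statistics tracking the real one does. The point is that each fold is forgotten, predicted against, then added back, so only one classifier is ever built and no deepcopy is needed.

```python
# Toy sketch of the N-fold c-v loop: train on all folds but the first,
# then for each fold i: untrain it, predict against it, retrain it.
class Classifier:
    def __init__(self):
        self.nham = self.nspam = 0

    def train(self, ham, spam):
        self.nham += len(ham)
        self.nspam += len(spam)

    def untrain(self, ham, spam):
        self.nham -= len(ham)
        self.nspam -= len(spam)

def cross_validate(ham_sets, spam_sets):
    c = Classifier()
    # Train on all sets except the first.
    for h, s in zip(ham_sets[1:], spam_sets[1:]):
        c.train(h, s)
    trained_sizes = []
    nsets = len(ham_sets)
    for i in range(nsets):
        if i > 0:
            c.untrain(ham_sets[i], spam_sets[i])  # forget fold i
        # ... predict fold i against the classifier here ...
        trained_sizes.append((c.nham, c.nspam))
        if i < nsets - 1:
            c.train(ham_sets[i], spam_sets[i])    # add fold i back
    return trained_sizes
```

Every fold ends up scored against exactly nsets-1 folds' worth of training data, which is what makes the average f-p and f-n rates meaningful for c-v runs.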
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** README.txt 13 Sep 2002 19:33:04 -0000 1.16
--- README.txt 14 Sep 2002 00:03:51 -0000 1.17
***************
*** 14,18 ****
later -- as is, the false positive rate has gotten too small to measure
reliably across test sets with 4000 hams + 2750 spams, but the false
! negative rate is still over 1%.
The code here depends in various ways on the latest Python from CVS
--- 14,19 ----
later -- as is, the false positive rate has gotten too small to measure
reliably across test sets with 4000 hams + 2750 spams, but the false
! negative rate is still over 1%. Later: the f-n rate has also gotten
! too small to measure reliably across that much training data.
The code here depends in various ways on the latest Python from CVS
***************
*** 47,56 ****
TestDriver.py
! A higher layer of test helpers, building on Tester above. It's
! quite usable as-is for building simple test drivers, and more
! complicated ones up to NxN test grids. It's in the process of being
! extended to allow easy building of N-way cross validation drivers
! (the trick to that is doing so efficiently). See also rates.py
! and cmp.py below.
--- 48,55 ----
TestDriver.py
! A flexible higher layer of test helpers, building on Tester above.
! For example, it's usable for building simple test drivers, NxN test
! grids, and N-fold cross validation drivers. See also rates.py and
! cmp.py below.
***************
*** 71,75 ****
A concrete test driver like mboxtest.py, but working with "a
standard" test data setup (see below) rather than the specialized
! mboxtest setup.
timcv.py
--- 70,74 ----
A concrete test driver like mboxtest.py, but working with "a
standard" test data setup (see below) rather than the specialized
! mboxtest setup. This runs an NxN test grid, skipping the diagonal.
timcv.py
***************
*** 82,92 ****
==============
rates.py
! Scans the output (so far) from timtest.py, and captures summary
! statistics.
cmp.py
Given two summary files produced by rates.py, displays an account
of all the f-p and f-n rates side-by-side, along with who won which
! (etc), and the change in total # of f-ps and f-n.
--- 81,92 ----
==============
rates.py
! Scans the output (so far) produced by TestDriver.Driver(), and captures
! summary statistics.
cmp.py
Given two summary files produced by rates.py, displays an account
of all the f-p and f-n rates side-by-side, along with who won which
! (etc), the change in total # of unique false positives and negatives,
! and the change in average f-p and f-n rates.
***************
*** 127,136 ****
Standard Test Data Setup
========================
- [Caution: I'm going to switch this to support N-way cross validation,
- instead of an NxN test grid. The only effect on the directory structure
- here is that you'll want more directories with fewer msgs in each
- (splitting the data at random into 10 pairs should work very well).
- ]
-
Barry gave me mboxes, but the spam corpus I got off the web had one spam
per file, and it only took two days of extreme pain to realize that one msg
--- 127,130 ----
***************
*** 142,145 ****
--- 136,142 ----
The directory structure under my spambayes directory looks like so:
+ [But due to a better testing infrastructure, I'm going to spread this
+ across 20 subdirectories under Spam and under Ham, and use groups
+ of 10 for 10-fold cross validation]
Data/
***************
*** 159,167 ****
If you use the same names and structure, huge mounds of the tedious testing
! code will work as-is. The more Set directories the merrier, although
! you'll hit a point of diminishing returns if you exceed 10. The "reservoir"
! directory contains a few thousand other random hams. When a ham is found
! that's really spam, I delete it, and then the rebal.py utility moves in a
! message at random from the reservoir to replace it. If I had it to do over
again, I think I'd move such spam into a Spam set (chosen at random),
instead of deleting it.
--- 156,164 ----
If you use the same names and structure, huge mounds of the tedious testing
! code will work as-is. The more Set directories the merrier, although you
! want at least a few hundred messages in each one. The "reservoir" directory
! contains a few thousand other random hams. When a ham is found that's
! really spam, I delete it, and then the rebal.py utility moves in a message
! at random from the reservoir to replace it. If I had it to do over
again, I think I'd move such spam into a Spam set (chosen at random),
instead of deleting it.
***************
*** 172,176 ****
! The sets are grouped into 5 pairs in the obvious way: Spam/Set1 with
Ham/Set1, and so on. For each such pair, timtest trains a classifier on
that pair, then runs predictions on each of the other 4 pairs. In effect,
--- 169,173 ----
! The sets are grouped into pairs in the obvious way: Spam/Set1 with
Ham/Set1, and so on. For each such pair, timtest trains a classifier on
that pair, then runs predictions on each of the other 4 pairs. In effect,
***************
*** 178,179 ****
--- 175,186 ----
to avoid predicting against the same set trained on, except that it
takes more time and seems the least interesting thing to try.
+
+ Later, support for N-fold cross validation testing was added, which allows
+ more accurate measurement of error rates with smaller amounts of training
+ data. That's recommended now.
+
+ CAUTION: The partitioning of your corpora across directories should
+ be random. If it isn't, bias creeps into the test results. This is
+ usually screamingly obvious under the NxN grid method (rates vary by a
+ factor of 10 or more across training sets, and even within runs against
+ a single training set), but harder to spot using N-fold c-v.
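A minimal way to get the random partitioning the CAUTION above asks for is to shuffle once with a fixed seed and deal the filenames round-robin into N sets. This is a hypothetical helper for illustration, not part of rebal.py or the checked-in drivers.

```python
import random

def partition(fnames, nsets, seed=12345):
    # Shuffle deterministically, then deal round-robin so the sets stay
    # balanced in size. The seed value is arbitrary; any fixed value
    # gives a reproducible split.
    fnames = sorted(fnames)
    random.Random(seed).shuffle(fnames)
    sets = [[] for _ in range(nsets)]
    for i, fname in enumerate(fnames):
        sets[i % nsets].append(fname)
    return sets
```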
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** TestDriver.py 13 Sep 2002 19:33:04 -0000 1.3
--- TestDriver.py 14 Sep 2002 00:03:51 -0000 1.4
***************
*** 1,11 ****
# Loop:
! # # Set up a new base classifier for testing.
! # train(ham, spam)
# # Run tests against (possibly variants of) this classifier.
# Loop:
! # Optional:
! # # Forget training for some subset of ham and spam. This
! # # works against the base classifier trained at the start.
! # forget(ham, spam)
# # Predict against other data.
# Loop:
--- 1,15 ----
# Loop:
! # Optional:
! # # Set up a new base classifier for testing.
! # new_classifier()
# # Run tests against (possibly variants of) this classifier.
# Loop:
! # Loop:
! # Optional:
! # # train on more ham and spam
! # train(ham, spam)
! # Optional:
! # # Forget training for some subset of ham and spam.
! # untrain(ham, spam)
# # Predict against other data.
# Loop:
***************
*** 89,121 ****
self.global_spam_hist = Hist(options.nbuckets)
self.ntimes_finishtest_called = 0
! def train(self, ham, spam):
c = self.classifier = classifier.GrahamBayes()
! t = self.tester = Tester.Test(c)
!
! print "Training on", ham, "&", spam, "...",
! t.train(ham, spam)
! print c.nham, "hams &", c.nspam, "spams"
!
self.trained_ham_hist = Hist(options.nbuckets)
self.trained_spam_hist = Hist(options.nbuckets)
! def forget(self, ham, spam):
! import copy
!
! print " forgetting", ham, "&", spam, "...",
c = self.classifier
nham, nspam = c.nham, c.nspam
! c = copy.deepcopy(c)
! self.tester.set_classifier(c)
self.tester.untrain(ham, spam)
print nham - c.nham, "hams &", nspam - c.nspam, "spams"
- self.global_ham_hist += self.trained_ham_hist
- self.global_spam_hist += self.trained_spam_hist
- self.trained_ham_hist = Hist(options.nbuckets)
- self.trained_spam_hist = Hist(options.nbuckets)
-
def finishtest(self):
if options.show_histograms:
--- 93,118 ----
self.global_spam_hist = Hist(options.nbuckets)
self.ntimes_finishtest_called = 0
+ self.new_classifier()
! def new_classifier(self):
c = self.classifier = classifier.GrahamBayes()
! self.tester = Tester.Test(c)
self.trained_ham_hist = Hist(options.nbuckets)
self.trained_spam_hist = Hist(options.nbuckets)
! def train(self, ham, spam):
! print "-> Training on", ham, "&", spam, "...",
c = self.classifier
nham, nspam = c.nham, c.nspam
! self.tester.train(ham, spam)
! print c.nham - nham, "hams &", c.nspam - nspam, "spams"
+ def untrain(self, ham, spam):
+ print "-> Forgetting", ham, "&", spam, "...",
+ c = self.classifier
+ nham, nspam = c.nham, c.nspam
self.tester.untrain(ham, spam)
print nham - c.nham, "hams &", nspam - c.nspam, "spams"
def finishtest(self):
if options.show_histograms:
***************
*** 124,127 ****
--- 121,126 ----
self.global_ham_hist += self.trained_ham_hist
self.global_spam_hist += self.trained_spam_hist
+ self.trained_ham_hist = Hist(options.nbuckets)
+ self.trained_spam_hist = Hist(options.nbuckets)
self.ntimes_finishtest_called += 1
***************
*** 163,177 ****
t.reset_test_results()
! print " testing against", ham, "&", spam, "...",
t.predict(spam, True, new_spam)
t.predict(ham, False, new_ham)
! print t.nham_tested, "hams &", t.nspam_tested, "spams"
! print " false positive:", t.false_positive_rate()
! print " false negative:", t.false_negative_rate()
newfpos = Set(t.false_positives()) - self.falsepos
self.falsepos |= newfpos
! print " new false positives:", [e.tag for e in newfpos]
if not options.show_false_positives:
newfpos = ()
--- 162,179 ----
t.reset_test_results()
! print "-> Predicting", ham, "&", spam, "..."
t.predict(spam, True, new_spam)
t.predict(ham, False, new_ham)
! print "-> tested", t.nham_tested, "hams &", t.nspam_tested, \
! "spams against", c.nham, "hams &", c.nspam, "spams"
! print "-> false positive %:", t.false_positive_rate()
! print "-> false negative %:", t.false_negative_rate()
newfpos = Set(t.false_positives()) - self.falsepos
self.falsepos |= newfpos
! print "-> %d new false positives" % len(newfpos)
! if newfpos:
! print " new fp:", [e.tag for e in newfpos]
if not options.show_false_positives:
newfpos = ()
***************
*** 183,187 ****
newfneg = Set(t.false_negatives()) - self.falseneg
self.falseneg |= newfneg
! print " new false negatives:", [e.tag for e in newfneg]
if not options.show_false_negatives:
newfneg = ()
--- 185,191 ----
newfneg = Set(t.false_negatives()) - self.falseneg
self.falseneg |= newfneg
! print "-> %d new false negatives" % len(newfneg)
! if newfneg:
! print " new fn:", [e.tag for e in newfneg]
if not options.show_false_negatives:
newfneg = ()
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** cmp.py 12 Sep 2002 19:35:14 -0000 1.7
--- cmp.py 14 Sep 2002 00:03:51 -0000 1.8
***************
*** 16,39 ****
# list of all f-n rates,
# total f-p,
! # total f-n)
# from summary file f.
def suck(f):
fns = []
fps = []
while 1:
! line = f.readline()
if line.startswith('total'):
break
! if not line.startswith('Training'):
! # A line with an f-p rate and an f-n rate.
! p, n = map(float, line.split())
! fps.append(p)
! fns.append(n)
! # "total false pos 8 0.04"
! # "total false neg 249 1.81090909091"
! fptot = int(line.split()[-2])
! fntot = int(f.readline().split()[-2])
! return fps, fns, fptot, fntot
def tag(p1, p2):
--- 16,49 ----
# list of all f-n rates,
# total f-p,
! # total f-n,
! # average f-p rate,
! # average f-n rate)
# from summary file f.
def suck(f):
fns = []
fps = []
+ get = f.readline
while 1:
! line = get()
! if line.startswith('-> tested'):
! print line,
! if line.startswith('-> '):
! continue
if line.startswith('total'):
break
! # A line with an f-p rate and an f-n rate.
! p, n = map(float, line.split())
! fps.append(p)
! fns.append(n)
! # "total unique false pos 0"
! # "total unique false neg 0"
! # "average fp % 0.0"
! # "average fn % 0.0"
! fptot = int(line.split()[-1])
! fntot = int(get().split()[-1])
! fpmean = float(get().split()[-1])
! fnmean = float(get().split()[-1])
! return fps, fns, fptot, fntot, fpmean, fnmean
def tag(p1, p2):
***************
*** 60,72 ****
print
- fp1, fn1, fptot1, fntot1 = suck(file(f1n + '.txt'))
- fp2, fn2, fptot2, fntot2 = suck(file(f2n + '.txt'))
print f1n, '->', f2n
print
print "false positive percentages"
dump(fp1, fp2)
print "total unique fp went from", fptot1, "to", fptot2, tag(fptot1, fptot2)
print
--- 70,84 ----
print
print f1n, '->', f2n
+ fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n + '.txt'))
+ fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n + '.txt'))
+
print
print "false positive percentages"
dump(fp1, fp2)
print "total unique fp went from", fptot1, "to", fptot2, tag(fptot1, fptot2)
+ print "mean fp % went from", fpmean1, "to", fpmean2, tag(fpmean1, fpmean2)
print
***************
*** 74,75 ****
--- 86,88 ----
dump(fn1, fn2)
print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2)
+ print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2)
Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** mboxtest.py 13 Sep 2002 16:26:58 -0000 1.4
--- mboxtest.py 14 Sep 2002 00:03:51 -0000 1.5
***************
*** 166,169 ****
--- 166,170 ----
for iham, ispam in testsets:
+ driver.new_classifier()
driver.train(mbox(ham, iham), mbox(spam, ispam))
for ihtest, istest in testsets:
Index: rates.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rates.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** rates.py 12 Sep 2002 19:35:14 -0000 1.3
--- rates.py 14 Sep 2002 00:03:51 -0000 1.4
***************
*** 2,6 ****
"""
! rates.py basename
Assuming that file
--- 2,6 ----
"""
! rates.py basename ...
Assuming that file
***************
*** 19,38 ****
"""
- import re
import sys
"""
! Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
! testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
! false positive: 0.025
! false negative: 1.34545454545
! new false positives: ['Data/Ham/Set2/66645.txt']
"""
- pat1 = re.compile(r'\s*Training on ').match
- pat2 = re.compile(r'\s+false (positive|negative): (.*)').match
- pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match
def doit(basename):
ifile = file(basename + '.txt')
oname = basename + 's.txt'
ofile = file(oname, 'w')
--- 19,38 ----
"""
import sys
"""
! -> Training on Data/Ham/Set2-3 & Data/Spam/Set2-3 ... 8000 hams & 5500 spams
! -> Predicting Data/Ham/Set1 & Data/Spam/Set1 ...
! -> tested 4000 hams & 2750 spams against 8000 hams & 5500 spams
! -> false positive %: 0.025
! -> false negative %: 0.327272727273
! -> 1 new false positives
"""
def doit(basename):
ifile = file(basename + '.txt')
+ interesting = filter(lambda line: line.startswith('-> '), ifile)
+ ifile.close()
+
oname = basename + 's.txt'
ofile = file(oname, 'w')
***************
*** 44,83 ****
print >> ofile, msg
! nfn = nfp = 0
ntrainedham = ntrainedspam = 0
! for line in ifile:
! "Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams"
! m = pat1(line)
! if m:
! dump(line[:-1])
! fields = line.split()
ntrainedham += int(fields[-5])
ntrainedspam += int(fields[-2])
continue
! "false positive: 0.025"
! "false negative: 1.34545454545"
! m = pat2(line)
! if m:
! kind, guts = m.groups()
! guts = float(guts)
if kind == 'positive':
! lastval = guts
else:
! dump(' %7.3f %7.3f' % (lastval, guts))
continue
! "new false positives: ['Data/Ham/Set2/66645.txt']"
! m = pat3(line)
! if m: # note that it doesn't match at all if the list is "[]"
! kind, guts = m.groups()
! n = len(guts.split())
if kind == 'positives':
! nfp += n
else:
! nfn += n
! dump('total false pos', nfp, nfp * 1e2 / ntrainedham)
! dump('total false neg', nfn, nfn * 1e2 / ntrainedspam)
for name in sys.argv[1:]:
--- 44,91 ----
print >> ofile, msg
! ntests = nfn = nfp = 0
! sumfnrate = sumfprate = 0.0
ntrainedham = ntrainedspam = 0
!
! for line in interesting:
! dump(line[:-1])
! fields = line.split()
!
! # 0 1 2 3 4 5 6 -5 -4 -3 -2 -1
! #-> tested 4000 hams & 2750 spams against 8000 hams & 5500 spams
! if line.startswith('-> tested '):
ntrainedham += int(fields[-5])
ntrainedspam += int(fields[-2])
+ ntests += 1
continue
! # 0 1 2 3
! # -> false positive %: 0.025
! # -> false negative %: 0.327272727273
! if line.startswith('-> false '):
! kind = fields[3]
! percent = float(fields[-1])
if kind == 'positive':
! sumfprate += percent
! lastval = percent
else:
! sumfnrate += percent
! dump(' %7.3f %7.3f' % (lastval, percent))
continue
! # 0 1 2 3 4
! # -> 1 new false positives
! if fields[2] == 'new' and fields[3] == 'false':
! kind = fields[-1]
! count = int(fields[1])
if kind == 'positives':
! nfp += count
else:
! nfn += count
! dump('total unique false pos', nfp)
! dump('total unique false neg', nfn)
! dump('average fp %', sumfprate / ntests)
! dump('average fn %', sumfnrate / ntests)
for name in sys.argv[1:]:
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** timcv.py 13 Sep 2002 20:35:37 -0000 1.3
--- timcv.py 14 Sep 2002 00:03:51 -0000 1.4
***************
*** 77,85 ****
d = Driver()
! # Train it on all the data.
! d.train(MsgStream("%s-%d" % (hamdirs[0], nsets), hamdirs),
! MsgStream("%s-%d" % (spamdirs[0], nsets), spamdirs))
! # Now run nsets times, removing one pair per run.
for i in range(nsets):
h = hamdirs[i]
--- 77,85 ----
d = Driver()
! # Train it on all sets except the first.
! d.train(MsgStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]),
! MsgStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:]))
! # Now run nsets times, predicting pair i against all except pair i.
for i in range(nsets):
h = hamdirs[i]
***************
*** 87,93 ****
hamstream = MsgStream(h, [h])
spamstream = MsgStream(s, [s])
! d.forget(hamstream, spamstream)
d.test(hamstream, spamstream)
d.finishtest()
d.alldone()
--- 87,103 ----
hamstream = MsgStream(h, [h])
spamstream = MsgStream(s, [s])
!
! if i > 0:
! # Forget this set.
! d.untrain(hamstream, spamstream)
!
! # Predict this set.
d.test(hamstream, spamstream)
d.finishtest()
+
+ if i < nsets - 1:
+ # Add this set back in.
+ d.train(hamstream, spamstream)
+
d.alldone()
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** timtest.py 13 Sep 2002 18:48:42 -0000 1.25
--- timtest.py 14 Sep 2002 00:03:51 -0000 1.26
***************
*** 74,78 ****
random.seed(hash(directory))
random.shuffle(all)
! for fname in all[-1500:-1000:]:
yield Msg(directory, fname)
--- 74,78 ----
random.seed(hash(directory))
random.shuffle(all)
! for fname in all[-1500:-1300:]:
yield Msg(directory, fname)
***************
*** 89,92 ****
--- 89,93 ----
d = Driver()
for spamdir, hamdir in spamhamdirs:
+ d.new_classifier()
d.train(MsgStream(hamdir), MsgStream(spamdir))
for sd2, hd2 in spamhamdirs:
From tim_one@users.sourceforge.net Sat Sep 14 04:32:49 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 13 Sep 2002 20:32:49 -0700
Subject: [Spambayes-checkins] spambayes rebal.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv832
Modified Files:
rebal.py
Log Message:
migrate(): If there's a file extension, preserve it instead of blowing
up (files w/o extensions are a PITA on Windows). Also replaced the
renaming strategy w/ a randomized scheme that should run much faster.
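The new scheme amounts to the following sketch (a Python 3 rendering, not the checked-in code verbatim): on a name clash, retry with a random numeric basename, but keep the original extension.

```python
import os
import random

def migrate(path, dest, verbose=False):
    """Rename path into dest, avoiding name clashes.

    On a clash, retry with a random numeric basename while preserving
    the file extension (extensionless files are a pain on Windows).
    """
    base = os.path.basename(path)
    ext = os.path.splitext(base)[1]
    out = os.path.join(dest, base)
    while os.path.exists(out):
        out = os.path.join(dest, "%d%s" % (random.randrange(100000000), ext))
    if verbose:
        print("moving", path, "to", out)
    os.rename(path, out)
```

Unlike the old strategy, this never lists the destination directory, so the cost of a move stays constant no matter how many collisions have piled up.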
Index: rebal.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rebal.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** rebal.py 12 Sep 2002 19:33:54 -0000 1.2
--- rebal.py 14 Sep 2002 03:32:47 -0000 1.3
***************
*** 18,22 ****
must already exist.
! Example:
rebal.py -r reservoir -s Set -n 300
--- 18,22 ----
must already exist.
! Example:
rebal.py -r reservoir -s Set -n 300
***************
*** 64,83 ****
-Q - be quiet and don't confirm moves
""" % globals()
!
def migrate(f, dir, verbose):
"""rename f into dir, making sure to avoid name clashes."""
base = os.path.split(f)[-1]
! if os.path.exists(os.path.join(dir,base)):
! # this path can get slow if we have a lot of name collisions
! # but we should rarely encounter that case (so he says smugly)
! reslist = [int(n) for n in os.listdir(dir)]
! reslist.sort()
! out = os.path.join(dir, "%d"%(reslist[-1]+1))
! else:
! out = os.path.join(dir, base)
if verbose:
print "moving", f, "to", out
os.rename(f, out)
!
def main(args):
nperdir = NPERDIR
--- 64,80 ----
-Q - be quiet and don't confirm moves
""" % globals()
!
def migrate(f, dir, verbose):
"""rename f into dir, making sure to avoid name clashes."""
base = os.path.split(f)[-1]
! out = os.path.join(dir, base)
! while os.path.exists(out):
! basename, ext = os.path.splitext(base)
! digits = random.randrange(100000000)
! out = os.path.join(dir, str(digits) + ext)
if verbose:
print "moving", f, "to", out
os.rename(f, out)
!
def main(args):
nperdir = NPERDIR
***************
*** 86,90 ****
verbose = VERBOSE
confirm = CONFIRM
!
try:
opts, args = getopt.getopt(args, "r:s:n:vqcQh")
--- 83,87 ----
verbose = VERBOSE
confirm = CONFIRM
!
try:
opts, args = getopt.getopt(args, "r:s:n:vqcQh")
From tim_one@users.sourceforge.net Sat Sep 14 21:08:09 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 14 Sep 2002 13:08:09 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.13,1.14 timcv.py,1.4,1.5 tokenizer.py,1.21,1.22
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11259
Modified Files:
Options.py timcv.py tokenizer.py
Log Message:
New option [Tokenizer]ignore_redundant_html, defaulting to False. This
may change results! Read the comments in tokenizer.py and Options.py.
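The effect of the option can be shown with a small self-contained sketch built on the stdlib email package. This is a simplified rendering of tokenizer.textparts() (the nested multipart/related descent is omitted), not the committed code.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def textparts(msg, ignore_redundant_html=False):
    # Simplified sketch: with the option off, keep every text part;
    # with it on, drop a text/html part whose multipart/alternative
    # sibling already supplied a text/plain version.
    if not ignore_redundant_html:
        return {p for p in msg.walk() if p.get_content_maintype() == 'text'}
    text, redundant = set(), set()
    for part in msg.walk():
        if part.get_content_type() == 'multipart/alternative':
            plain = html = None
            for sub in part.get_payload():
                ctype = sub.get_content_type()
                if ctype == 'text/plain':
                    plain = sub
                elif ctype == 'text/html':
                    html = sub
            if plain is not None:
                text.add(plain)
                if html is not None:
                    redundant.add(html)
            elif html is not None:
                text.add(html)
        elif part.get_content_maintype() == 'text':
            text.add(part)
    return text - redundant

# A typical multipart/alternative message with both versions.
msg = MIMEMultipart('alternative')
msg.attach(MIMEText('check this out', 'plain'))
msg.attach(MIMEText('<b>check this out</b>', 'html'))
```

With ignore_redundant_html off, both the plain and HTML parts are tokenized; with it on, only the text/plain part survives.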
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** Options.py 13 Sep 2002 16:26:58 -0000 1.13
--- Options.py 14 Sep 2002 20:08:07 -0000 1.14
***************
*** 13,21 ****
defaults = """
[Tokenizer]
! # By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags
! # from pure text/html messages. Set to True to retain HTML tags in
! # this case.
retain_pure_html_tags: False
# Generate tokens just counting the number of instances of each kind of
# header line, in a case-sensitive way.
--- 13,33 ----
defaults = """
[Tokenizer]
! # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
! # from pure text/html messages. Set true to retain HTML tags in this
! # case. On the c.l.py corpus, it helps to set this true because any
! # sign of HTML is so despised on tech lists; however, the advantage
! # of setting it true eventually vanishes even there given enough
! # training data. If you set this true, you should almost certainly set
! # ignore_redundant_html true too.
retain_pure_html_tags: False
+ # If true, when a multipart/alternative has both text/plain and text/html
+ # sections, the text/html section is ignored. That's likely a dubious
+ # idea in general, so false is likely a better idea here. In the c.l.py
+ # tests, it helped a lot when retain_pure_html_tags was true (in that case,
+ # keeping the HTML tags in the "redundant" HTML was almost certain to score
+ # the multipart/alternative as spam, regardless of content).
+ ignore_redundant_html: False
+
# Generate tokens just counting the number of instances of each kind of
# header line, in a case-sensitive way.
***************
*** 116,119 ****
--- 128,132 ----
all_options = {
'Tokenizer': {'retain_pure_html_tags': boolean_cracker,
+ 'ignore_redundant_html': boolean_cracker,
'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** timcv.py 14 Sep 2002 00:03:51 -0000 1.4
--- timcv.py 14 Sep 2002 20:08:07 -0000 1.5
***************
*** 67,70 ****
--- 67,80 ----
yield Msg(directory, fname)
+ def xproduce(self):
+ import random
+ keep = 'Spam' in self.directories[0] and 300 or 300
+ for directory in self.directories:
+ all = os.listdir(directory)
+ random.seed(hash(max(all)) ^ 0x12345678) # reproducible across calls
+ random.shuffle(all)
+ for fname in all[:keep]:
+ yield Msg(directory, fname)
+
def __iter__(self):
return self.produce()
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** tokenizer.py 13 Sep 2002 02:40:50 -0000 1.21
--- tokenizer.py 14 Sep 2002 20:08:07 -0000 1.22
***************
*** 470,473 ****
--- 470,482 ----
# of c.l.py traffic. Again, this should be revisited if the f-n rate is
# slashed again.
+ #
+ # Later: As the amount of training data increased, the effect of retaining
+ # HTML tags decreased to insignificance. options.retain_pure_html_tags
+ # was introduced to control this, and it defaults to False.
+ #
+ # Later: The decision to ignore "redundant" HTML is also dubious, since
+ # the text/plain and text/html alternatives may have entirely different
+ # content. options.ignore_redundant_html was introduced to control this,
+ # and it defaults to False.
##############################################################################
***************
*** 492,531 ****
! # Find all the text components of the msg. There's no point decoding
! # binary blobs (like images). If a multipart/alternative has both plain
! # text and HTML versions of a msg, ignore the HTML part: HTML decorations
! # have monster-high spam probabilities, and innocent newbies often post
! # using HTML.
! def textparts(msg):
! text = Set()
! redundant_html = Set()
! for part in msg.walk():
! if part.get_content_type() == 'multipart/alternative':
! # Descend this part of the tree, adding any redundant HTML text
! # part to redundant_html.
! htmlpart = textpart = None
! stack = part.get_payload()[:]
! while stack:
! subpart = stack.pop()
! ctype = subpart.get_content_type()
! if ctype == 'text/plain':
! textpart = subpart
! elif ctype == 'text/html':
! htmlpart = subpart
! elif ctype == 'multipart/related':
! stack.extend(subpart.get_payload())
! if textpart is not None:
! text.add(textpart)
! if htmlpart is not None:
! redundant_html.add(htmlpart)
! elif htmlpart is not None:
! text.add(htmlpart)
! elif part.get_content_maintype() == 'text':
! text.add(part)
! return text - redundant_html
url_re = re.compile(r"""
--- 501,548 ----
+ # textparts(msg) returns a set containing all the text components of msg.
+ # There's no point decoding binary blobs (like images).
! if options.ignore_redundant_html:
! # If a multipart/alternative has both plain text and HTML versions of a
! # msg, ignore the HTML part: HTML decorations have monster-high spam
! # probabilities, and innocent newbies often post using HTML.
! def textparts(msg):
! text = Set()
! redundant_html = Set()
! for part in msg.walk():
! if part.get_content_type() == 'multipart/alternative':
! # Descend this part of the tree, adding any redundant HTML text
! # part to redundant_html.
! htmlpart = textpart = None
! stack = part.get_payload()[:]
! while stack:
! subpart = stack.pop()
! ctype = subpart.get_content_type()
! if ctype == 'text/plain':
! textpart = subpart
! elif ctype == 'text/html':
! htmlpart = subpart
! elif ctype == 'multipart/related':
! stack.extend(subpart.get_payload())
! if textpart is not None:
! text.add(textpart)
! if htmlpart is not None:
! redundant_html.add(htmlpart)
! elif htmlpart is not None:
! text.add(htmlpart)
! elif part.get_content_maintype() == 'text':
! text.add(part)
! return text - redundant_html
!
! else:
! # Use all text parts. If a text/plain and text/html part happen to
! # have redundant content, so it goes.
! def textparts(msg):
! return Set(filter(lambda part: part.get_content_maintype() == 'text',
! msg.walk()))
url_re = re.compile(r"""
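The redundant-HTML rule above can be sketched in isolation. This is a minimal illustration using Python 3's email API (the checked-in code predates it and uses the old `sets.Set`); the `multipart/related` descent is omitted, and the message content here is invented:

```python
from email.message import EmailMessage

# Build a multipart/alternative msg carrying both a plain and an HTML body.
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("plain body")                             # text/plain part
msg.add_alternative("<p>html body</p>", subtype="html")   # text/html part

def textparts(msg):
    """Return the text parts of msg, dropping HTML that merely duplicates
    a text/plain sibling inside a multipart/alternative."""
    text, redundant_html = set(), set()
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "multipart/alternative":
            subparts = part.get_payload()
            if any(p.get_content_type() == "text/plain" for p in subparts):
                # A plain version exists, so the HTML rendering is redundant.
                redundant_html.update(p for p in subparts
                                      if p.get_content_type() == "text/html")
        elif part.get_content_maintype() == "text":
            text.add(part)
    return text - redundant_html

kept = {p.get_content_type() for p in textparts(msg)}
```

With both versions present, only the text/plain part survives; an HTML-only message would keep its HTML part.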
From tim_one@users.sourceforge.net Sat Sep 14 23:01:46 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 14 Sep 2002 15:01:46 -0700
Subject: [Spambayes-checkins] spambayes timcv.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8230
Modified Files:
timcv.py
Log Message:
Introduced new optional arguments to use only part of the ham and spam
in each set. This helps those with larger corpora to run tests as if
they had less.
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** timcv.py 14 Sep 2002 20:08:07 -0000 1.5
--- timcv.py 14 Sep 2002 22:01:42 -0000 1.6
***************
*** 4,8 ****
# A driver for N-fold cross validation.
! """Usage: %(program)s [-h] -n nsets
Where:
--- 4,8 ----
# A driver for N-fold cross validation.
! """Usage: %(program)s [options] -n nsets
Where:
***************
*** 13,16 ****
--- 13,31 ----
This is required.
+ If you only want to use some of the messages in each set,
+
+ --ham-keep int
+ The maximum number of msgs to use from each Ham set. The msgs are
+ chosen randomly. See also the -s option.
+
+ --spam-keep int
+ The maximum number of msgs to use from each Spam set. The msgs are
+ chosen randomly. See also the -s option.
+
+ -s int
+ A seed for the random number generator. Has no effect unless
+ at least one of {--ham-keep, --spam-keep} is specified. If -s
+ isn't specified, the seed is taken from the current time.
+
In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
***************
*** 19,26 ****
import os
import sys
from Options import options
from tokenizer import tokenize
! from TestDriver import Driver
program = sys.argv[0]
--- 34,46 ----
import os
import sys
+ import random
from Options import options
from tokenizer import tokenize
! import TestDriver
!
! HAMKEEP = None
! SPAMKEEP = None
! SEED = random.randrange(2000000000)
program = sys.argv[0]
***************
*** 35,38 ****
--- 55,60 ----
class Msg(object):
+ __slots__ = 'tag', 'guts'
+
def __init__(self, dir, name):
path = dir + "/" + name
***************
*** 45,48 ****
--- 67,71 ----
return tokenize(self.guts)
+ # Compare msgs by their paths; this is appropriate for sets of msgs.
def __hash__(self):
return hash(self.tag)
***************
*** 55,61 ****
class MsgStream(object):
! def __init__(self, tag, directories):
self.tag = tag
self.directories = directories
def __str__(self):
--- 78,87 ----
class MsgStream(object):
! __slots__ = 'tag', 'directories', 'keep'
!
! def __init__(self, tag, directories, keep=None):
self.tag = tag
self.directories = directories
+ self.keep = keep
def __str__(self):
***************
*** 63,78 ****
def produce(self):
! for directory in self.directories:
! for fname in os.listdir(directory):
! yield Msg(directory, fname)
!
! def xproduce(self):
! import random
! keep = 'Spam' in self.directories[0] and 300 or 300
for directory in self.directories:
all = os.listdir(directory)
! random.seed(hash(max(all)) ^ 0x12345678) # reproducible across calls
random.shuffle(all)
! for fname in all[:keep]:
yield Msg(directory, fname)
--- 89,107 ----
def produce(self):
! if self.keep is None:
! for directory in self.directories:
! for fname in os.listdir(directory):
! yield Msg(directory, fname)
! return
! # We only want part of the msgs. Shuffle each directory list, but
! # in such a way that we'll get the same result each time this is
! # called on the same directory list.
for directory in self.directories:
all = os.listdir(directory)
! random.seed(hash(max(all)) ^ SEED) # reproducible across calls
random.shuffle(all)
! del all[self.keep:]
! all.sort() # seems to speed access on Win98!
! for fname in all:
yield Msg(directory, fname)
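The reproducible-subset trick in produce() above can be sketched on its own. The names and keep count below are made up; note that Python 3 randomizes str hashes per process, so this is stable within a run but only stable across runs if PYTHONHASHSEED is fixed:

```python
import random

def sample_listing(names, keep, seed=0x12345678):
    """Pick `keep` names; the same names on every call with the same input."""
    pool = sorted(names)                 # normalize os.listdir() order
    random.seed(hash(max(pool)) ^ seed)  # reproducible across calls
    random.shuffle(pool)
    del pool[keep:]                      # we only want part of the msgs
    pool.sort()                          # sorted access sped things up on Win98
    return pool

names = ["msg%03d.txt" % i for i in range(50)]
first = sample_listing(names, 10)
second = sample_listing(names, 10)   # identical subset, by construction
```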
***************
*** 80,83 ****
--- 109,120 ----
return self.produce()
+ class HamStream(MsgStream):
+ def __init__(self, tag, directories):
+ MsgStream.__init__(self, tag, directories, HAMKEEP)
+
+ class SpamStream(MsgStream):
+ def __init__(self, tag, directories):
+ MsgStream.__init__(self, tag, directories, SPAMKEEP)
+
def drive(nsets):
print options.display()
***************
*** 86,93 ****
spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)]
! d = Driver()
# Train it on all sets except the first.
! d.train(MsgStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]),
! MsgStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:]))
# Now run nsets times, predicting pair i against all except pair i.
--- 123,130 ----
spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)]
! d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]),
! SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:]))
# Now run nsets times, predicting pair i against all except pair i.
***************
*** 95,100 ****
h = hamdirs[i]
s = spamdirs[i]
! hamstream = MsgStream(h, [h])
! spamstream = MsgStream(s, [s])
if i > 0:
--- 132,137 ----
h = hamdirs[i]
s = spamdirs[i]
! hamstream = HamStream(h, [h])
! spamstream = SpamStream(s, [s])
if i > 0:
***************
*** 112,124 ****
d.alldone()
! if __name__ == "__main__":
import getopt
try:
! opts, args = getopt.getopt(sys.argv[1:], 'hn:')
except getopt.error, msg:
usage(1, msg)
! nsets = None
for opt, arg in opts:
if opt == '-h':
--- 149,163 ----
d.alldone()
! def main():
! global SEED, HAMKEEP, SPAMKEEP
import getopt
try:
! opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
! ['ham-keep=', 'spam-keep='])
except getopt.error, msg:
usage(1, msg)
! nsets = seed = None
for opt, arg in opts:
if opt == '-h':
***************
*** 126,129 ****
--- 165,174 ----
elif opt == '-n':
nsets = int(arg)
+ elif opt == '-s':
+ seed = int(arg)
+ elif opt == '--ham-keep':
+ HAMKEEP = int(arg)
+ elif opt == '--spam-keep':
+ SPAMKEEP = int(arg)
if args:
***************
*** 131,134 ****
--- 176,184 ----
if nsets is None:
usage(1, "-n is required")
+ if seed is not None:
+ SEED = seed
drive(nsets)
+
+ if __name__ == "__main__":
+ main()
From tim_one@users.sourceforge.net Sat Sep 14 23:18:27 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 14 Sep 2002 15:18:27 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.17,1.18
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12179
Modified Files:
README.txt
Log Message:
Various comment updates.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** README.txt 14 Sep 2002 00:03:51 -0000 1.17
--- README.txt 14 Sep 2002 22:18:24 -0000 1.18
***************
*** 105,108 ****
--- 105,112 ----
the script for an operational definition of "loose".
+ rebal.py
+ Evens out the number of messages in "standard" test data folders (see
+ below). Needs generalization (e.g., Ham and 4000 are hardcoded now).
+
mboxcount.py
Count the number of messages (both parseable and unparseable) in
***************
*** 117,127 ****
Like splitn.py (above), but splits an mbox into one message per file in
"the standard" directory structure (see below). This does an
! approximate split; rebal.by (below) can be used afterwards to even out
the number of messages per folder.
- rebal.py
- Evens out the number of messages in "standard" test data folders (see
- below). Needs generalization (e.g., Ham and 4000 are hardcoded now).
-
Standard Test Data Setup
--- 121,127 ----
Like splitn.py (above), but splits an mbox into one message per file in
"the standard" directory structure (see below). This does an
! approximate split; rebal.py (above) can be used afterwards to even out
the number of messages per folder.
Standard Test Data Setup
***************
*** 133,156 ****
random when testing reveals spam mistakenly called ham (and vice versa),
etc -- even pasting examples into email is much easier when it's one msg
! per file (and the test driver makes it easy to print a msg's file path).
The directory structure under my spambayes directory looks like so:
- [But due to a better testing infrastructure, I'm going to spread this
- across 20 subdirectories under Spam and under Ham, and use groups
- of 10 for 10-fold cross validation]
Data/
Spam/
! Set1/ (contains 2750 spam .txt files)
Set2/ ""
Set3/ ""
Set4/ ""
Set5/ ""
Ham/
! Set1/ (contains 4000 ham .txt files)
Set2/ ""
Set3/ ""
Set4/ ""
Set5/ ""
reservoir/ (contains "backup ham")
--- 133,163 ----
random when testing reveals spam mistakenly called ham (and vice versa),
etc -- even pasting examples into email is much easier when it's one msg
! per file (and the test drivers make it easy to print a msg's file path).
The directory structure under my spambayes directory looks like so:
Data/
Spam/
! Set1/ (contains 1375 spam .txt files)
Set2/ ""
Set3/ ""
Set4/ ""
Set5/ ""
+ Set6/ ""
+ Set7/ ""
+ Set8/ ""
+ Set9/ ""
+ Set10/ ""
Ham/
! Set1/ (contains 2000 ham .txt files)
Set2/ ""
Set3/ ""
Set4/ ""
Set5/ ""
+ Set6/ ""
+ Set7/ ""
+ Set8/ ""
+ Set9/ ""
+ Set10/ ""
reservoir/ (contains "backup ham")
***************
*** 159,166 ****
want at least a few hundred messages in each one. The "reservoir" directory
contains a few thousand other random hams. When a ham is found that's
! really spam, I delete it, and then the rebal.py utility moves in a message
! at random from the reservoir to replace it. If I had it to do over
! again, I think I'd move such spam into a Spam set (chosen at random),
! instead of deleting it.
The hams are 20,000 msgs selected at random from a python-list archive.
--- 166,171 ----
want at least a few hundred messages in each one. The "reservoir" directory
contains a few thousand other random hams. When a ham is found that's
! really spam, move it into a Spam directory, and then the rebal.py utility
! moves in a random message from the reservoir to replace it.
The hams are 20,000 msgs selected at random from a python-list archive.
***************
*** 171,176 ****
The sets are grouped into pairs in the obvious way: Spam/Set1 with
Ham/Set1, and so on. For each such pair, timtest trains a classifier on
! that pair, then runs predictions on each of the other 4 pairs. In effect,
! it's a 5x5 test grid, skipping the diagonal. There's no particular reason
to avoid predicting against the same set trained on, except that it
takes more time and seems the least interesting thing to try.
--- 176,181 ----
The sets are grouped into pairs in the obvious way: Spam/Set1 with
Ham/Set1, and so on. For each such pair, timtest trains a classifier on
! that pair, then runs predictions on each of the other pairs. In effect,
! it's a NxN test grid, skipping the diagonal. There's no particular reason
to avoid predicting against the same set trained on, except that it
takes more time and seems the least interesting thing to try.
***************
*** 178,182 ****
Later, support for N-fold cross validation testing was added, which allows
more accurate measurement of error rates with smaller amounts of training
! data. That's recommended now.
CAUTION: The partitioning of your corpora across directories should
--- 183,189 ----
Later, support for N-fold cross validation testing was added, which allows
more accurate measurement of error rates with smaller amounts of training
! data. That's recommended now. timcv.py is to cross-validation testing
! as the older timtest.py is to grid testing. timcv.py has grown additional
! arguments to allow using only a random subset of messages in each Set.
CAUTION: The partitioning of your corpora across directories should
From tim_one@users.sourceforge.net Sun Sep 15 01:01:50 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 14 Sep 2002 17:01:50 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.14,1.15
classifier.py,1.9,1.10
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6591
Modified Files:
Options.py classifier.py
Log Message:
New bool option [Classifier]adjust_probs_by_evidence_mass. See the
mailing list for details. By default, this is turned off.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** Options.py 14 Sep 2002 20:08:07 -0000 1.14
--- Options.py 15 Sep 2002 00:01:48 -0000 1.15
***************
*** 119,122 ****
--- 119,126 ----
max_discriminators: 16
+
+ # Speculative change to allow giving probabilities more weight the more
+ # messages went into computing them.
+ adjust_probs_by_evidence_mass: False
"""
***************
*** 152,155 ****
--- 156,160 ----
'unknown_spamprob': float_cracker,
'max_discriminators': int_cracker,
+ 'adjust_probs_by_evidence_mass': boolean_cracker,
},
}
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** classifier.py 13 Sep 2002 19:46:41 -0000 1.9
--- classifier.py 15 Sep 2002 00:01:48 -0000 1.10
***************
*** 547,550 ****
--- 547,551 ----
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
+ fiddle = options.adjust_probs_by_evidence_mass
for word,record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
***************
*** 560,570 ****
prob = MAX_SPAMPROB
!
! ## if prob != 0.5:
! ## confbias = 0.01 / (record.hamcount + record.spamcount)
! ## if prob > 0.5:
! ## prob = max(0.5, prob - confbias)
! ## else:
! ## prob = min(0.5, prob + confbias)
if record.spamprob != prob:
--- 561,581 ----
prob = MAX_SPAMPROB
! if fiddle:
! # Suppose two clues have spamprob 0.99. Which one is better?
! # One reasonable guess is that it's the one derived from the
! # most data. This code fiddles non-0.5 probabilities by
! # shrinking their distance to 0.5, but shrinking less the
! # more evidence went into computing them. Note that if this
! # proves to work, it should allow getting rid of the
! # "cancelling evidence" complications in spamprob()
! # (two probs exactly the same distance from 0.5 are far
! # less common after this transformation; instead, spamprob()
! # will pick up on the clues with the most evidence backing
! # them up).
! dist = prob - 0.5
! if dist:
! sum = float(record.hamcount + record.spamcount)
! dist *= sum / (sum + 1.0)
! prob = 0.5 + dist
if record.spamprob != prob:
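The shrink the comment above describes is easy to see in isolation. A sketch, not the project's code; the counts are invented:

```python
def shrink(prob, hamcount, spamcount):
    """Pull prob toward 0.5; the more evidence, the less the shrinking."""
    dist = prob - 0.5
    n = float(hamcount + spamcount)
    return 0.5 + dist * n / (n + 1.0)

# Two clues that both look like spamprob 0.99 are no longer tied:
weak = shrink(0.99, 0, 1)     # backed by 1 msg:  0.5 + 0.49 * 1/2  = 0.745
strong = shrink(0.99, 0, 99)  # backed by 99 msgs: 0.5 + 0.49 * 99/100 = 0.9851
```

After the transformation, spamprob() naturally favors the clue with more evidence behind it.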
From tim_one@users.sourceforge.net Sun Sep 15 08:45:33 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 15 Sep 2002 00:45:33 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.10,1.11
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8018
Modified Files:
classifier.py
Log Message:
update_probabilities: rearranged the base computation to make more
sense, and refined the optional "evidence mass" fiddling. To try this
as intended, you have to change *four* classifier options at the same
time:
[Classifier]
adjust_probs_by_evidence_mass: True
min_spamprob: 0.001
max_spamprob: 0.999
hambias: 1.5
See discussion on the mailing list.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** classifier.py 15 Sep 2002 00:01:48 -0000 1.10
--- classifier.py 15 Sep 2002 07:45:31 -0000 1.11
***************
*** 550,557 ****
for word,record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
! hamcount = HAMBIAS * record.hamcount
! spamcount = SPAMBIAS * record.spamcount
! hamratio = min(1.0, hamcount / nham)
! spamratio = min(1.0, spamcount / nspam)
prob = spamratio / (hamratio + spamratio)
--- 550,557 ----
for word,record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
! hamcount = min(HAMBIAS * record.hamcount, nham)
! spamcount = min(SPAMBIAS * record.spamcount, nspam)
! hamratio = hamcount / nham
! spamratio = spamcount / nspam
prob = spamratio / (hamratio + spamratio)
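For concreteness, the rearranged base computation with made-up counts. HAMBIAS = 2.0 here follows Graham's original doubling of ham counts; treat all the numbers as illustrative:

```python
HAMBIAS, SPAMBIAS = 2.0, 1.0

def base_prob(hamcount, spamcount, nham, nspam):
    """prob(msg is spam | msg contains word), with biased counts clamped
    so that neither ratio can exceed 1.0."""
    h = min(HAMBIAS * hamcount, nham)
    s = min(SPAMBIAS * spamcount, nspam)
    hamratio = h / nham
    spamratio = s / nspam
    return spamratio / (hamratio + spamratio)

# Seen in 1 of 100 hams and 10 of 100 spams:
p = base_prob(1, 10, 100.0, 100.0)   # 0.1 / (0.02 + 0.1) = 5/6
```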
***************
*** 574,581 ****
# them up).
dist = prob - 0.5
! if dist:
! sum = float(record.hamcount + record.spamcount)
! dist *= sum / (sum + 1.0)
! prob = 0.5 + dist
if record.spamprob != prob:
--- 574,580 ----
# them up).
dist = prob - 0.5
! sum = hamcount + spamcount
! dist *= sum / (sum + 0.1)
! prob = 0.5 + dist
if record.spamprob != prob:
From richiehindle@users.sourceforge.net Mon Sep 16 08:57:22 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Mon, 16 Sep 2002 00:57:22 -0700
Subject: [Spambayes-checkins] spambayes pop3proxy.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5459
Added Files:
pop3proxy.py
Log Message:
pop3proxy.py is a spam-classifying POP3 proxy, plus associated test code.
--- NEW FILE: pop3proxy.py ---
#!/usr/bin/env python
# pop3proxy is released under the terms of the following MIT-style license:
#
# Copyright (c) Entrian Solutions 2002
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
"""A POP3 proxy designed to work with classifier.py, to add an X-Bayes-Score
header to each incoming email. The header gives a floating point number
between 0.00 and 1.00, to two decimal places. You point pop3proxy at your
POP3 server, and configure your email client to collect mail from the proxy
and filter on the X-Bayes-Score header. Usage:
pop3proxy.py [options] <server> [<server port>]
<server> is the name of your real POP3 server
<server port> is the port number of your real POP3 server, which
defaults to 110.
options (the same as hammie):
-p FILE : use the named data file
-d : the file is a DBM file rather than a pickle
pop3proxy -t
Runs a test POP3 server on port 8110; useful for testing.
pop3proxy -h
Displays this help message.
For safety, and to help debugging, the whole POP3 conversation is written
out to _pop3proxy.log for each run.
"""
import sys, re, operator, errno, getopt, cPickle, socket, asyncore, asynchat
import classifier, tokenizer, hammie
from classifier import GrahamBayes, WordInfo # So we can unpickle these.
HEADER_FORMAT = 'X-Bayes-Score: %1.2f\r\n'
HEADER_EXAMPLE = 'X-Bayes-Score: 0.12\r\n'
class Listener( asyncore.dispatcher ):
"""Listens for incoming socket connections and spins off dispatchers
created by a factory callable."""
def __init__( self, port, factory, factoryArgs=(),
socketMap=asyncore.socket_map ):
asyncore.dispatcher.__init__( self, map=socketMap )
self.socketMap = socketMap
self.factory = factory
self.factoryArgs = factoryArgs
s = socket.socket( socket.AF_INET, socket.SOCK_STREAM )
s.setblocking( False )
self.set_socket( s, socketMap )
self.set_reuse_addr()
self.bind( ( '', port ) )
self.listen( 5 )
def handle_accept( self ):
clientSocket, clientAddress = self.accept()
args = [ clientSocket ] + list( self.factoryArgs )
if self.socketMap != asyncore.socket_map:
self.factory( *args, **{ 'socketMap': self.socketMap } )
else:
self.factory( *args )
class POP3ProxyBase( asynchat.async_chat ):
"""An async dispatcher that understands POP3 and proxies to a POP3
server, calling `self.onTransaction( request, response )` for each
transaction. Responses are not un-byte-stuffed before reaching
self.onTransaction() (they probably should be for a totally generic
POP3ProxyBase class, but BayesProxy doesn't need it and it would mean
re-stuffing them afterwards). self.onTransaction() should return the
response to pass back to the email client - the response can be the
verbatim response or a processed version of it. The special command
'KILL' kills it (passing a 'QUIT' command to the server)."""
def __init__( self, clientSocket, serverName, serverPort ):
asynchat.async_chat.__init__( self, clientSocket )
self.request = ''
self.isClosing = False
self.set_terminator( '\r\n' )
serverSocket = socket.socket( socket.AF_INET, socket.SOCK_STREAM )
serverSocket.connect( ( serverName, serverPort ) )
self.serverFile = serverSocket.makefile()
self.push( self.serverFile.readline() )
def handle_connect( self ):
"""Suppress the asyncore "unhandled connect event" warning."""
pass
def onTransaction( self, command, args, response ):
"""Overide this. Takes the raw request and the response, and
returns the (possibly processed) response to pass back to the
email client."""
raise NotImplementedError
def isMultiline( self, command, args ):
"""Returns True if the given request should get a multiline response
(assuming the response is positive)."""
if command in [ 'USER', 'PASS', 'APOP', 'QUIT',
'STAT', 'DELE', 'NOOP', 'RSET', 'KILL' ]:
return False
elif command in [ 'RETR', 'TOP' ]:
return True
elif command in [ 'LIST', 'UIDL' ]:
return len( args ) == 0
else:
# Assume that unknown commands will get an error response.
return False
def readResponse( self, command, args ):
"""Reads the POP3 server's response. Also sets self.isClosing to
True if the server closes the socket, which tells found_terminator()
to close when the response has been sent."""
isMulti = self.isMultiline( command, args )
responseLines = []
isFirstLine = True
while True:
line = self.serverFile.readline()
if not line:
# The socket has been closed by the server, probably by QUIT.
self.isClosing = True
break
elif not isMulti or ( isFirstLine and line.startswith( '-ERR' ) ):
# A single-line response.
responseLines.append( line )
break
elif line == '.\r\n':
# The termination line.
responseLines.append( line )
break
else:
# A normal line - append it to the response and carry on.
responseLines.append( line )
isFirstLine = False
return ''.join( responseLines )
def collect_incoming_data( self, data ):
"""Asynchat override."""
self.request = self.request + data
def found_terminator( self ):
"""Asynchat override."""
# Send the request to the server and read the reply.
# XXX When the response is huge, the email client can time out.
# It should read as much as it can from the server, then if the
# response is still coming after say 30 seconds, it should classify
# the message based on that and send back the headers and the body
# so far. Then it should become a simple one-packet-at-a-time proxy
# for the rest of the response.
if self.request.strip().upper() == 'KILL':
self.serverFile.write( 'QUIT\r\n' )
self.serverFile.flush()
self.send( "+OK, dying.\r\n" )
self.shutdown( 2 )
self.close()
raise SystemExit
self.serverFile.write( self.request + '\r\n' )
self.serverFile.flush()
if self.request.strip() == '':
# Someone just hit the Enter key.
command, args = ( '', '' )
else:
splitCommand = self.request.strip().split( None, 1 )
command = splitCommand[ 0 ].upper()
args = splitCommand[ 1: ]
rawResponse = self.readResponse( command, args )
# Pass the request/reply to the subclass and send back its response.
cookedResponse = self.onTransaction( command, args, rawResponse )
self.push( cookedResponse )
self.request = ''
# If readResponse() decided that the server had closed its socket,
# close this one when the response has been sent.
if self.isClosing:
self.close_when_done()
def handle_error( self ):
"""Let SystemExit cause an exit."""
type, v, t = sys.exc_info()
if type == SystemExit:
raise
else:
asynchat.async_chat.handle_error( self )
class BayesProxyListener( Listener ):
"""Listens for incoming email client connections and spins off
BayesProxy objects to serve them."""
def __init__( self, serverName, serverPort, proxyPort, bayes ):
proxyArgs = ( serverName, serverPort, bayes )
Listener.__init__( self, proxyPort, BayesProxy, proxyArgs )
class BayesProxy( POP3ProxyBase ):
"""Proxies between an email client and a POP3 server, inserting
X-Bayes-Score headers. It acts on the following POP3 commands:
o STAT:
o Adds the size of all the X-Bayes-Score headers to the maildrop
size.
o LIST:
o With no message number: adds the size of an X-Bayes-Score header
to the message size for each message in the scan listing.
o With a message number: adds the size of an X-Bayes-Score header
to the message size.
o RETR:
o Adds the X-Bayes-Score header based on the raw headers and body
of the message.
o TOP:
o Adds the X-Bayes-Score header based on the raw headers and as much
of the body as the TOP command retrieves. This can mean that the
header might have a different value for different calls to TOP, or
for calls to TOP vs. calls to RETR. I'm assuming that the email
client will either not make multiple calls, or will cope with the
headers being different.
"""
def __init__( self, clientSocket, serverName, serverPort, bayes ):
# Open the log file *before* calling __init__ for the base class,
# 'cos that might call send or recv.
self.bayes = bayes
self.logFile = open( '_pop3proxy.log', 'wb' )
POP3ProxyBase.__init__( self, clientSocket, serverName, serverPort )
self.handlers = { 'STAT': self.onStat, 'LIST': self.onList,
'RETR': self.onRetr, 'TOP': self.onTop }
def send( self, data ):
"""Logs the data to the log file."""
self.logFile.write( data )
self.logFile.flush()
return POP3ProxyBase.send( self, data )
def recv( self, size ):
"""Logs the data to the log file."""
data = POP3ProxyBase.recv( self, size )
self.logFile.write( data )
self.logFile.flush()
return data
def onTransaction( self, command, args, response ):
"""Takes the raw request and response, and returns the (possibly
processed) response to pass back to the email client."""
handler = self.handlers.get( command, self.onUnknown )
return handler( command, args, response )
def onStat( self, command, args, response ):
"""Adds the size of all the X-Bayes-Score headers to the maildrop
size."""
match = re.search( r'^\+OK\s+(\d+)\s+(\d+)(.*)\r\n', response )
if match:
count = int( match.group( 1 ) )
size = int( match.group( 2 ) ) + len( HEADER_EXAMPLE ) * count
return '+OK %d %d%s\r\n' % ( count, size, match.group( 3 ) )
else:
return response
def onList( self, command, args, response ):
"""Adds the size of an X-Bayes-Score header to the message
size(s)."""
if response.count( '\r\n' ) > 1:
# Multiline: all lines but the first contain a message size.
lines = response.split( '\r\n' )
outputLines = [ lines[ 0 ] ]
for line in lines[ 1: ]:
match = re.search( '^(\d+)\s+(\d+)', line )
if match:
number = int( match.group( 1 ) )
size = int( match.group( 2 ) ) + len( HEADER_EXAMPLE )
line = "%d %d" % ( number, size )
outputLines.append( line )
return '\r\n'.join( outputLines )
else:
# Single line.
match = re.search( '^\+OK\s+(\d+)(.*)\r\n', response )
if match:
size = int( match.group( 1 ) ) + len( HEADER_EXAMPLE )
return "+OK %d%s\r\n" % ( size, match.group( 2 ) )
else:
return response
def onRetr( self, command, args, response ):
"""Adds the X-Bayes-Score header based on the raw headers and body
of the message."""
# Use '\n\r?\n' to detect the end of the headers in case of broken
# emails that don't use the proper line separators.
if re.search( r'\n\r?\n', response ):
# Break off the first line, which will be '+OK'.
ok, message = response.split( '\n', 1 )
# Now find the spam probability and add the header.
prob = self.bayes.spamprob( tokenizer.tokenize( message ) )
headers, body = re.split( r'\n\r?\n', response, 1 )
headers = headers + '\r\n' + HEADER_FORMAT % prob + '\r\n'
return headers + body
else:
# Must be an error response.
return response
def onTop( self, command, args, response ):
"""Adds the X-Bayes-Score header based on the raw headers and as
much of the body as the TOP command retrieves."""
# Easy (but see the caveat in BayesProxy.__doc__).
return self.onRetr( command, args, response )
def onUnknown( self, command, args, response ):
"""Default handler - just returns the server's response verbatim."""
return response
def createBayes( pickleName=None, useDB=False ):
"""Create a GrahamBayes object to score the emails."""
bayes = None
if useDB:
bayes = hammie.PersistentGrahamBayes( pickleName )
elif pickleName:
try:
fp = open( pickleName, 'rb' )
except IOError, e:
if e.errno <> errno.ENOENT:
raise
else:
print "Loading database...",
bayes = cPickle.load( fp )
fp.close()
print "Done."
if bayes is None:
bayes = GrahamBayes()
return bayes
def main( serverName, serverPort, proxyPort, pickleName, useDB ):
"""Runs the proxy forever or until a 'KILL' command is received or
someone hits Ctrl+Break."""
bayes = createBayes( pickleName, useDB )
BayesProxyListener( serverName, serverPort, proxyPort, bayes )
asyncore.loop()
# ===================================================================
# Test code.
# ===================================================================
# One example of spam and one of ham - both are used to train, and are then
# classified. Not a good test of the classifier, but a perfectly good test
# of the POP3 proxy. The bodies of these came from the spambayes project,
# and I added the headers myself because the originals had no headers.
spam1 = """From: friend@public.com
Subject: Make money fast
Hello tim_chandler , Want to save money ?
Now is a good time to consider refinancing. Rates are low so you can cut
your current payments and save money.
http://64.251.22.101/interest/index%38%30%300%2E%68t%6D
Take off list on site [s5]
"""
good1 = """From: chris@example.com
Subject: ZPT and DTML
Jean Jordaan wrote:
> 'Fraid so ;> It contains a vintage dtml-calendar tag.
> http://www.zope.org/Members/teyc/CalendarTag
>
> Hmm I think I see what you mean: one needn't manually pass on the
> namespace to a ZPT?
Yeah, Page Templates are a bit more clever, sadly, DTML methods aren't :-(
Chris
"""
class TestListener( Listener ):
"""Listener for TestPOP3Server. Works on port 8110, to co-exist with
real POP3 servers."""
def __init__( self, socketMap=asyncore.socket_map ):
Listener.__init__( self, 8110, TestPOP3Server, socketMap=socketMap )
class TestPOP3Server( asynchat.async_chat ):
"""Minimal POP3 server, for testing purposes. Doesn't support TOP or
UIDL. USER, PASS, APOP, DELE and RSET simply return "+OK" without doing
anything. Also understands the 'KILL' command, to kill it. The mail
content is the example messages in classifier.py."""
def __init__( self, clientSocket, socketMap=asyncore.socket_map ):
# Grumble: asynchat.__init__ doesn't take a 'map' argument, hence
# the two-stage construction.
asynchat.async_chat.__init__( self )
asynchat.async_chat.set_socket( self, clientSocket, socketMap )
self.maildrop = [ spam1, good1 ]
self.set_terminator( '\r\n' )
self.okCommands = [ 'USER', 'PASS', 'APOP', 'NOOP',
'DELE', 'RSET', 'QUIT', 'KILL' ]
self.handlers = { 'STAT': self.onStat,
'LIST': self.onList,
'RETR': self.onRetr }
self.push( "+OK ready\r\n" )
self.request = ''
def handle_connect( self ):
"""Suppress the asyncore "unhandled connect event" warning."""
pass
def collect_incoming_data( self, data ):
"""Asynchat override."""
self.request = self.request + data
def found_terminator( self ):
"""Asynchat override."""
if ' ' in self.request:
command, args = self.request.split( None, 1 )
else:
command, args = self.request, ''
command = command.upper()
if command in self.okCommands:
self.push( "+OK (we hope)\r\n" )
if command == 'QUIT':
self.close_when_done()
if command == 'KILL':
raise SystemExit
else:
handler = self.handlers.get( command, self.onUnknown )
self.push( handler( command, args ) )
self.request = ''
def handle_error( self ):
"""Let SystemExit cause an exit."""
type, v, t = sys.exc_info()
if type == SystemExit:
raise
else:
asynchat.async_chat.handle_error( self )
def onStat( self, command, args ):
maildropSize = reduce( operator.add, map( len, self.maildrop ) )
maildropSize += len( self.maildrop ) * len( HEADER_EXAMPLE )
return "+OK %d %d\r\n" % ( len( self.maildrop ), maildropSize )
def onList( self, command, args ):
if args:
number = int( args )
if 0 < number <= len( self.maildrop ):
return "+OK %d\r\n" % len( self.maildrop[ number - 1 ] )
else:
return "-ERR no such message\r\n"
else:
returnLines = [ "+OK" ]
for messageIndex in range( len( self.maildrop ) ):
size = len( self.maildrop[ messageIndex ] )
returnLines.append( "%d %d" % ( messageIndex + 1, size ) )
returnLines.append( "." )
return '\r\n'.join( returnLines ) + '\r\n'
def onRetr( self, command, args ):
number = int( args )
if 0 < number <= len( self.maildrop ):
message = self.maildrop[ number - 1 ]
return "+OK\r\n%s\r\n.\r\n" % message
else:
return "-ERR no such message\r\n"
def onUnknown( self, command, args ):
return "-ERR Unknown command: '%s'\r\n" % command
def test():
"""Runs a self-test using TestPOP3Server, a minimal POP3 server that
serves the example emails above."""
# Run a proxy and a test server in separate threads with separate
# asyncore environments.
import threading
testServerReady = threading.Event()
def runTestServer():
testSocketMap = {}
TestListener( socketMap=testSocketMap )
testServerReady.set()
asyncore.loop( map=testSocketMap )
def runProxy():
bayes = createBayes()
BayesProxyListener( 'localhost', 8110, 8111, bayes )
bayes.learn( tokenizer.tokenize( spam1 ), True )
bayes.learn( tokenizer.tokenize( good1 ), False )
asyncore.loop()
threading.Thread( target=runTestServer ).start()
testServerReady.wait()
threading.Thread( target=runProxy ).start()
# Connect to the proxy.
proxy = socket.socket( socket.AF_INET, socket.SOCK_STREAM )
proxy.connect( ( 'localhost', 8111 ) )
assert proxy.recv( 100 ) == "+OK ready\r\n"
# Stat the mailbox to get the number of messages.
proxy.send( "stat\r\n" )
response = proxy.recv( 100 )
count, totalSize = map( int, response.split()[ 1:3 ] )
print "%d messages in test mailbox" % count
assert count == 2
# Loop through the messages ensuring that they have X-Bayes-Score
# headers.
for i in range( 1, count+1 ):
response = ""
proxy.send( "retr %d\r\n" % i )
while response.find( '\n.\r\n' ) == -1:
response = response + proxy.recv( 1000 )
headerOffset = response.find( 'X-Bayes-Score' )
assert headerOffset != -1
headerEnd = headerOffset + len( HEADER_EXAMPLE )
header = response[ headerOffset:headerEnd ].strip()
print "Message %d: %s" % ( i, header )
# Kill the proxy and the test server.
proxy.sendall( "kill\r\n" )
server = socket.socket( socket.AF_INET, socket.SOCK_STREAM )
server.connect( ( 'localhost', 8110 ) )
server.sendall( "kill\r\n" )
# ===================================================================
# __main__ driver.
# ===================================================================
if __name__ == '__main__':
# Read the arguments.
try:
opts, args = getopt.getopt( sys.argv[ 1: ], 'htdp:' )
except getopt.error, msg:
print >>sys.stderr, str( msg ) + '\n\n' + __doc__
sys.exit()
pickleName = hammie.DEFAULTDB
useDB = False
runTestServer = False
for opt, arg in opts:
if opt == '-h':
print >>sys.stderr, __doc__
sys.exit()
elif opt == '-t':
runTestServer = True
elif opt == '-d':
useDB = True
elif opt == '-p':
pickleName = arg
# Do whatever we've been asked to do...
if not opts and not args:
print "Running a self-test (use 'pop3proxy -h' for help)"
test()
print "Self-test passed." # ...else it would have asserted.
elif runTestServer:
print "Running a test POP3 server on port 8110..."
TestListener()
asyncore.loop()
elif len( args ) == 1:
# Named POP3 server, default port.
main( args[ 0 ], 110, 110, pickleName, useDB )
elif len( args ) == 2:
# Named POP3 server, named port.
main( args[ 0 ], int( args[ 1 ] ), 110, pickleName, useDB )
else:
print >>sys.stderr, __doc__
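The RETR loop in test() above reads from the proxy until it sees the POP3 end-of-message marker: a multi-line response ends with a line containing only a dot. A minimal, hypothetical sketch of that termination convention (a simplified endswith check rather than the find() the test uses):

```python
# Hypothetical sketch of the dot-terminated multi-line response
# convention that test() relies on: a POP3 multi-line reply ends with
# a line containing only ".", i.e. the byte sequence "\r\n.\r\n".
def pop3_response_complete(buf):
    """Return True once a multi-line POP3 response has fully arrived."""
    return buf.endswith("\r\n.\r\n")

# Accumulate chunks the way the test's recv() loop does:
buf = ""
for chunk in ["+OK\r\nSubject: hi", "\r\n\r\nbody\r\n", ".\r\n"]:
    buf += chunk
complete = pop3_response_complete(buf)
```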
From montanaro@users.sourceforge.net Mon Sep 16 18:28:48 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Mon, 16 Sep 2002 10:28:48 -0700
Subject: [Spambayes-checkins] spambayes loosecksum.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25859
Modified Files:
loosecksum.py
Log Message:
fix typo
Index: loosecksum.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/loosecksum.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** loosecksum.py 9 Sep 2002 19:23:18 -0000 1.1
--- loosecksum.py 16 Sep 2002 17:28:45 -0000 1.2
***************
*** 79,84 ****
return flatten(obj.get_payload())
if isinstance(obj, list):
! return "\n".join([flatten(b) for b in body])
! raise TypeError, ("unrecognized body type: %s" % type(body))
def generate_checksum(f):
--- 79,84 ----
return flatten(obj.get_payload())
if isinstance(obj, list):
! return "\n".join([flatten(b) for b in obj])
! raise TypeError, ("unrecognized body type: %s" % type(obj))
def generate_checksum(f):
From rubiconx@users.sourceforge.net Tue Sep 17 05:49:18 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Mon, 16 Sep 2002 21:49:18 -0700
Subject: [Spambayes-checkins] spambayes runtest.sh,NONE,1.1
README.txt,1.18,1.19
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26378
Modified Files:
README.txt
Added Files:
runtest.sh
Log Message:
Added the runtest.sh script, which is supposed to make it easier for
rubes like myself to submit useful test results.
--- NEW FILE: runtest.sh ---
#! /bin/sh -x
##
## runtest.sh -- run some tests for Tim
##
## This does everything you need to test yer data. You may want to skip
## the rebal steps if you've recently moved some of your messages
## (because they were in the wrong corpus) or you may suffer my fate and
## get stuck forever re-categorizing email.
##
## Just set up your messages as detailed in README.txt; put them all in
## the reservoir directories, and this script will take care of the
## rest. Paste the output (also in results.txt) to the mailing list for
## good karma.
##
## Neale Pickett
##
# Number of messages per rebalanced set
RNUM=200
# Number of sets
SETS=5
# Put them all into reservoirs
python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q
python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q
# Rebalance
python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q
python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q
# Clear out .ini file
rm -f bayescustomize.ini
# Run 1
python timcv.py -n $SETS > run1.txt
# New .ini file
cat > bayescustomize.ini < run2.txt
# Generate rates
python rates.py run1 run2 > runrates.txt
# Compare rates
python cmp.py run1s run2s | tee results.txt
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** README.txt 14 Sep 2002 22:18:24 -0000 1.18
--- README.txt 17 Sep 2002 04:49:16 -0000 1.19
***************
*** 124,127 ****
--- 124,134 ----
the number of messages per folder.
+ runtest.sh
+ A Bourne shell script (for Unix) which will run some test or other.
+ I (Neale) will try to keep this updated to test whatever Tim is
+ currently asking for. The idea is, if you have a standard directory
+ structure (below), you can run this thing, go have some tea while it
+ works, then paste the output to the spambayes list for good karma.
+
Standard Test Data Setup
From jhylton@users.sourceforge.net Tue Sep 17 16:29:48 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Tue, 17 Sep 2002 08:29:48 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.15,1.16
mboxtest.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29116
Modified Files:
Options.py mboxtest.py
Log Message:
Add three options for MboxTest.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** Options.py 15 Sep 2002 00:01:48 -0000 1.15
--- Options.py 17 Sep 2002 15:29:45 -0000 1.16
***************
*** 68,71 ****
--- 68,86 ----
mine_received_headers: False
+ [MboxTest]
+ # If tokenize_header_words is true, then the header values are
+ # tokenized using the default text tokenizer. The words are tagged
+ # with "header:" where header is the name of the header.
+ tokenize_header_words: False
+ # If tokenize_header_default is True, use the base header tokenization
+ # logic described in the Tokenizer section.
+ tokenize_header_default: True
+
+ # skip_headers is a set of regular expressions describing headers that
+ # should not be tokenized if tokenize_header is True.
+ skip_headers: received
+ date
+ x-.*
+
[TestDriver]
# These control various displays in class TestDriver.Driver.
***************
*** 158,161 ****
--- 173,180 ----
'adjust_probs_by_evidence_mass': boolean_cracker,
},
+ 'MboxTest': {'tokenize_header_words': boolean_cracker,
+ 'tokenize_header_default': boolean_cracker,
+ 'skip_headers': ('get', lambda s: Set(s.split())),
+ },
}
Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** mboxtest.py 14 Sep 2002 00:03:51 -0000 1.5
--- mboxtest.py 17 Sep 2002 15:29:45 -0000 1.6
***************
*** 22,25 ****
--- 22,26 ----
import mailbox
import random
+ import re
from sets import Set
import sys
***************
*** 28,31 ****
--- 29,33 ----
from TestDriver import Driver
from timtest import Msg
+ from Options import options
mbox_fmts = {"unix": mailbox.PortableUnixMailbox,
***************
*** 37,53 ****
class MyTokenizer(Tokenizer):
! skip = {'received': 1,
! 'date': 1,
! 'x-from_': 1,
! }
def tokenize_headers(self, msg):
! for k, v in msg.items():
! k = k.lower()
! if k in self.skip or k.startswith('x-vm'):
! continue
! for w in subject_word_re.findall(v):
! for t in tokenize_word(w):
! yield "%s:%s" % (k, t)
class MboxMsg(Msg):
--- 39,57 ----
class MyTokenizer(Tokenizer):
! skip = [re.compile(rx) for rx in options.skip_headers]
def tokenize_headers(self, msg):
! if options.tokenize_header_words:
! for k, v in msg.items():
! k = k.lower()
! for rx in self.skip:
! if rx.match(k):
! continue
! for w in subject_word_re.findall(v):
! for t in tokenize_word(w):
! yield "%s:%s" % (k, t)
! if options.tokenize_header_default:
! for tok in Tokenizer.tokenize_headers(self, msg):
! yield tok
class MboxMsg(Msg):
***************
*** 74,81 ****
return "\n".join(lines)
! ## tokenize = MyTokenizer().tokenize
def __iter__(self):
! return tokenize(self.guts)
class mbox(object):
--- 78,85 ----
return "\n".join(lines)
! tokenize = MyTokenizer().tokenize
def __iter__(self):
! return self.tokenize(self.guts)
class mbox(object):
***************
*** 130,134 ****
FMT = "unix"
! NSETS = 5
SEED = 101
MAXMSGS = None
--- 134,138 ----
FMT = "unix"
! NSETS = 10
SEED = 101
MAXMSGS = None
***************
*** 158,176 ****
print "spam", spam, nspam
! testsets = []
! for iham in randindices(nham, NSETS):
! for ispam in randindices(nspam, NSETS):
! testsets.append((sort(iham), sort(ispam)))
driver = Driver()
! for iham, ispam in testsets:
! driver.new_classifier()
! driver.train(mbox(ham, iham), mbox(spam, ispam))
! for ihtest, istest in testsets:
! if (iham, ispam) == (ihtest, istest):
! continue
! driver.test(mbox(ham, ihtest), mbox(spam, istest))
driver.finishtest()
driver.alldone()
--- 162,188 ----
print "spam", spam, nspam
! ihams = map(tuple, randindices(nham, NSETS))
! ispams = map(tuple, randindices(nspam, NSETS))
driver = Driver()
! for i in range(1, NSETS):
! driver.train(mbox(ham, ihams[i]), mbox(spam, ispams[i]))
!
! i = 0
! for iham, ispam in zip(ihams, ispams):
! hams = mbox(ham, iham)
! spams = mbox(spam, ispam)
!
! if i > 0:
! driver.untrain(hams, spams)
!
! driver.test(hams, spams)
driver.finishtest()
+
+ if i < NSETS - 1:
+ driver.train(hams, spams)
+
+ i += 1
driver.alldone()
From jhylton@users.sourceforge.net Tue Sep 17 18:57:42 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Tue, 17 Sep 2002 10:57:42 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.16,1.17 mboxtest.py,1.6,1.7 tokenizer.py,1.22,1.23
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22911
Modified Files:
Options.py mboxtest.py tokenizer.py
Log Message:
Merge the simple tokenizer from mboxtest.MyTokenizer into the default
tokenizer, controlled by the basic_header_tokenize options.
This gives good results for my ham/spam collection, cutting the number
of false negatives in half without changing the total number of false
positives.
false positive percentages
3.030 1.527 won -49.60%
0.758 3.053 lost +302.77%
3.030 1.527 won -49.60%
1.515 1.527 lost +0.79%
0.758 0.000 won -100.00%
1.515 2.290 lost +51.16%
1.515 1.527 lost +0.79%
3.030 2.290 won -24.42%
0.758 0.763 lost +0.66%
0.000 1.527 lost +(was 0)
won 4 times
tied 0 times
lost 6 times
total unique fp went from 21 to 21 tied
mean fp % went from 1.59090909091 to 1.60305343511 lost +0.76%
false negative percentages
4.511 4.511 tied
9.023 3.759 won -58.34%
8.271 3.759 won -54.55%
9.023 5.263 won -41.67%
7.519 2.256 won -70.00%
8.271 3.759 won -54.55%
9.774 4.511 won -53.85%
5.263 3.759 won -28.58%
4.511 3.759 won -16.67%
3.759 3.759 tied
won 8 times
tied 2 times
lost 0 times
total unique fn went from 93 to 52 won -44.09%
mean fn % went from 6.99248120301 to 3.90977443609 won -44.09%
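The won/lost figures in the tables above come from comparing per-run error rates. A rough, hypothetical re-creation of how one such delta line can be formatted (not cmp.py's actual code):

```python
def delta(before, after):
    """Format a before/after error-rate change the way the comparison
    tables in these checkin messages read ("won -49.60%",
    "lost +302.77%", "lost +(was 0)")."""
    if after == before:
        return "tied"
    if before == 0:
        return "lost +(was 0)"
    pct = (after - before) / before * 100.0
    verdict = "won" if after < before else "lost"
    return "%s %+.2f%%" % (verdict, pct)
```

Applied to the first rows of the false positive table: delta(3.030, 1.527) gives "won -49.60%" and delta(0.758, 3.053) gives "lost +302.77%".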
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** Options.py 17 Sep 2002 15:29:45 -0000 1.16
--- Options.py 17 Sep 2002 17:57:39 -0000 1.17
***************
*** 13,16 ****
--- 13,36 ----
defaults = """
[Tokenizer]
+ # If true, tokenizer.Tokenizer.tokenize_headers() will tokenize the
+ # contents of each header field just like the text of the message
+ # body, using the name of the header as a tag. Tokens look like
+ # "header:word". The basic approach is simple and effective, but also
+ # very sensitive to biases in the ham and spam collections. For
+ # example, if the ham and spam were collected at different times,
+ # several headers with date/time information will become the best
+ # discriminators. (Not just Date, but Received and X-From_.)
+ basic_header_tokenize: False
+
+ # If true and basic_header_tokenize is also true, then
+ # basic_header_tokenize is the only action performed.
+ basic_header_tokenize_only: False
+
+ # If basic_header_tokenize is true, then basic_header_skip is a set of
+ # headers that should be skipped.
+ basic_header_skip: received
+ date
+ x-.*
+
# If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags
# from pure text/html messages. Set true to retain HTML tags in this
***************
*** 68,86 ****
mine_received_headers: False
- [MboxTest]
- # If tokenize_header_words is true, then the header values are
- # tokenized using the default text tokenizer. The words are tagged
- # with "header:" where header is the name of the header.
- tokenize_header_words: False
- # If tokenize_header_default is True, use the base header tokenization
- # logic described in the Tokenizer section.
- tokenize_header_default: True
-
- # skip_headers is a set of regular expressions describing headers that
- # should not be tokenized if tokenize_header is True.
- skip_headers: received
- date
- x-.*
-
[TestDriver]
# These control various displays in class TestDriver.Driver.
--- 88,91 ----
***************
*** 151,154 ****
--- 156,162 ----
'count_all_header_lines': boolean_cracker,
'mine_received_headers': boolean_cracker,
+ 'basic_header_tokenize': boolean_cracker,
+ 'basic_header_tokenize_only': boolean_cracker,
+ 'basic_header_skip': ('get', lambda s: Set(s.split())),
},
'TestDriver': {'nbuckets': int_cracker,
***************
*** 173,180 ****
'adjust_probs_by_evidence_mass': boolean_cracker,
},
- 'MboxTest': {'tokenize_header_words': boolean_cracker,
- 'tokenize_header_default': boolean_cracker,
- 'skip_headers': ('get', lambda s: Set(s.split())),
- },
}
--- 181,184 ----
***************
*** 222,226 ****
return output.getvalue()
-
options = OptionsClass()
--- 226,229 ----
***************
*** 230,231 ****
--- 233,235 ----
options.mergefiles(['bayescustomize.ini'])
+
Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** mboxtest.py 17 Sep 2002 15:29:45 -0000 1.6
--- mboxtest.py 17 Sep 2002 17:57:39 -0000 1.7
***************
*** 26,30 ****
import sys
! from tokenizer import Tokenizer, subject_word_re, tokenize_word, tokenize
from TestDriver import Driver
from timtest import Msg
--- 26,30 ----
import sys
! from tokenizer import tokenize
from TestDriver import Driver
from timtest import Msg
***************
*** 37,58 ****
}
- class MyTokenizer(Tokenizer):
-
- skip = [re.compile(rx) for rx in options.skip_headers]
-
- def tokenize_headers(self, msg):
- if options.tokenize_header_words:
- for k, v in msg.items():
- k = k.lower()
- for rx in self.skip:
- if rx.match(k):
- continue
- for w in subject_word_re.findall(v):
- for t in tokenize_word(w):
- yield "%s:%s" % (k, t)
- if options.tokenize_header_default:
- for tok in Tokenizer.tokenize_headers(self, msg):
- yield tok
-
class MboxMsg(Msg):
--- 37,40 ----
***************
*** 78,85 ****
return "\n".join(lines)
- tokenize = MyTokenizer().tokenize
-
def __iter__(self):
! return self.tokenize(self.guts)
class mbox(object):
--- 60,65 ----
return "\n".join(lines)
def __iter__(self):
! return tokenize(self.guts)
class mbox(object):
***************
*** 132,135 ****
--- 112,117 ----
def main(args):
global FMT
+
+ print options.display()
FMT = "unix"
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** tokenizer.py 14 Sep 2002 20:08:07 -0000 1.22
--- tokenizer.py 17 Sep 2002 17:57:39 -0000 1.23
***************
*** 829,832 ****
--- 829,837 ----
class Tokenizer:
+ def __init__(self):
+ if options.basic_header_tokenize:
+ self.basic_skip = [re.compile(s)
+ for s in options.basic_header_skip]
+
def get_message(self, obj):
if isinstance(obj, email.Message.Message):
***************
*** 857,860 ****
--- 862,890 ----
# Special tagging of header lines.
+ # Basic header tokenization
+ # Tokenize the contents of each header field just like the
+ # text of the message body, using the name of the header as a
+ # tag. Tokens look like "header:word". The basic approach is
+ # simple and effective, but also very sensitive to biases in
+ # the ham and spam collections. For example, if the ham and
+ # spam were collected at different times, several headers with
+ # date/time information will become the best discriminators.
+ # (Not just Date, but Received and X-From_.)
+ if options.basic_header_tokenize:
+ for k, v in msg.items():
+ k = k.lower()
+ match = False
+ for rx in self.basic_skip:
+ if rx.match(k) is not None:
+ match = True
+ continue
+ if match:
+ continue
+ for w in subject_word_re.findall(v):
+ for t in tokenize_word(w):
+ yield "%s:%s" % (k, t)
+ if options.basic_header_tokenize_only:
+ return
+
# XXX TODO Neil Schemenauer has gotten a good start on this
# XXX (pvt email). The headers in my spam and ham corpora are
***************
*** 863,868 ****
# XXX some "safe" header lines are included here, where "safe"
# XXX is specific to my sorry corpora.
- # XXX Jeremy Hylton also reported good results from the general
- # XXX header-mining in mboxtest.MyTokenizer.tokenize_headers.
# Content-{Type, Disposition} and their params, and charsets.
--- 893,896 ----
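The basic header tokenization added in this checkin tags each word in a header value with the lowercased header name, skipping headers matched by the skip regexes. A self-contained, hypothetical re-creation of the idea (not the project's exact code; the real version uses subject_word_re and tokenize_word, and a different word pattern):

```python
import re

# Rough stand-in for tokenizer.subject_word_re (an assumption, not
# the project's exact pattern).
word_re = re.compile(r"[\w$\-]+")

def basic_header_tokens(headers, skip=("received", "date", "x-.*")):
    """Yield "header:word" tokens for every header whose lowercased
    name does not match one of the skip regular expressions."""
    skip_res = [re.compile(p) for p in skip]
    for name, value in headers:
        name = name.lower()
        if any(rx.match(name) for rx in skip_res):
            continue
        for word in word_re.findall(value):
            yield "%s:%s" % (name, word.lower())

tokens = list(basic_header_tokens([
    ("Subject", "ZPT and DTML"),
    ("X-Mailer", "whatever"),                     # skipped by x-.*
    ("Date", "Tue, 17 Sep 2002 08:29:48 -0700"),  # skipped by date
]))
```

This illustrates why the skip set matters: without it, Date and Received tokens would dominate as discriminators whenever the ham and spam were collected at different times.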
From tim_one@users.sourceforge.net Wed Sep 18 02:42:00 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 17 Sep 2002 18:42:00 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.17,1.18 classifier.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8672
Modified Files:
Options.py classifier.py
Log Message:
adjust_probs_by_evidence_mass is history -- the reported results weren't
strong and consistent enough to justify keeping it.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** Options.py 17 Sep 2002 17:57:39 -0000 1.17
--- Options.py 18 Sep 2002 01:41:58 -0000 1.18
***************
*** 139,146 ****
max_discriminators: 16
-
- # Speculative change to allow giving probabilities more weight the more
- # messages went into computing them.
- adjust_probs_by_evidence_mass: False
"""
--- 139,142 ----
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** classifier.py 15 Sep 2002 07:45:31 -0000 1.11
--- classifier.py 18 Sep 2002 01:41:58 -0000 1.12
***************
*** 547,551 ****
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
- fiddle = options.adjust_probs_by_evidence_mass
for word,record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
--- 547,550 ----
***************
*** 560,580 ****
elif prob > MAX_SPAMPROB:
prob = MAX_SPAMPROB
-
- if fiddle:
- # Suppose two clues have spamprob 0.99. Which one is better?
- # One reasonable guess is that it's the one derived from the
- # most data. This code fiddles non-0.5 probabilities by
- # shrinking their distance to 0.5, but shrinking less the
- # more evidence went into computing them. Note that if this
- # proves to work, it should allow getting rid of the
- # "cancelling evidence" complications in spamprob()
- # (two probs exactly the same distance from 0.5 are far
- # less common after this transformation; instead, spamprob()
- # will pick up on the clues with the most evidence backing
- # them up).
- dist = prob - 0.5
- sum = hamcount + spamcount
- dist *= sum / (sum + 0.1)
- prob = 0.5 + dist
if record.spamprob != prob:
--- 559,562 ----
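For reference, the removed fiddle shrank each probability's distance from 0.5, shrinking less the more evidence backed it. A hypothetical stand-alone version of that transformation, matching the deleted comment's description:

```python
def adjust_by_evidence_mass(prob, hamcount, spamcount):
    """Shrink prob's distance from 0.5, shrinking less the more
    messages (ham + spam counts) went into computing it -- the
    transformation the deleted code applied when the
    adjust_probs_by_evidence_mass option was on."""
    dist = prob - 0.5
    total = hamcount + spamcount
    dist *= total / (total + 0.1)
    return 0.5 + dist

# A 0.99 clue backed by one message moves much closer to 0.5 than
# one backed by a hundred messages:
one = adjust_by_evidence_mass(0.99, 0, 1)
many = adjust_by_evidence_mass(0.99, 0, 100)
```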
From rubiconx@users.sourceforge.net Wed Sep 18 18:44:27 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Wed, 18 Sep 2002 10:44:27 -0700
Subject: [Spambayes-checkins] spambayes runtest.sh,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22127
Modified Files:
runtest.sh
Log Message:
* Modified runtest.sh for Tim's request to test Robinson's changes.
Index: runtest.sh
===================================================================
RCS file: /cvsroot/spambayes/spambayes/runtest.sh,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** runtest.sh 17 Sep 2002 04:49:16 -0000 1.1
--- runtest.sh 18 Sep 2002 17:44:25 -0000 1.2
***************
*** 16,19 ****
--- 16,22 ----
##
+ # Test to run
+ TEST=${1:-robinson1}
+
# Number of messages per rebalanced set
RNUM=200
***************
*** 28,37 ****
python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q
python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q
! # Clear out .ini file
! rm -f bayescustomize.ini
! # Run 1
! python timcv.py -n $SETS > run1.txt
! # New .ini file
! cat > bayescustomize.ini <
!
! python timcv.py -n $SETS > run1.txt
!
! mv Tester.py Tester.py.orig
! cp Tester.py.new Tester.py
! mv classifier.py classifier.py.orig
! cp classifier.py.new classifier.py
! python timcv.py -n $SETS > run2.txt
!
! python rates.py run1 run2 > runrates.txt
!
! python cmp.py run1s run2s | tee results.txt
!
! mv Tester.py.orig Tester.py
! mv classifier.py.orig classifier.py
! ;;
! mass)
! ## Tim took this code out, don't run this test. I'm leaving
! ## this stuff in here for the time being so I can refer to it
! ## later when I need to do this sort of thing again :)
!
! # Clear out .ini file
! rm -f bayescustomize.ini
! # Run 1
! python timcv.py -n $SETS > run1.txt
! # New .ini file
! cat > bayescustomize.ini < run2.txt
! # Generate rates
! python rates.py run1 run2 > runrates.txt
! # Compare rates
! python cmp.py run1s run2s | tee results.txt
--- 70,79 ----
hambias: 1.5
EOF
! # Run 2
! python timcv.py -n $SETS > run2.txt
! # Generate rates
! python rates.py run1 run2 > runrates.txt
! # Compare rates
! python cmp.py run1s run2s | tee results.txt
! ;;
! esac
From richiehindle@users.sourceforge.net Wed Sep 18 23:01:42 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Wed, 18 Sep 2002 15:01:42 -0700
Subject: [Spambayes-checkins]
spambayes README.txt,1.19,1.20 pop3proxy.py,1.1,1.2 hammie.py,1.16,1.17
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20824
Modified Files:
README.txt pop3proxy.py hammie.py
Log Message:
Added SPAM_THRESHOLD and createbayes() to hammie, so
that pop3proxy can use them.
Made pop3proxy add simple X-Hammie-Disposition headers
rather than using its own header format.
Made pop3proxy.py obey the Python style guide.
Removed the copyright and license from pop3proxy,py - I've
assigned copyright to the PSF.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** README.txt 17 Sep 2002 04:49:16 -0000 1.19
--- README.txt 18 Sep 2002 22:01:39 -0000 1.20
***************
*** 60,63 ****
--- 60,69 ----
Needs to be made faster, especially for writes.
+ pop3proxy.py
+ A spam-classifying POP3 proxy. It adds a spam-judgement header to
+ each mail as it's retrieved, so you can use your email client's
+ filters to deal with them without needing to fiddle with your email
+ delivery system.
+
Concrete Test Drivers
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** pop3proxy.py 16 Sep 2002 07:57:20 -0000 1.1
--- pop3proxy.py 18 Sep 2002 22:01:39 -0000 1.2
***************
*** 1,31 ****
#!/usr/bin/env python
! # pop3proxy is released under the terms of the following MIT-style license:
! #
! # Copyright (c) Entrian Solutions 2002
! #
! # Permission is hereby granted, free of charge, to any person obtaining a
! # copy of this software and associated documentation files (the "Software"),
! # to deal in the Software without restriction, including without limitation
! # the rights to use, copy, modify, merge, publish, distribute, sublicense,
[...1035 lines suppressed...]
# Named POP3 server, default port.
! main( args[ 0 ], 110, 110, pickleName, useDB )
! elif len( args ) == 2:
# Named POP3 server, named port.
! main( args[ 0 ], int( args[ 1 ] ), 110, pickleName, useDB )
else:
--- 571,581 ----
asyncore.loop()
! elif len(args) == 1:
# Named POP3 server, default port.
! main(args[0], 110, 110, pickleName, useDB)
! elif len(args) == 2:
# Named POP3 server, named port.
! main(args[0], int(args[1]), 110, pickleName, useDB)
else:
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** hammie.py 12 Sep 2002 05:10:02 -0000 1.16
--- hammie.py 18 Sep 2002 22:01:39 -0000 1.17
***************
*** 47,50 ****
--- 47,53 ----
DEFAULTDB = "hammie.db"
+ # Probability at which a message is considered spam
+ SPAM_THRESHOLD = 0.9
+
# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
***************
*** 232,236 ****
msg = email.message_from_file(input)
prob, clues = bayes.spamprob(tokenize(msg), True)
! if prob < 0.9:
disp = "No"
else:
--- 235,239 ----
msg = email.message_from_file(input)
prob, clues = bayes.spamprob(tokenize(msg), True)
! if prob < SPAM_THRESHOLD:
disp = "No"
else:
***************
*** 250,254 ****
i += 1
prob, clues = bayes.spamprob(tokenize(msg), True)
! isspam = prob >= 0.9
if hasattr(msg, '_mh_msgno'):
msgno = msg._mh_msgno
--- 253,257 ----
i += 1
prob, clues = bayes.spamprob(tokenize(msg), True)
! isspam = prob >= SPAM_THRESHOLD
if hasattr(msg, '_mh_msgno'):
msgno = msg._mh_msgno
***************
*** 263,266 ****
--- 266,288 ----
print "Total %d spam, %d ham" % (spams, hams)
+ def createbayes(pck=DEFAULTDB, usedb=False):
+ """Create a GrahamBayes instance for the given pickle (which
+ doesn't have to exist). Create a PersistentGrahamBayes if
+ usedb is True."""
+ if usedb:
+ bayes = PersistentGrahamBayes(pck)
+ else:
+ bayes = None
+ try:
+ fp = open(pck, 'rb')
+ except IOError, e:
+ if e.errno <> errno.ENOENT: raise
+ else:
+ bayes = pickle.load(fp)
+ fp.close()
+ if bayes is None:
+ bayes = classifier.GrahamBayes()
+ return bayes
+
def usage(code, msg=''):
"""Print usage message and sys.exit(code)."""
***************
*** 304,320 ****
save = False
! if usedb:
! bayes = PersistentGrahamBayes(pck)
! else:
! bayes = None
! try:
! fp = open(pck, 'rb')
! except IOError, e:
! if e.errno <> errno.ENOENT: raise
! else:
! bayes = pickle.load(fp)
! fp.close()
! if bayes is None:
! bayes = classifier.GrahamBayes()
if good:
--- 326,330 ----
save = False
! bayes = createbayes(pck, usedb)
if good:
From rubiconx@users.sourceforge.net Thu Sep 19 01:17:43 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Wed, 18 Sep 2002 17:17:43 -0700
Subject: [Spambayes-checkins] spambayes hammiesrv.py,NONE,1.1
runtest.sh,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26022
Modified Files:
runtest.sh
Added Files:
hammiesrv.py
Log Message:
* runtest now supports targets and a -r option to force re-rebal-ing
* new hammiesrv, with a nice clean Hammie class I probably ought to
move to hammie.py before any new code imports it
--- NEW FILE: hammiesrv.py ---
#! /usr/bin/env python
# A server version of hammie.py
# Server code
import SimpleXMLRPCServer
import email
import hammie
from tokenizer import tokenize
# Default header to add
DFL_HEADER = "X-Hammie-Disposition"
# Default spam cutoff
DFL_CUTOFF = 0.9
class Hammie:
def __init__(self, bayes):
self.bayes = bayes
def _scoremsg(self, msg, evidence=False):
"""Score an email.Message.
Returns the probability the message is spam. If evidence is
true, returns a tuple: (probability, clues), where clues is a
list of the words which contributed to the score.
"""
return self.bayes.spamprob(tokenize(msg), evidence)
def score(self, msg, evidence=False):
"""Score (judge) a message.
Pass in a message as a string.
Returns the probability the message is spam. If evidence is
true, returns a tuple: (probability, clues), where clues is a
list of the words which contributed to the score.
"""
return self._scoremsg(email.message_from_string(msg), evidence)
def filter(self, msg, header=DFL_HEADER, cutoff=DFL_CUTOFF):
"""Score (judge) a message and add a disposition header.
Pass in a message as a string. Optionally, set header to the
name of the header to add, and/or cutoff to the probability
value which must be met or exceeded for a message to get a 'Yes'
disposition.
Returns the same message with a new disposition header.
"""
msg = email.message_from_string(msg)
prob, clues = self._scoremsg(msg, True)
if prob < cutoff:
disp = "No"
else:
disp = "Yes"
disp += "; %.2f" % prob
disp += "; " + hammie.formatclues(clues)
msg.add_header(header, disp)
return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
def train(self, msg, is_spam):
"""Train bayes with a message.
msg should be the message as a string, and is_spam should be 1
if the message is spam, 0 if not.
Probabilities are not updated after this call is made; to do
that, call update_probabilities().
"""
self.bayes.learn(tokenize(msg), is_spam, False)
def train_ham(self, msg):
"""Train bayes with ham.
msg should be the message as a string.
Probabilities are not updated after this call is made; to do
that, call update_probabilities().
"""
self.train(msg, False)
def train_spam(self, msg):
"""Train bayes with spam.
msg should be the message as a string.
Probabilities are not updated after this call is made; to do
that, call update_probabilities().
"""
self.train(msg, True)
def update_probabilities(self):
"""Update probability values.
You would want to call this after a training session. It's
pretty slow, so if you have a lot of messages to train, wait
until you're all done before calling this.
"""
self.bayes.update_probabilities()
def main():
usedb = True
pck = "/home/neale/lib/hammie.db"
if usedb:
bayes = hammie.PersistentGrahamBayes(pck)
else:
bayes = None
try:
fp = open(pck, 'rb')
except IOError, e:
if e.errno != errno.ENOENT: raise
else:
bayes = pickle.load(fp)
fp.close()
if bayes is None:
import classifier
bayes = classifier.GrahamBayes()
server = SimpleXMLRPCServer.SimpleXMLRPCServer(("localhost", 7732))
server.register_instance(Hammie(bayes))
server.serve_forever()
if __name__ == "__main__":
main()
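Once hammiesrv is listening on localhost:7732, any XML-RPC client can call the registered Hammie methods. The round trip can be sketched as follows in modern Python; StubHammie and its keyword test are hypothetical stand-ins for a trained classifier, and port 7733 is chosen arbitrarily to avoid clashing with a real server:

```python
import threading
import xmlrpc.server
import xmlrpc.client

# Hypothetical stand-in for the Hammie class above, so the XML-RPC
# round trip can be shown without a trained classifier behind it.
class StubHammie:
    def score(self, msg):
        # Pretend any message mentioning "viagra" is spam.
        return 0.99 if "viagra" in msg.lower() else 0.01

# Serve the stub in a background thread, exactly as hammiesrv
# registers its Hammie instance.
server = xmlrpc.server.SimpleXMLRPCServer(("localhost", 7733),
                                          logRequests=False)
server.register_instance(StubHammie())
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: score a message as a string over XML-RPC.
proxy = xmlrpc.client.ServerProxy("http://localhost:7733")
prob = proxy.score("Subject: cheap viagra\n\nbuy now")
server.shutdown()
```

The same pattern works for the filter and train_* methods, since register_instance exposes every public method of the object.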
Index: runtest.sh
===================================================================
RCS file: /cvsroot/spambayes/spambayes/runtest.sh,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** runtest.sh 18 Sep 2002 17:44:25 -0000 1.2
--- runtest.sh 19 Sep 2002 00:17:41 -0000 1.3
***************
*** 16,20 ****
##
! # Test to run
TEST=${1:-robinson1}
--- 16,25 ----
##
! if [ "$1" = "-r" ]; then
! REBAL=1
! shift
! fi
!
! # Which test to run
TEST=${1:-robinson1}
***************
*** 25,36 ****
SETS=5
! # Put them all into reservoirs
! python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q
! python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q
! # Rebalance
! python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q
! python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q
case "$TEST" in
robinson1)
# This test requires you have an appropriately-modified
--- 30,50 ----
SETS=5
! if [ -n "$REBAL" ]; then
! # Put them all into reservoirs
! python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q
! python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q
! # Rebalance
! python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q
! python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q
! fi
case "$TEST" in
+ run2|useold)
+ python timcv.py -n $SETS > run2.txt
+
+ python rates.py run1 run2 > runrates.txt
+
+ python cmp.py run1s run2s | tee results.txt
+ ;;
robinson1)
# This test requires you have an appropriately-modified
From tim_one@users.sourceforge.net Thu Sep 19 07:30:18 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 18 Sep 2002 23:30:18 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.18,1.19 Tester.py,1.3,1.4 classifier.py,1.12,1.13
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25375
Modified Files:
Options.py Tester.py classifier.py
Log Message:
Making it easy to try Gary Robinson's probability combining scheme. Just
set:
[Classifier]
use_robinson_probability: True
[TestDriver]
spam_cutoff: 0.50
as a pair.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** Options.py 18 Sep 2002 01:41:58 -0000 1.18
--- Options.py 19 Sep 2002 06:30:15 -0000 1.19
***************
*** 89,93 ****
[TestDriver]
! # These control various displays in class TestDriver.Driver.
# Number of buckets in histograms.
--- 89,103 ----
[TestDriver]
! # These control various displays in class TestDriver.Driver, and Tester.Test.
!
! # A message is considered spam iff it scores greater than spam_cutoff.
! # If using Graham's combining scheme, 0.90 seems to work best for "small"
! # training sets. As the size of the training sets increase, there's not
! # yet any bound in sight for how low this can go (0.075 would work as
! # well as 0.90 on Tim's large c.l.py data).
! # For Gary Robinson's scheme, 0.50 works best for *us*. Other people
! # who have implemented Graham's scheme, and stuck to it in most respects,
! # report values closer to 0.70 working best for them.
! spam_cutoff: 0.90
# Number of buckets in histograms.
***************
*** 139,142 ****
--- 149,155 ----
max_discriminators: 16
+
+ # Use Gary Robinson's scheme for combining probabilities.
+ use_robinson_probability: False
"""
***************
*** 168,171 ****
--- 181,185 ----
'pickle_basename': string_cracker,
'show_charlimit': int_cracker,
+ 'spam_cutoff': float_cracker,
},
'Classifier': {'hambias': float_cracker,
***************
*** 175,179 ****
'unknown_spamprob': float_cracker,
'max_discriminators': int_cracker,
! 'adjust_probs_by_evidence_mass': boolean_cracker,
},
}
--- 189,193 ----
'unknown_spamprob': float_cracker,
'max_discriminators': int_cracker,
! 'use_robinson_probability': boolean_cracker,
},
}
Index: Tester.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Tester.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** Tester.py 13 Sep 2002 17:49:02 -0000 1.3
--- Tester.py 19 Sep 2002 06:30:15 -0000 1.4
***************
*** 1,2 ****
--- 1,4 ----
+ from Options import options
+
class Test:
# Pass a classifier instance (an instance of GrahamBayes).
***************
*** 83,87 ****
if callback:
callback(example, prob)
! is_spam_guessed = prob > 0.90
correct = is_spam_guessed == is_spam
if is_spam:
--- 85,89 ----
if callback:
callback(example, prob)
! is_spam_guessed = prob > options.spam_cutoff
correct = is_spam_guessed == is_spam
if is_spam:
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** classifier.py 18 Sep 2002 01:41:58 -0000 1.12
--- classifier.py 19 Sep 2002 06:30:15 -0000 1.13
***************
*** 314,329 ****
heapreplace(nbest, x)
! prob_product = inverse_prob_product = 1.0
! for distance, prob, word, record in nbest:
! if prob is None: # it's one of the dummies nbest started with
! continue
! if record is not None: # else wordinfo doesn't know about it
! record.killcount += 1
! if evidence:
! clues.append((word, prob))
! prob_product *= prob
! inverse_prob_product *= 1.0 - prob
- prob = prob_product / (prob_product + inverse_prob_product)
if evidence:
clues.sort(lambda a, b: cmp(a[1], b[1]))
--- 314,358 ----
heapreplace(nbest, x)
! if options.use_robinson_probability:
! # This combination method is due to Gary Robinson.
! # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
! # In preliminary tests, it did just as well as Graham's scheme,
! # but creates a definite "middle ground" around 0.5 where false
! # negatives and false positives can actually be found in non-trivial
! # numbers.
! P = Q = 1.0
! num_clues = 0
! for distance, prob, word, record in nbest:
! if prob is None: # it's one of the dummies nbest started with
! continue
! if record is not None: # else wordinfo doesn't know about it
! record.killcount += 1
! if evidence:
! clues.append((word, prob))
! num_clues += 1
! P *= 1.0 - prob
! Q *= prob
!
! if num_clues:
! P = 1.0 - P**(1./num_clues)
! Q = 1.0 - Q**(1./num_clues)
! prob = (P-Q)/(P+Q) # in -1 .. 1
! prob = 0.5 + prob/2 # shift to 0 .. 1
! else:
! prob = 0.5
! else:
! prob_product = inverse_prob_product = 1.0
! for distance, prob, word, record in nbest:
! if prob is None: # it's one of the dummies nbest started with
! continue
! if record is not None: # else wordinfo doesn't know about it
! record.killcount += 1
! if evidence:
! clues.append((word, prob))
! prob_product *= prob
! inverse_prob_product *= 1.0 - prob
!
! prob = prob_product / (prob_product + inverse_prob_product)
if evidence:
clues.sort(lambda a, b: cmp(a[1], b[1]))
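Robinson's geometric-mean combining, as implemented in the new branch above, can be written standalone. This sketch keeps the P/Q names from the checkin; the list of per-word probabilities stands in for the nbest clues:

```python
def robinson_prob(probs):
    """Combine per-word spam probabilities with Gary Robinson's
    geometric-mean scheme. Returns a score in 0.0 .. 1.0, with a
    'middle ground' near 0.5 when the evidence is mixed or absent."""
    if not probs:
        return 0.5
    n = len(probs)
    P = Q = 1.0
    for p in probs:
        P *= 1.0 - p
        Q *= p
    P = 1.0 - P ** (1.0 / n)   # evidence of hamminess
    Q = 1.0 - Q ** (1.0 / n)   # evidence of spamminess
    S = (P - Q) / (P + Q)      # in -1 .. 1
    return 0.5 + S / 2         # shift to 0 .. 1
```

With balanced evidence ([0.5, 0.5]) the score is exactly 0.5, which is why the log message pairs use_robinson_probability with spam_cutoff: 0.50.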
From anthonybaxter@users.sourceforge.net Thu Sep 19 09:58:00 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 19 Sep 2002 01:58:00 -0700
Subject: [Spambayes-checkins] website developer.ht,1.1.1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv7560
Modified Files:
developer.ht
Log Message:
duh.
Index: developer.ht
===================================================================
RCS file: /cvsroot/spambayes/website/developer.ht,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** developer.ht 19 Sep 2002 08:40:55 -0000 1.1.1.1
--- developer.ht 19 Sep 2002 08:57:58 -0000 1.2
***************
*** 25,29 ****
or even most cases.
There's a bunch of documentation on things that have already been tried
! available as links from the documentation page.
Collecting training data
--- 25,29 ----
or even most cases.
There's a bunch of documentation on things that have already been tried
! available as links from the documentation page.
Collecting training data
From anthonybaxter@users.sourceforge.net Thu Sep 19 10:34:59 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 19 Sep 2002 02:34:59 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.19,1.20
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20143
Modified Files:
Options.py
Log Message:
if it exists, load options from file(s) specified in env var BAYESCUSTOMIZE
rather than bayescustomize.ini. Much more convenient.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** Options.py 19 Sep 2002 06:30:15 -0000 1.19
--- Options.py 19 Sep 2002 09:34:56 -0000 1.20
***************
*** 4,8 ****
# XXX and must not conflict with OptionsClass method names.
! import sys
import StringIO
import ConfigParser
--- 4,8 ----
# XXX and must not conflict with OptionsClass method names.
! import sys, os
import StringIO
import ConfigParser
***************
*** 242,245 ****
del d
! options.mergefiles(['bayescustomize.ini'])
--- 242,249 ----
del d
! alternate = os.getenv('BAYESCUSTOMIZE')
! if alternate:
! options.mergefiles(alternate.split())
! else:
! options.mergefiles(['bayescustomize.ini'])
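The env-var-with-fallback logic added here is a small reusable pattern; in isolation it looks like this (a sketch, with the merge step itself left to the caller):

```python
import os

def config_files(default="bayescustomize.ini"):
    """Return the config files to merge: the whitespace-separated
    paths in $BAYESCUSTOMIZE when set, else the single default file.
    Mirrors the Options.py change above."""
    alternate = os.getenv("BAYESCUSTOMIZE")
    if alternate:
        return alternate.split()
    return [default]
```

Note that split() on whitespace means paths containing spaces can't be listed in BAYESCUSTOMIZE.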
From anthonybaxter@users.sourceforge.net Thu Sep 19 11:25:33 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 19 Sep 2002 03:25:33 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3207
Modified Files:
cmp.py
Log Message:
I got sick of filename completion resulting in 'no such file foo.txt.txt',
so cmp.py now looks for the provided filename if "filename".txt doesn't
exist.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** cmp.py 14 Sep 2002 00:03:51 -0000 1.8
--- cmp.py 19 Sep 2002 10:25:31 -0000 1.9
***************
*** 70,78 ****
print
print f1n, '->', f2n
! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n + '.txt'))
! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n + '.txt'))
print
--- 70,88 ----
print
+ def windowsfy(fn):
+ import os
+ if os.path.exists(fn + '.txt'):
+ return fn + '.txt'
+ else:
+ return fn
print f1n, '->', f2n
!
! f1n = windowsfy(f1n)
! f2n = windowsfy(f2n)
!
! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n))
! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n))
print
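The windowsfy helper is the standard exists-check fallback; standalone (a sketch):

```python
import os

def windowsfy(fn):
    """Prefer fn + '.txt' when that file exists (so tab-completion
    that already appended .txt doesn't yield 'foo.txt.txt');
    otherwise return fn unchanged."""
    if os.path.exists(fn + ".txt"):
        return fn + ".txt"
    return fn
```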
From rubiconx@users.sourceforge.net Thu Sep 19 19:15:25 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Thu, 19 Sep 2002 11:15:25 -0700
Subject: [Spambayes-checkins] website/pics - New directory
Message-ID:
Update of /cvsroot/spambayes/website/pics
In directory usw-pr-cvs1:/tmp/cvs-serv553/pics
Log Message:
Directory /cvsroot/spambayes/website/pics added to the repository
From rubiconx@users.sourceforge.net Thu Sep 19 19:16:13 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Thu, 19 Sep 2002 11:16:13 -0700
Subject: [Spambayes-checkins] website/pics banner.png,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/website/pics
In directory usw-pr-cvs1:/tmp/cvs-serv707/pics
Added Files:
banner.png
Log Message:
* Fixed the little picture in the corner.
--- NEW FILE: banner.png ---
(This appears to be a binary file; contents omitted.)
From rubiconx@users.sourceforge.net Thu Sep 19 19:16:13 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Thu, 19 Sep 2002 11:16:13 -0700
Subject: [Spambayes-checkins]
website/scripts/ht2html SpamBayesGenerator.py,1.1.1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/website/scripts/ht2html
In directory usw-pr-cvs1:/tmp/cvs-serv707/scripts/ht2html
Modified Files:
SpamBayesGenerator.py
Log Message:
* Fixed the little picture in the corner.
Index: SpamBayesGenerator.py
===================================================================
RCS file: /cvsroot/spambayes/website/scripts/ht2html/SpamBayesGenerator.py,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** SpamBayesGenerator.py 19 Sep 2002 08:40:56 -0000 1.1.1.1
--- SpamBayesGenerator.py 19 Sep 2002 18:16:11 -0000 1.2
***************
*** 1,2 ****
--- 1,3 ----
+ #! /usr/bin/env python
"""Generates the www.python.org website style
"""
***************
*** 50,60 ****
sitelink_fixer.massage(sitelinks, self.__d, aboves=1)
Banner.__init__(self, sitelinks)
- # calculate the random corner
- # XXX Should really do a list of the pics directory...
- NBANNERS = 64
- i = whrandom.randint(0, NBANNERS-1)
- s = "PyBanner%03d.gif" % i
- self.__d['banner'] = s
- self.__whichbanner = i
def get_meta(self):
--- 51,54 ----
***************
*** 99,126 ****
return '''
!
! ''' % \
self.__d
def get_corner_bgcolor(self):
! # this may not be 100% correct. it uses PIL to get the RGB values at
! # the corners of the image and then takes a vote as to the most likely
! # value. Some images may be `bizarre'. See .../pics/backgrounds.py
! return [
! '#3399ff', '#6699cc', '#3399ff', '#0066cc', '#3399ff',
! '#0066cc', '#0066cc', '#3399ff', '#3399ff', '#3399ff',
! '#3399ff', '#6699cc', '#3399ff', '#3399ff', '#ffffff',
! '#6699cc', '#0066cc', '#3399ff', '#0066cc', '#3399ff',
! '#6699cc', '#0066cc', '#6699cc', '#3399ff', '#3399ff',
! '#6699cc', '#3399ff', '#3399ff', '#6699cc', '#6699cc',
! '#0066cc', '#6699cc', '#0066cc', '#6699cc', '#0066cc',
! '#0066cc', '#6699cc', '#3399ff', '#0066cc', '#bbd6f1',
! '#0066cc', '#6699cc', '#3399ff', '#3399ff', '#0066cc',
! '#0066cc', '#0066cc', '#6699cc', '#6699cc', '#3399ff',
! '#3399ff', '#6699cc', '#0066cc', '#0066cc', '#6699cc',
! '#0066cc', '#6699cc', '#3399ff', '#6699cc', '#3399ff',
! '#d6ebff', '#6699cc', '#3399ff', '#0066cc',
! ][self.__whichbanner]
def get_body(self):
--- 93,102 ----
return '''
!
! ''' % \
self.__d
def get_corner_bgcolor(self):
! return "#ffffff"
def get_body(self):
From rubiconx@users.sourceforge.net Thu Sep 19 23:10:09 2002
From: rubiconx@users.sourceforge.net (Neale Pickett)
Date: Thu, 19 Sep 2002 15:10:09 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.23,1.24
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15839
Modified Files:
tokenizer.py
Log Message:
* In case of MessageParseError, just tokenize everything in the
message (including headers) as though it were the body of the
message. Thanks for the numerous tips, Tim!
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** tokenizer.py 17 Sep 2002 17:57:39 -0000 1.23
--- tokenizer.py 19 Sep 2002 22:10:07 -0000 1.24
***************
*** 1,2 ****
--- 1,3 ----
+ #! /usr/bin/env python
"""Module to tokenize email messages for spam filtering."""
***************
*** 840,856 ****
# Create an email Message object.
try:
! if hasattr(obj, "readline"):
! return email.message_from_file(obj)
! else:
! return email.message_from_string(obj)
except email.Errors.MessageParseError:
! return None
def tokenize(self, obj):
msg = self.get_message(obj)
- if msg is None:
- yield 'control: MessageParseError'
- # XXX Fall back to the raw body text?
- return
for tok in self.tokenize_headers(msg):
--- 841,855 ----
# Create an email Message object.
try:
! if hasattr(obj, "read"):
! obj = obj.read()
! return email.message_from_string(obj)
except email.Errors.MessageParseError:
! # XXX: This puts the headers in the payload...
! msg = email.Message.Message()
! msg.set_payload(obj)
! return msg
def tokenize(self, obj):
msg = self.get_message(obj)
for tok in self.tokenize_headers(msg):
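The new fallback behavior (wrap an unparseable message so its raw text, headers and all, is still tokenized as a body) translates to the modern email API like this. A sketch; MessageParseError is rare with today's parser, but the shape of the fallback is the same:

```python
import email
import email.errors
import email.message

def get_message(obj):
    """Parse obj (string or file-like) into a Message. On a parse
    failure, fall back to a bare Message whose payload is the raw
    text, so the tokenizer still sees *something*."""
    if hasattr(obj, "read"):
        obj = obj.read()
    try:
        return email.message_from_string(obj)
    except email.errors.MessageParseError:
        # XXX as the checkin notes: this puts the headers in the payload.
        msg = email.message.Message()
        msg.set_payload(obj)
        return msg
```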
From gward@users.sourceforge.net Fri Sep 20 00:29:34 2002
From: gward@users.sourceforge.net (Greg Ward)
Date: Thu, 19 Sep 2002 16:29:34 -0700
Subject: [Spambayes-checkins] website docs.ht,1.1.1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv5197
Modified Files:
docs.ht
Log Message:
Spell Tim's name right.
Beef up the glossary -- tighter (and more standard, IMHO) definition of spam.
Index: docs.ht
===================================================================
RCS file: /cvsroot/spambayes/website/docs.ht,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** docs.ht 19 Sep 2002 08:40:55 -0000 1.1.1.1
--- docs.ht 19 Sep 2002 23:29:32 -0000 1.2
***************
*** 32,36 ****
CVS commit messages
! Tim Peter's has whacked a whole lot of useful information into CVS
commit messages. As the project was moved from an obscure corner of the
python CVS tree, there's actually two sources of CVS commits.
--- 32,36 ----
CVS commit messages
! Tim Peters has whacked a whole lot of useful information into CVS
commit messages. As the project was moved from an obscure corner of the
python CVS tree, there's actually two sources of CVS commits.
***************
*** 52,62 ****
A useful(?) glossary of terminology
! - ham
- a non-spam. an email that is wanted by the user.
!
- f-n
- false negative
!
- f-p
- false positive
- false negative
- a spam that's incorrectly classified as ham.
- false positive
- a ham that's incorrectly classified as spam.
-
- spam
- an email that's not wanted by the end-user.
--- 52,69 ----
A useful(?) glossary of terminology
! - spam
- broadly speaking: any email that's not wanted by the
! end-user. More specifically: unsolicited bulk email; email
! that you do not want and did not ask for, and was sent to
! a whole bunch of people by automated means at the same time
! it was sent to you. This definition deliberately excludes viruses
! and those stupid jokes sent to you by your Aunt Tillie.
!
!
- ham
- the opposite of spam; not necessarily email that you want or
! that you asked for, just anything that's not unsolicited bulk email.
- false negative
- a spam that's incorrectly classified as ham.
- false positive
- a ham that's incorrectly classified as spam.
+
- f-n, FN
- (abbrev.) false negative
+
- f-p, FP
- (abbrev.) false positive
From gward@users.sourceforge.net Fri Sep 20 00:39:26 2002
From: gward@users.sourceforge.net (Greg Ward)
Date: Thu, 19 Sep 2002 16:39:26 -0700
Subject: [Spambayes-checkins]
website background.ht,NONE,1.1 docs.ht,1.2,1.3 links.h,1.1.1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv7867
Modified Files:
docs.ht links.h
Added Files:
background.ht
Log Message:
Moved a big chunk of docs.ht to new file background.ht.
--- NEW FILE: background.ht ---
Title: SpamBayes: Background Reading
Author-Email: spambayes@python.org
Background Reading
Theory
Sharpen your pencils, this is the mathematical background (such as it is).
- The paper that started the ball rolling:
Paul Graham's A Plan for Spam.
- Gary Robinson has an
interesting essay
suggesting some improvements to Graham's original approach.
more links? mail anthony at interlink.com.au
Mailing list archives
There's a lot of background on what's been tried available from
the mailing list archives. Initially, the discussion started on
the python-dev list, but then moved to the
spambayes list.
CVS commit messages
Tim Peters has whacked a whole lot of useful information into CVS
commit messages. As the project was moved from an obscure corner of the
python CVS tree, there's actually two sources of CVS commits.
- The older CVS repository via view CVS, or the entire changelog. Development here stopped on the 6th of September 2002.
- After that, work moved to this project's CVS tree
Index: docs.ht
===================================================================
RCS file: /cvsroot/spambayes/website/docs.ht,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** docs.ht 19 Sep 2002 23:29:32 -0000 1.2
--- docs.ht 19 Sep 2002 23:39:24 -0000 1.3
***************
*** 3,44 ****
Author: spambayes
- Background reading
-
- - The paper that started the ball rolling:
- Paul Graham's A Plan for Spam.
-
- Gary Robinson has an
- interesting essay
- suggesting some improvements to Graham's original approach.
-
- more links? mail anthony at interlink.com.au
-
- Mailing list archives
- There's a lot of background on what's been tried available from
- the mailing list archives. Initially, the discussion started on
- the python-dev list, but then moved to the
- spambayes list.
-
-
-
- CVS commit messages
- Tim Peters has whacked a whole lot of useful information into CVS
- commit messages. As the project was moved from an obscure corner of the
- python CVS tree, there's actually two sources of CVS commits.
-
-
- - The older CVS repository via view CVS, or the entire changelog. Development here stopped on the 6th of September 2002.
-
- After that, work moved to this project's CVS tree
-
-
Project documentation
--- 3,6 ----
Index: links.h
===================================================================
RCS file: /cvsroot/spambayes/website/links.h,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** links.h 19 Sep 2002 08:40:55 -0000 1.1.1.1
--- links.h 19 Sep 2002 23:39:24 -0000 1.2
***************
*** 1,4 ****
--- 1,5 ----
SpamBayes
- Home page
+
- Background
- Documentation
- Developers
From nascheme@users.sourceforge.net Fri Sep 20 04:14:44 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Thu, 19 Sep 2002 20:14:44 -0700
Subject: [Spambayes-checkins] spambayes neilfilter.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25139
Modified Files:
neilfilter.py
Log Message:
implement Maildir delivery. This allows the script to be used in a .qmail or
.forward file without a wrapper script.
Index: neilfilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/neilfilter.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** neilfilter.py 9 Sep 2002 21:21:54 -0000 1.1
--- neilfilter.py 20 Sep 2002 03:14:42 -0000 1.2
***************
*** 1,21 ****
#! /usr/bin/env python
! """Usage: %(program)s wordprobs.cdb
"""
import sys
import os
import email
from heapq import heapreplace
from sets import Set
from classifier import MIN_SPAMPROB, MAX_SPAMPROB, UNKNOWN_SPAMPROB, \
MAX_DISCRIMINATORS
- import cdb
program = sys.argv[0] # For usage(); referenced by docstring above
! from tokenizer import tokenize
! def spamprob(wordprobs, wordstream, evidence=False):
"""Return best-guess probability that wordstream is spam.
--- 1,27 ----
#! /usr/bin/env python
! """Usage: %(program)s wordprobs.cdb Maildir Spamdir
"""
import sys
import os
+ import errno
+ import time
+ import signal
+ import socket
import email
from heapq import heapreplace
from sets import Set
+ import cdb
+ from tokenizer import tokenize
from classifier import MIN_SPAMPROB, MAX_SPAMPROB, UNKNOWN_SPAMPROB, \
MAX_DISCRIMINATORS
program = sys.argv[0] # For usage(); referenced by docstring above
! BLOCK_SIZE = 10000
! SIZE_LIMIT = 5000000 # messages larger are not analyzed
! SPAM_THRESHOLD = 0.9
! def spamprob(wordprobs, wordstream):
"""Return best-guess probability that wordstream is spam.
***************
*** 24,31 ****
wordstream is an iterable object producing words.
The return value is a float in [0.0, 1.0].
-
- If optional arg evidence is True, the return value is a pair
- probability, evidence
- where evidence is a list of (word, probability) pairs.
"""
--- 30,33 ----
***************
*** 70,74 ****
# to tend in part to cancel out distortions introduced earlier by
# HAMBIAS. Experiments will decide the issue.
- clues = []
# First cancel out competing extreme clues (see comment block at
--- 72,75 ----
***************
*** 83,89 ****
# initial clues from the longer list into the probability
# computation.
- for dist, prob, word in shorter + longer[tokeep:]:
- if evidence:
- clues.append((word, prob))
for x in longer[:tokeep]:
heapreplace(nbest, x)
--- 84,87 ----
***************
*** 93,121 ****
if prob is None: # it's one of the dummies nbest started with
continue
- if evidence:
- clues.append((word, prob))
prob_product *= prob
inverse_prob_product *= 1.0 - prob
prob = prob_product / (prob_product + inverse_prob_product)
! if evidence:
! clues.sort(lambda a, b: cmp(a[1], b[1]))
! return prob, clues
! else:
! return prob
!
! def formatclues(clues, sep="; "):
! """Format the clues into something readable."""
! return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])
! def is_spam(wordprobs, input):
! """Filter (judge) a message"""
! msg = email.message_from_file(input)
! prob, clues = spamprob(wordprobs, tokenize(msg), True)
! #print "%.2f;" % prob, formatclues(clues)
! if prob < 0.9:
! return False
! else:
! return True
def usage(code, msg=''):
--- 91,118 ----
if prob is None: # it's one of the dummies nbest started with
continue
prob_product *= prob
inverse_prob_product *= 1.0 - prob
prob = prob_product / (prob_product + inverse_prob_product)
! return prob
! def maketmp(dir):
! hostname = socket.gethostname()
! pid = os.getpid()
! fd = -1
! for x in xrange(200):
! filename = "%d.%d.%s" % (time.time(), pid, hostname)
! pathname = "%s/tmp/%s" % (dir, filename)
! try:
! fd = os.open(pathname, os.O_WRONLY|os.O_CREAT|os.O_EXCL, 0600)
! except IOError, exc:
! if exc.errno not in (errno.EINTR, errno.EEXIST):
! raise
! else:
! break
! time.sleep(2)
! if fd == -1:
! raise SystemExit, "could not create a mail file"
! return (os.fdopen(fd, "wb"), pathname, filename)
def usage(code, msg=''):
***************
*** 128,139 ****
def main():
! if len(sys.argv) != 2:
usage(2)
! wordprobs = cdb.Cdb(open(sys.argv[1], 'rb'))
! if is_spam(wordprobs, sys.stdin):
! sys.exit(1)
! else:
! sys.exit(0)
if __name__ == "__main__":
--- 125,171 ----
def main():
! if len(sys.argv) != 4:
usage(2)
! wordprobfilename = sys.argv[1]
! hamdir = sys.argv[2]
! spamdir = sys.argv[3]
!
! signal.signal(signal.SIGALRM, lambda s: sys.exit(1))
! signal.alarm(24 * 60 * 60)
!
! # write message to temporary file (must be on same partition)
! tmpfile, pathname, filename = maketmp(hamdir)
! try:
! tmpfile.write(os.environ.get("DTLINE", "")) # delivered-to line
! bytes = 0
! blocks = []
! while 1:
! block = sys.stdin.read(BLOCK_SIZE)
! if not block:
! break
! bytes += len(block)
! if bytes < SIZE_LIMIT:
! blocks.append(block)
! tmpfile.write(block)
! tmpfile.close()
!
! if bytes < SIZE_LIMIT:
! msgdata = ''.join(blocks)
! del blocks
! msg = email.message_from_string(msgdata)
! del msgdata
! wordprobs = cdb.Cdb(open(wordprobfilename, 'rb'))
! prob = spamprob(wordprobs, tokenize(msg))
! else:
! prob = 0.0
!
! if prob > SPAM_THRESHOLD:
! os.rename(pathname, "%s/new/%s" % (spamdir, filename))
! else:
! os.rename(pathname, "%s/new/%s" % (hamdir, filename))
! except:
! os.unlink(pathname)
! raise
if __name__ == "__main__":
From nascheme@users.sourceforge.net Fri Sep 20 04:15:16 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Thu, 19 Sep 2002 20:15:16 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.20,1.21
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25271
Modified Files:
README.txt
Log Message:
Add a short description of my scripts.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** README.txt 18 Sep 2002 22:01:39 -0000 1.20
--- README.txt 20 Sep 2002 03:15:13 -0000 1.21
***************
*** 66,69 ****
--- 66,82 ----
delivery system.
+ neiltrain.py
+ Builds a CDB (constant database) file of word probabilities using
+ spam and non-spam mail. The database is intended for use with
+ neilfilter.py.
+
+ neilfilter.py
+ A delivery agent that uses the CDB created by neiltrain.py and
+ delivers a message to one of two Maildir message folders, depending
+ on the classifier score. Note that both Maildirs must be on the
+ same device. An example .qmail or .forward file would be:
+
+ |python2.3 spambayes/neilfilter.py wordprobs.cdb Maildir/ Mail/Spam/
+
Concrete Test Drivers
From tim_one@users.sourceforge.net Fri Sep 20 06:55:10 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 19 Sep 2002 22:55:10 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.24,1.25
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24875
Modified Files:
tokenizer.py
Log Message:
tokenize_headers(): Rearranged for better sanity, updated some comments,
simplified overly tortured logic in basic_header_tokenize.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** tokenizer.py 19 Sep 2002 22:10:07 -0000 1.24
--- tokenizer.py 20 Sep 2002 05:55:08 -0000 1.25
***************
*** 859,900 ****
def tokenize_headers(self, msg):
! # Special tagging of header lines.
# Basic header tokenization
! # Tokenize the contents of each header field just like the
! # text of the message body, using the name of the header as a
! # tag. Tokens look like "header:word". The basic approach is
! # simple and effective, but also very sensitive to biases in
! # the ham and spam collections. For example, if the ham and
! # spam were collected at different times, several headers with
! # date/time information will become the best discriminators.
# (Not just Date, but Received and X-From_.)
if options.basic_header_tokenize:
for k, v in msg.items():
k = k.lower()
- match = False
for rx in self.basic_skip:
! if rx.match(k) is not None:
! match = True
! continue
! if match:
! continue
! for w in subject_word_re.findall(v):
! for t in tokenize_word(w):
! yield "%s:%s" % (k, t)
if options.basic_header_tokenize_only:
return
-
- # XXX TODO Neil Schemenauer has gotten a good start on this
- # XXX (pvt email). The headers in my spam and ham corpora are
- # XXX so different (they came from different sources) that if
- # XXX I include them the classifier's job is trivial. Only
- # XXX some "safe" header lines are included here, where "safe"
- # XXX is specific to my sorry corpora.
-
- # Content-{Type, Disposition} and their params, and charsets.
- for x in msg.walk():
- for w in crack_content_xyz(x):
- yield w
# Subject:
--- 859,904 ----
def tokenize_headers(self, msg):
! # Special tagging of header lines and MIME metadata.
!
! # Content-{Type, Disposition} and their params, and charsets.
! # This is done for all MIME sections.
! for x in msg.walk():
! for w in crack_content_xyz(x):
! yield w
!
! # The rest is solely tokenization of header lines.
! # XXX The headers in my (Tim's) spam and ham corpora are so different
! # XXX (they came from different sources) that including several kinds
! # XXX of header analysis renders the classifier's job trivial. So
! # XXX lots of this is crippled now, controlled by an ever-growing
! # XXX collection of funky options.
# Basic header tokenization
! # Tokenize the contents of each header field in the way Subject lines
! # are tokenized later.
! # XXX Different kinds of tokenization have gotten better results on
! # XXX different header lines. No experiments have been run on
! # XXX whether the best choice is being made for each of the header
! # XXX lines tokenized by this section.
! # The name of the header is used as a tag. Tokens look like
! # "header:word". The basic approach is simple and effective, but
! # also very sensitive to biases in the ham and spam collections.
! # For example, if the ham and spam were collected at different
! # times, several headers with date/time information will become
! # the best discriminators.
# (Not just Date, but Received and X-From_.)
if options.basic_header_tokenize:
for k, v in msg.items():
k = k.lower()
for rx in self.basic_skip:
! if rx.match(k):
! break # do nothing -- we're supposed to skip this
! else:
! # Never found a match -- don't skip this.
! for w in subject_word_re.findall(v):
! for t in tokenize_word(w):
! yield "%s:%s" % (k, t)
if options.basic_header_tokenize_only:
return
# Subject:
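The rewritten skip loop relies on Python's for/else idiom: the else clause runs only if the loop finished without hitting break. A minimal standalone sketch of the same basic header tokenization (the patterns here are simplified stand-ins, and tokenize_word is omitted, so each word becomes one token):

```python
import re

# Simplified stand-ins for the tokenizer's compiled patterns (the real
# subject_word_re also admits high-bit bytes \x80-\xff).
subject_word_re = re.compile(r"[\w$.%]+")
basic_skip = [re.compile("received"), re.compile("date")]

def basic_header_tokens(headers):
    for k, v in headers:
        k = k.lower()
        for rx in basic_skip:
            if rx.match(k):
                break           # a skip pattern matched -- drop this header
        else:
            # No skip pattern matched: tag each word with the header name.
            for w in subject_word_re.findall(v):
                yield "%s:%s" % (k, w)

tokens = list(basic_header_tokens([("Subject", "cheap meds"),
                                   ("Date", "Fri, 20 Sep 2002")]))
# Date is skipped; tokens == ["subject:cheap", "subject:meds"]
```

The for/else form replaces the earlier `match` flag, which had the added bug that `continue` inside the inner loop never actually skipped anything.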
From tim_one@users.sourceforge.net Fri Sep 20 07:00:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 19 Sep 2002 23:00:08 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.25,1.26
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25864
Modified Files:
tokenizer.py
Log Message:
crack_uuencode(): Added a note about an obscure efficiency gimmick I
relied on but didn't think to mention before.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** tokenizer.py 20 Sep 2002 05:55:08 -0000 1.25
--- tokenizer.py 20 Sep 2002 06:00:06 -0000 1.26
***************
*** 769,773 ****
# is (new_text, sequence_of_tokens), where new_text no longer contains
# uuencoded stuff. Note that we're not bothering to decode it! Maybe
! # we should.
def crack_uuencode(text):
new_text = []
--- 769,781 ----
# is (new_text, sequence_of_tokens), where new_text no longer contains
# uuencoded stuff. Note that we're not bothering to decode it! Maybe
! # we should. One of my persistent false negatives is a spam containing
! # nothing but a uuencoded money.txt; OTOH, uuencoding seems to be on
! # its way out (that's an old spam).
! #
! # Efficiency note: This is cheaper than it looks if there aren't any
! # uuencoded sections. Under the covers, string[0:] is optimized to
! # return string (no new object is built), and likewise ''.join([string])
! # is optimized to return string. It would actually slow this code down
! # to special-case these "do nothing" special cases at the Python level!
def crack_uuencode(text):
new_text = []
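The efficiency note above can be checked directly: in CPython both "do nothing" operations hand back the very same string object, so no copy is made. This is a CPython implementation detail, not a language guarantee:

```python
s = "no uuencoded sections here"

# Full-slicing an immutable str returns the original object in CPython
# (there is nothing to copy).
t = s[0:]
# Joining a single-element list of exact strs is likewise optimized to
# return the element itself.
u = "".join([s])

print(t is s, u is s)  # in CPython this prints: True True
```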
From tim_one@users.sourceforge.net Fri Sep 20 07:03:14 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 19 Sep 2002 23:03:14 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.26,1.27
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26473
Modified Files:
tokenizer.py
Log Message:
Removed the code in support of tokenizing src= thingies. It was all
commented out because it made no difference when enabled. Note that
we pick up all http:// thingies regardless of their context anyway.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** tokenizer.py 20 Sep 2002 06:00:06 -0000 1.26
--- tokenizer.py 20 Sep 2002 06:03:12 -0000 1.27
***************
*** 578,593 ****
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
- # Anthony Baxter reported goodness from cracking src params.
- # Finding a src= thingie is complicated if we insist it appear in an
- # img or iframe tag, so this approximates reality with a fast and
- # non-stack-blowing simple regexp.
- src_re = re.compile(r"""
- \s
- src=['"]
- (?!https?:) # we suck out http thingies via a different gimmick
- ([^'"]{1,128}) # capture the guts, but don't go wild
- ['"]
- """, re.VERBOSE)
-
fname_sep_re = re.compile(r'[/\\:]')
--- 578,581 ----
***************
*** 1012,1026 ****
for t in tokens:
yield t
-
- # Anthony Baxter reported goodness from tokenizing src= params.
- # XXX This made no difference in my tests: both error rates
- # XXX across 20 runs were identical before and after. I suspect
- # XXX this is because Anthony got most good out of the http
- # XXX thingies in , but we
- # XXX picked those up in the last step (in src params and
- # XXX everywhere else). So this code is commented out.
- ## for fname in src_re.findall(text):
- ## for x in crack_filename(fname):
- ## yield "src:" + x
# Remove HTML/XML tags.
--- 1000,1003 ----
From tim_one@users.sourceforge.net Fri Sep 20 07:06:15 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 19 Sep 2002 23:06:15 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.27,1.28
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv27328
Modified Files:
tokenizer.py
Log Message:
tokenize_body(): Brought the docstring into line with current reality.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** tokenizer.py 20 Sep 2002 06:03:12 -0000 1.27
--- tokenizer.py 20 Sep 2002 06:06:13 -0000 1.28
***************
*** 965,976 ****
"""Generate a stream of tokens from an email Message.
- If a multipart/alternative section has both text/plain and text/html
- sections, the text/html section is ignored. This may not be a good
- idea (e.g., the sections may have different content).
-
HTML tags are always stripped from text/plain sections.
-
options.retain_pure_html_tags controls whether HTML tags are
! also stripped from text/html sections.
"""
--- 965,977 ----
"""Generate a stream of tokens from an email Message.
HTML tags are always stripped from text/plain sections.
options.retain_pure_html_tags controls whether HTML tags are
! also stripped from text/html sections. Except in special cases,
! it's recommended to leave that at its default of false.
!
! If a multipart/alternative section has both text/plain and text/html
! sections, options.ignore_redundant_html controls whether the HTML
! part is ignored. Except in special cases, it's recommended to
! leave that at its default of false.
"""
From tim_one@users.sourceforge.net Fri Sep 20 07:18:26 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 19 Sep 2002 23:18:26 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.28,1.29
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29244
Modified Files:
tokenizer.py
Log Message:
get_message(): Added docstring. Reduced useless nesting. Moved
inappropriate code out of a try block. In case of a message parse
error, used a cheap trick to try to get rid of the (probably malformed)
headers before wrapping the text in a bare Message object.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** tokenizer.py 20 Sep 2002 06:06:13 -0000 1.28
--- tokenizer.py 20 Sep 2002 06:18:24 -0000 1.29
***************
*** 832,848 ****
def get_message(self, obj):
if isinstance(obj, email.Message.Message):
return obj
! else:
! # Create an email Message object.
! try:
! if hasattr(obj, "read"):
! obj = obj.read()
! return email.message_from_string(obj)
! except email.Errors.MessageParseError:
! # XXX: This puts the headers in the payload...
! msg = email.Message.Message()
! msg.set_payload(obj)
! return msg
def tokenize(self, obj):
--- 832,864 ----
def get_message(self, obj):
+ """Return an email Message object.
+
+ The argument may be a Message object already, in which case it's
+ returned as-is.
+
+ If the argument is a string or file-like object (supports read()),
+ the email package is used to create a Message object from it. This
+ can fail if the message is malformed. In that case, the headers
+ (everything through the first blank line) are thrown out, and the
+ rest of the text is wrapped in a bare email.Message.Message.
+ """
+
if isinstance(obj, email.Message.Message):
return obj
! # Create an email Message object.
! if hasattr(obj, "read"):
! obj = obj.read()
! try:
! msg = email.message_from_string(obj)
! except email.Errors.MessageParseError:
! # Wrap the raw text in a bare Message object. Since the
! # headers are most likely damaged, we can't use the email
! # package to parse them, so just get rid of them first.
! i = obj.find('\n\n')
! if i >= 0:
! obj = obj[i+2:] # strip headers
! msg = email.Message.Message()
! msg.set_payload(obj)
! return msg
def tokenize(self, obj):
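The same fallback logic can be sketched against the modern email package names (the current parser is far more tolerant than the 2002 one, so the except branch rarely fires; this mirrors the checked-in logic rather than reproducing it exactly):

```python
import email
from email.message import Message

def get_message(obj):
    """Return an email Message, falling back to a bare payload wrapper."""
    if isinstance(obj, Message):
        return obj
    if hasattr(obj, "read"):
        obj = obj.read()
    try:
        return email.message_from_string(obj)
    except Exception:  # email.errors.MessageParseError in the old package
        # The headers are most likely damaged, so drop everything through
        # the first blank line and wrap the remainder in a bare Message.
        i = obj.find("\n\n")
        if i >= 0:
            obj = obj[i + 2:]   # strip headers
        msg = Message()
        msg.set_payload(obj)
        return msg
```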
From montanaro@users.sourceforge.net Fri Sep 20 16:24:57 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Fri, 20 Sep 2002 08:24:57 -0700
Subject: [Spambayes-checkins] spambayes .cvsignore,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16668
Modified Files:
.cvsignore
Log Message:
ignore the Data directory
Index: .cvsignore
===================================================================
RCS file: /cvsroot/spambayes/spambayes/.cvsignore,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** .cvsignore 7 Sep 2002 05:53:12 -0000 1.2
--- .cvsignore 20 Sep 2002 15:24:54 -0000 1.3
***************
*** 5,6 ****
--- 5,7 ----
*.zip
build
+ Data
From gvanrossum@users.sourceforge.net Fri Sep 20 20:30:54 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 20 Sep 2002 12:30:54 -0700
Subject: [Spambayes-checkins]
spambayes mboxutils.py,NONE,1.1 hammie.py,1.17,1.18 splitndirs.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3545
Modified Files:
hammie.py splitndirs.py
Added Files:
mboxutils.py
Log Message:
Moved hammie's getmbox out to a separate module, mboxutils, and
enhanced it to support a syntax to designate multiple MH mailboxes.
Augmented splitndirs.py to use this so it can work on MH mailbox
directories as well as on Unix mailboxes.
--- NEW FILE: mboxutils.py ---
"""Utilities for dealing with various types of mailboxes.
This is mostly a wrapper around the various useful classes in the
standard mailbox module, to do some intelligent guessing of the
mailbox type given a mailbox argument.
+foo -- MH mailbox +foo
+foo,bar -- MH mailboxes +foo and +bar concatenated
+ALL -- a shortcut for *all* MH mailboxes
/foo/bar -- (existing file) a Unix-style mailbox
/foo/bar/ -- (existing directory) a directory full of .txt and .lorien
files
/foo/Mail/bar/ -- (existing directory with /Mail/ in its path)
alternative way of spelling an MH mailbox
"""
from __future__ import generators
import os
import glob
import email
import mailbox
class DirOfTxtFileMailbox:
"""Mailbox directory consisting of .txt and .lorien files."""
def __init__(self, dirname, factory):
self.names = (glob.glob(os.path.join(dirname, "*.txt")) +
glob.glob(os.path.join(dirname, "*.lorien")))
self.names.sort()
self.factory = factory
def __iter__(self):
for name in self.names:
try:
f = open(name)
except IOError:
continue
yield self.factory(f)
f.close()
def _factory(fp):
# Helper for getmbox
try:
return email.message_from_file(fp)
except email.Errors.MessageParseError:
return ''
def _cat(seqs):
for seq in seqs:
for item in seq:
yield item
def getmbox(name):
"""Return an mbox iterator given a file/directory/folder name."""
if name.startswith("+"):
# MH folder name: +folder, +f1,f2,f2, or +ALL
name = name[1:]
import mhlib
mh = mhlib.MH()
if name == "ALL":
names = mh.listfolders()
elif ',' in name:
names = name.split(',')
else:
names = [name]
mboxes = []
mhpath = mh.getpath()
for name in names:
filename = os.path.join(mhpath, name)
mbox = mailbox.MHMailbox(filename, _factory)
mboxes.append(mbox)
if len(mboxes) == 1:
return iter(mboxes[0])
else:
return _cat(mboxes)
if os.path.isdir(name):
# XXX Bogus: use an MHMailbox if the pathname contains /Mail/,
# else a DirOfTxtFileMailbox.
if name.find("/Mail/") >= 0:
mbox = mailbox.MHMailbox(name, _factory)
else:
mbox = DirOfTxtFileMailbox(name, _factory)
else:
fp = open(name)
mbox = mailbox.PortableUnixMailbox(fp, _factory)
return iter(mbox)
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** hammie.py 18 Sep 2002 22:01:39 -0000 1.17
--- hammie.py 20 Sep 2002 19:30:52 -0000 1.18
***************
*** 34,42 ****
import glob
import email
- import classifier
import errno
import anydbm
import cPickle as pickle
program = sys.argv[0] # For usage(); referenced by docstring above
--- 34,44 ----
import glob
import email
import errno
import anydbm
import cPickle as pickle
+ import mboxutils
+ import classifier
+
program = sys.argv[0] # For usage(); referenced by docstring above
***************
*** 171,220 ****
- class DirOfTxtFileMailbox:
-
- """Mailbox directory consisting of .txt files."""
-
- def __init__(self, dirname, factory):
- self.names = glob.glob(os.path.join(dirname, "*.txt"))
- self.factory = factory
-
- def __iter__(self):
- for name in self.names:
- try:
- f = open(name)
- except IOError:
- continue
- yield self.factory(f)
- f.close()
-
-
- def getmbox(msgs):
- """Return an iterable mbox object given a file/directory/folder name."""
- def _factory(fp):
- try:
- return email.message_from_file(fp)
- except email.Errors.MessageParseError:
- return ''
-
- if msgs.startswith("+"):
- import mhlib
- mh = mhlib.MH()
- mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]),
- _factory)
- elif os.path.isdir(msgs):
- # XXX Bogus: use an MHMailbox if the pathname contains /Mail/,
- # else a DirOfTxtFileMailbox.
- if msgs.find("/Mail/") >= 0:
- mbox = mailbox.MHMailbox(msgs, _factory)
- else:
- mbox = DirOfTxtFileMailbox(msgs, _factory)
- else:
- fp = open(msgs)
- mbox = mailbox.PortableUnixMailbox(fp, _factory)
- return mbox
-
def train(bayes, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
! mbox = getmbox(msgs)
i = 0
for msg in mbox:
--- 173,179 ----
def train(bayes, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
! mbox = mboxutils.getmbox(msgs)
i = 0
for msg in mbox:
***************
*** 247,251 ****
"""Score (judge) all messages from a mailbox."""
# XXX The reporting needs work!
! mbox = getmbox(msgs)
i = 0
spams = hams = 0
--- 206,210 ----
"""Score (judge) all messages from a mailbox."""
# XXX The reporting needs work!
! mbox = mboxutils.getmbox(msgs)
i = 0
spams = hams = 0
Index: splitndirs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** splitndirs.py 8 Sep 2002 12:55:33 -0000 1.1
--- splitndirs.py 20 Sep 2002 19:30:52 -0000 1.2
***************
*** 47,50 ****
--- 47,52 ----
import getopt
+ import mboxutils
+
program = sys.argv[0]
***************
*** 86,90 ****
inputpath, outputbasepath = args
- infile = file(inputpath, 'rb')
outdirs = [outputbasepath + ("%d" % i) for i in range(1, n+1)]
for dir in outdirs:
--- 88,91 ----
***************
*** 92,96 ****
os.makedirs(dir)
! mbox = mailbox.PortableUnixMailbox(infile, _factory)
counter = 0
for msg in mbox:
--- 93,97 ----
os.makedirs(dir)
! mbox = mboxutils.getmbox(inputpath)
counter = 0
for msg in mbox:
***************
*** 104,113 ****
if verbose:
if counter % 100 == 0:
! print '.',
if verbose:
print
print counter, "messages split into", n, "directories"
- infile.close()
if __name__ == '__main__':
--- 105,114 ----
if verbose:
if counter % 100 == 0:
! sys.stdout.write('.')
! sys.stdout.flush()
if verbose:
print
print counter, "messages split into", n, "directories"
if __name__ == '__main__':
From gvanrossum@users.sourceforge.net Fri Sep 20 20:32:28 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 20 Sep 2002 12:32:28 -0700
Subject: [Spambayes-checkins] spambayes neiltrain.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4565
Modified Files:
neiltrain.py
Log Message:
Use mboxutils instead of a copy of getmbox().
Index: neiltrain.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/neiltrain.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** neiltrain.py 9 Sep 2002 21:21:54 -0000 1.1
--- neiltrain.py 20 Sep 2002 19:32:26 -0000 1.2
***************
*** 8,13 ****
--- 8,15 ----
import mailbox
import email
+
import classifier
import cdb
+ import mboxutils
program = sys.argv[0] # For usage(); referenced by docstring above
***************
*** 15,46 ****
from tokenizer import tokenize
- def getmbox(msgs):
- """Return an iterable mbox object"""
- def _factory(fp):
- try:
- return email.message_from_file(fp)
- except email.Errors.MessageParseError:
- return ''
-
- if msgs.startswith("+"):
- import mhlib
- mh = mhlib.MH()
- mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]),
- _factory)
- elif os.path.isdir(msgs):
- # XXX Bogus: use an MHMailbox if the pathname contains /Mail/,
- # else a DirOfTxtFileMailbox.
- if msgs.find("/Mail/") >= 0:
- mbox = mailbox.MHMailbox(msgs, _factory)
- else:
- mbox = DirOfTxtFileMailbox(msgs, _factory)
- else:
- fp = open(msgs)
- mbox = mailbox.PortableUnixMailbox(fp, _factory)
- return mbox
-
def train(bayes, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
! mbox = getmbox(msgs)
for msg in mbox:
bayes.learn(tokenize(msg), is_spam, False)
--- 17,23 ----
from tokenizer import tokenize
def train(bayes, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
! mbox = mboxutils.getmbox(msgs)
for msg in mbox:
bayes.learn(tokenize(msg), is_spam, False)
From gvanrossum@users.sourceforge.net Fri Sep 20 21:00:48 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 20 Sep 2002 13:00:48 -0700
Subject: [Spambayes-checkins] spambayes splitndirs.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13883
Modified Files:
splitndirs.py
Log Message:
Another refinement: in order to make nice training sets out of Bruce
G's spam collections, this script now supports multiple input mboxes.
Index: splitndirs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** splitndirs.py 20 Sep 2002 19:30:52 -0000 1.2
--- splitndirs.py 20 Sep 2002 20:00:45 -0000 1.3
***************
*** 3,7 ****
"""Split an mbox into N random directories of files.
! Usage: %(program)s [-h] [-s seed] [-v] -n N sourcembox outdirbase
Options:
--- 3,7 ----
"""Split an mbox into N random directories of files.
! Usage: %(program)s [-h] [-s seed] [-v] -n N sourcembox ... outdirbase
Options:
***************
*** 84,90 ****
usage(1, "an -n value > 1 is required")
! if len(args) != 2:
usage(1, "input mbox name and output base path are required")
! inputpath, outputbasepath = args
outdirs = [outputbasepath + ("%d" % i) for i in range(1, n+1)]
--- 84,90 ----
usage(1, "an -n value > 1 is required")
! if len(args) < 2:
usage(1, "input mbox name and output base path are required")
! inputpaths, outputbasepath = args[:-1], args[-1]
outdirs = [outputbasepath + ("%d" % i) for i in range(1, n+1)]
***************
*** 93,110 ****
os.makedirs(dir)
- mbox = mboxutils.getmbox(inputpath)
counter = 0
! for msg in mbox:
! i = random.randrange(n)
! astext = str(msg)
! #assert astext.endswith('\n')
! counter += 1
! msgfile = open('%s/%d' % (outdirs[i], counter), 'wb')
! msgfile.write(astext)
! msgfile.close()
! if verbose:
! if counter % 100 == 0:
! sys.stdout.write('.')
! sys.stdout.flush()
if verbose:
--- 93,111 ----
os.makedirs(dir)
counter = 0
! for inputpath in inputpaths:
! mbox = mboxutils.getmbox(inputpath)
! for msg in mbox:
! i = random.randrange(n)
! astext = str(msg)
! #assert astext.endswith('\n')
! counter += 1
! msgfile = open('%s/%d' % (outdirs[i], counter), 'wb')
! msgfile.write(astext)
! msgfile.close()
! if verbose:
! if counter % 100 == 0:
! sys.stdout.write('.')
! sys.stdout.flush()
if verbose:
From tim_one@users.sourceforge.net Sat Sep 21 01:15:18 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 20 Sep 2002 17:15:18 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.13,1.14
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24783
Modified Files:
classifier.py
Log Message:
Removed xspamprob() -- it's unused.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** classifier.py 19 Sep 2002 06:30:15 -0000 1.13
--- classifier.py 21 Sep 2002 00:15:16 -0000 1.14
***************
*** 361,535 ****
return prob
- # The same as spamprob(), except uses a corrected probability computation
- # accounting for P(spam) and P(not-spam). Since my training corpora had
- # a ham/spam ratio of 4000/2750, I'm in a good position to test this.
- # Using xspamprob() clearly made a major reduction in the false negative
- # rate, cutting it in half on some runs (this is after the f-n rate had
- # already been cut by a factor of 5 via other refinements). It also
- # uncovered two more very brief spams hiding in the ham corpora.
- #
- # OTOH, the # of fps increased. Especially vulnerable are extremely
- # short msgs of the "subscribe me"/"unsubscribe me" variety (while these
- # don't belong on a mailing list, they're not spam), and brief reasonable
- # msgs that simply don't have much evidence (to the human eye) to go on.
- # These were borderline before, and it's easy to push them over the edge.
- # For example, one f-p had subject
- #
- # Any Interest in EDIFACT Parser/Generator?
- #
- # and the body just
- #
- # Just curious.
- # --jim
- #
- # "Interest" in the subject line had spam prob 0.99, "curious." 0.01,
- # and nothing else was strong. Since my ham/spam ratio is bigger than
- # 1, any clue favoring spam favors spam more strongly under xspamprob()
- # than under spamprob().
- #
- # XXX Somewhat like spamprob(), learn() also computes probabilities as
- # XXX if the # of hams and spams were the same. If that were also
- # XXX fiddled to take nham and nspam into account (nb: I realize it
- # XXX already *looks* like it does -- but it doesn't), it would reduce
- # XXX the spam probabilities in my test run, and *perhaps* xspamprob
- # XXX wouldn't have such bad effect on the f-p story.
- #
- # Here are the comparative stats, with spamprob() in the left column and
- # xspamprob() in the right, across 20 runs:
- #
- # false positive percentages
- # 0.000 0.000 tied
- # 0.000 0.050 lost
- # 0.050 0.100 lost
- # 0.000 0.075 lost
- # 0.025 0.050 lost
- # 0.025 0.100 lost
- # 0.050 0.150 lost
- # 0.025 0.050 lost
- # 0.025 0.050 lost
- # 0.000 0.050 lost
- # 0.075 0.150 lost
- # 0.050 0.075 lost
- # 0.025 0.050 lost
- # 0.000 0.050 lost
- # 0.050 0.125 lost
- # 0.025 0.075 lost
- # 0.025 0.025 tied
- # 0.000 0.025 lost
- # 0.025 0.100 lost
- # 0.050 0.150 lost
- #
- # won 0 times
- # tied 2 times
- # lost 18 times
- #
- # total unique fp went from 8 to 30
- #
- # false negative percentages
- # 0.945 0.473 won
- # 0.836 0.582 won
- # 1.200 0.618 won
- # 1.418 0.836 won
- # 1.455 0.836 won
- # 1.091 0.691 won
- # 1.091 0.618 won
- # 1.236 0.691 won
- # 1.564 1.018 won
- # 1.236 0.618 won
- # 1.563 0.981 won
- # 1.563 0.800 won
- # 1.236 0.618 won
- # 0.836 0.400 won
- # 0.873 0.400 won
- # 1.236 0.545 won
- # 1.273 0.691 won
- # 1.018 0.327 won
- # 1.091 0.473 won
- # 1.490 0.618 won
- #
- # won 20 times
- # tied 0 times
- # lost 0 times
- #
- # total unique fn went from 292 to 162
- #
- # XXX This needs to be updated to incorporate the "cancel out competing
- # XXX extreme clues" twist.
- def xspamprob(self, wordstream, evidence=False):
- """Return best-guess probability that wordstream is spam.
-
- wordstream is an iterable object producing words.
- The return value is a float in [0.0, 1.0].
-
- If optional arg evidence is True, the return value is a pair
- probability, evidence
- where evidence is a list of (word, probability) pairs.
- """
-
- # A priority queue to remember the MAX_DISCRIMINATORS best
- # probabilities, where "best" means largest distance from 0.5.
- # The tuples are (distance, prob, word, wordinfo[word]).
- nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS
- smallest_best = -1.0
-
- # Counting a unique word multiple times hurts, although counting one
- # at most two times had some benefit when UNKNOWN_SPAMPROB was 0.2.
- # When that got boosted to 0.5, counting more than once became
- # counterproductive.
- unique_words = {}
-
- wordinfoget = self.wordinfo.get
- now = time.time()
-
- for word in wordstream:
- if word in unique_words:
- continue
- unique_words[word] = 1
-
- record = wordinfoget(word)
- if record is None:
- prob = UNKNOWN_SPAMPROB
- else:
- record.atime = now
- prob = record.spamprob
-
- distance = abs(prob - 0.5)
- if distance > smallest_best:
- # Subtle: we didn't use ">" instead of ">=" just to save
- # calls to heapreplace(). The real intent is that if
- # there are many equally strong indicators throughout the
- # message, we want to favor the ones that appear earliest:
- # it's expected that spam headers will often have smoking
- # guns, and, even when not, spam has to grab your attention
- # early (& note that when spammers generate large blocks of
- # random gibberish to throw off exact-match filters, it's
- # always at the end of the msg -- if they put it at the
- # start, *nobody* would read the msg).
- heapreplace(nbest, (distance, prob, word, record))
- smallest_best = nbest[0][0]
-
- # Compute the probability.
- if evidence:
- clues = []
- sp = float(self.nspam) / (self.nham + self.nspam)
- hp = 1.0 - sp
- prob_product = sp
- inverse_prob_product = hp
- for distance, prob, word, record in nbest:
- if prob is None: # it's one of the dummies nbest started with
- continue
- if record is not None: # else wordinfo doesn't know about it
- record.killcount += 1
- if evidence:
- clues.append((word, prob))
- prob_product *= prob / sp
- inverse_prob_product *= (1.0 - prob) / hp
-
- prob = prob_product / (prob_product + inverse_prob_product)
- if evidence:
- return prob, clues
- else:
- return prob
-
def learn(self, wordstream, is_spam, update_probabilities=True):
--- 361,364 ----
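For reference, the two combining rules the removed comment compares can be sketched side by side (illustrative stand-alone functions, not the classifier's actual methods):

```python
def graham_combine(probs):
    # Graham's rule: implicitly assumes P(spam) == P(ham).
    pp = ipp = 1.0
    for p in probs:
        pp *= p
        ipp *= 1.0 - p
    return pp / (pp + ipp)

def prior_adjusted_combine(probs, nham, nspam):
    # The removed xspamprob() twist: seed the products with the class
    # priors, and scale each clue by them.
    sp = float(nspam) / (nham + nspam)
    hp = 1.0 - sp
    pp, ipp = sp, hp
    for p in probs:
        pp *= p / sp
        ipp *= (1.0 - p) / hp
    return pp / (pp + ipp)
```

With nham == nspam the prior terms cancel and both rules agree; with more ham than spam (as in the 4000/2750 corpora), any clue favoring spam pushes the prior-adjusted score further toward 1, matching the false-positive behavior described in the comment.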
From tim_one@users.sourceforge.net Sat Sep 21 03:46:23 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 20 Sep 2002 19:46:23 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.20,1.21 classifier.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20258
Modified Files:
Options.py classifier.py
Log Message:
Added some speculative options for more of Gary Robinson's ideas. Will
explain on the spambayes list.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** Options.py 19 Sep 2002 09:34:56 -0000 1.20
--- Options.py 21 Sep 2002 02:46:20 -0000 1.21
***************
*** 150,155 ****
max_discriminators: 16
! # Use Gary Robinson's scheme for combining probabilities.
use_robinson_probability: False
"""
--- 150,168 ----
max_discriminators: 16
! ###########################################################################
! # Speculative options for Gary Robinson's ideas. These may go away, or
! # a bunch of incompatible stuff above may go away.
!
! # Use Gary's scheme for combining probabilities.
! use_robinson_combining: False
!
! # Use Gary's scheme for computing probabilities, along with its "a" and
! # "x" parameters.
use_robinson_probability: False
+ robinson_probability_a: 1.0
+ robinson_probability_x: 0.5
+
+ # Use Gary's scheme for ranking probabilities.
+ use_robinson_ranking: False
"""
***************
*** 189,193 ****
--- 202,210 ----
'unknown_spamprob': float_cracker,
'max_discriminators': int_cracker,
+ 'use_robinson_combining': boolean_cracker,
'use_robinson_probability': boolean_cracker,
+ 'robinson_probability_a': float_cracker,
+ 'robinson_probability_x': float_cracker,
+ 'use_robinson_ranking': boolean_cracker,
},
}
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** classifier.py 21 Sep 2002 00:15:16 -0000 1.14
--- classifier.py 21 Sep 2002 02:46:20 -0000 1.15
***************
*** 314,357 ****
heapreplace(nbest, x)
! if options.use_robinson_probability:
! # This combination method is due to Gary Robinson.
! # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
! # In preliminary tests, it did just as well as Graham's scheme,
! # but creates a definite "middle ground" around 0.5 where false
! # negatives and false positives can actually be found in non-trivial
! # numbers.
! P = Q = 1.0
! num_clues = 0
! for distance, prob, word, record in nbest:
! if prob is None: # it's one of the dummies nbest started with
! continue
! if record is not None: # else wordinfo doesn't know about it
! record.killcount += 1
! if evidence:
! clues.append((word, prob))
! num_clues += 1
! P *= 1.0 - prob
! Q *= prob
!
! if num_clues:
! P = 1.0 - P**(1./num_clues)
! Q = 1.0 - Q**(1./num_clues)
! prob = (P-Q)/(P+Q) # in -1 .. 1
! prob = 0.5 + prob/2 # shift to 0 .. 1
! else:
! prob = 0.5
! else:
! prob_product = inverse_prob_product = 1.0
! for distance, prob, word, record in nbest:
! if prob is None: # it's one of the dummies nbest started with
! continue
! if record is not None: # else wordinfo doesn't know about it
! record.killcount += 1
! if evidence:
! clues.append((word, prob))
! prob_product *= prob
! inverse_prob_product *= 1.0 - prob
! prob = prob_product / (prob_product + inverse_prob_product)
if evidence:
--- 314,329 ----
heapreplace(nbest, x)
! prob_product = inverse_prob_product = 1.0
! for distance, prob, word, record in nbest:
! if prob is None: # it's one of the dummies nbest started with
! continue
! if record is not None: # else wordinfo doesn't know about it
! record.killcount += 1
! if evidence:
! clues.append((word, prob))
! prob_product *= prob
! inverse_prob_product *= 1.0 - prob
! prob = prob_product / (prob_product + inverse_prob_product)
if evidence:
***************
*** 361,365 ****
return prob
-
def learn(self, wordstream, is_spam, update_probabilities=True):
"""Teach the classifier by example.
--- 333,336 ----
***************
*** 479,480 ****
--- 450,601 ----
if record.hamcount == 0 == record.spamcount:
del self.wordinfo[word]
+
+
+ #************************************************************************
+ # Some options change so much behavior that it's better to write a
+ # different method.
+ # CAUTION: These end up overwriting methods of the same name above.
+ # A subclass would be cleaner, but experiments will soon enough lead
+ # to only one of the alternatives surviving.
+
+ def robinson_spamprob(self, wordstream, evidence=False):
+ """Return best-guess probability that wordstream is spam.
+
+ wordstream is an iterable object producing words.
+ The return value is a float in [0.0, 1.0].
+
+ If optional arg evidence is True, the return value is a pair
+ probability, evidence
+ where evidence is a list of (word, probability) pairs.
+ """
+
+ from math import frexp
+
+ # A priority queue to remember the MAX_DISCRIMINATORS best
+ # probabilities, where "best" means largest distance from 0.5.
+ # The tuples are (distance, prob, word, wordinfo[word]).
+ nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS
+ smallest_best = -1.0
+
+ wordinfoget = self.wordinfo.get
+ now = time.time()
+ for word in Set(wordstream):
+ record = wordinfoget(word)
+ if record is None:
+ prob = UNKNOWN_SPAMPROB
+ else:
+ record.atime = now
+ prob = record.spamprob
+
+ distance = abs(prob - 0.5)
+ if distance > smallest_best:
+ heapreplace(nbest, (distance, prob, word, record))
+ smallest_best = nbest[0][0]
+
+ # Compute the probability.
+ clues = []
+
+ # This combination method is due to Gary Robinson.
+ # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
+ # In preliminary tests, it did just as well as Graham's scheme,
+ # but creates a definite "middle ground" around 0.5 where false
+ # negatives and false positives can actually be found in non-trivial
+ # numbers.
+
+ # The real P = this P times 2**Pexp. Likewise for Q. We're
+ # simulating unbounding dynamic float range by hand. If this pans
+ # out, *maybe* we should store logarithms in the database instead
+ # and just add them here.
+ P = Q = 1.0
+ Pexp = Qexp = 0
+ num_clues = 0
+ for distance, prob, word, record in nbest:
+ if prob is None: # it's one of the dummies nbest started with
+ continue
+ if record is not None: # else wordinfo doesn't know about it
+ record.killcount += 1
+ if evidence:
+ clues.append((word, prob))
+ num_clues += 1
+ P *= 1.0 - prob
+ Q *= prob
+ if P < 1e-200: # move back into range
+ P, e = frexp(P)
+ Pexp += e
+ if Q < 1e-200: # move back into range
+ Q, e = frexp(Q)
+ Qexp += e
+
+ P, e = frexp(P)
+ Pexp += e
+ Q, e = frexp(Q)
+ Qexp += e
+
+ if num_clues:
+ #P = 1.0 - P**(1./num_clues)
+ #Q = 1.0 - Q**(1./num_clues)
+ #
+ # (x*2**e)**n = x**n * 2**(e*n)
+ n = 1.0 / num_clues
+ P = 1.0 - P**n * 2.0**(Pexp * n)
+ Q = 1.0 - P**n * 2.0**(Qexp * n)
+
+ prob = (P-Q)/(P+Q) # in -1 .. 1
+ prob = 0.5 + prob/2 # shift to 0 .. 1
+ else:
+ prob = 0.5
+
+ if evidence:
+ clues.sort(lambda a, b: cmp(a[1], b[1]))
+ return prob, clues
+ else:
+ return prob
+
+ if options.use_robinson_combining:
+ spamprob = robinson_spamprob
+
+ def robinson_update_probabilities(self):
+ """Update the word probabilities in the spam database.
+
+ This computes a new probability for every word in the database,
+ so can be expensive. learn() and unlearn() update the probabilities
+ each time by default. Thay have an optional argument that allows
+ to skip this step when feeding in many messages, and in that case
+ you should call update_probabilities() after feeding the last
+ message and before calling spamprob().
+ """
+
+ nham = float(self.nham or 1)
+ nspam = float(self.nspam or 1)
+ A = options.robinson_probability_a
+ X = options.robinson_probability_x
+ AoverX = A/X
+ for word, record in self.wordinfo.iteritems():
+ # Compute prob(msg is spam | msg contains word).
+ # This is the Graham calculation, but stripped of biases, and
+ # of clamping into 0.01 thru 0.99.
+ hamcount = min(record.hamcount, nham)
+ hamratio = hamcount / nham
+
+ spamcount = min(record.spamcount, nspam)
+ spamratio = spamcount / nspam
+
+ prob = spamratio / (hamratio + spamratio)
+
+ # Now do Robinson's Bayesian adjustment.
+ #
+ # a + (n * p(w))
+ # f(w) = ---------------
+ # (a / x) + n
+ n = hamcount + spamratio
+ prob = (A + n * prob) / (AoverX + n)
+
+ if record.spamprob != prob:
+ record.spamprob = prob
+ # The next seemingly pointless line appears to be a hack
+ # to allow a persistent db to realize the record has changed.
+ self.wordinfo[word] = record
+
+
+ if options.use_robinson_probability:
+ update_probabilities = robinson_update_probabilities
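The frexp bookkeeping in robinson_spamprob above is easier to follow outside diff form. Here is a self-contained sketch of the same combining scheme (the helper name robinson_combine is illustrative, not from the checkin): it computes P = 1 - (prod(1-p))**(1/n), likewise Q, then maps S = (P-Q)/(P+Q) into 0 .. 1, carrying an explicit base-2 exponent so long products of small probabilities cannot underflow to 0.0.

```python
from math import frexp

def robinson_combine(probs, tiny=1e-200):
    # P tracks prod(1 - p), Q tracks prod(p).  The real values are
    # P * 2**Pexp and Q * 2**Qexp; frexp renormalizes the mantissa so
    # long products of small probabilities never underflow to 0.0.
    P = Q = 1.0
    Pexp = Qexp = 0
    for p in probs:
        P *= 1.0 - p
        Q *= p
        if P < tiny:                 # move back into range
            P, e = frexp(P)
            Pexp += e
        if Q < tiny:
            Q, e = frexp(Q)
            Qexp += e
    n = len(probs)
    if not n:
        return 0.5
    P, e = frexp(P)
    Pexp += e
    Q, e = frexp(Q)
    Qexp += e
    # (x * 2**e)**(1/n) = x**(1/n) * 2**(e/n)
    inv = 1.0 / n
    P = 1.0 - P**inv * 2.0**(Pexp * inv)
    Q = 1.0 - Q**inv * 2.0**(Qexp * inv)
    S = (P - Q) / (P + Q)   # in -1 .. 1
    return 0.5 + S / 2      # shift to 0 .. 1
```

With all clues at 0.5 the score stays at 0.5; a run of 1000 clues at 0.9 still scores near 0.9 even though 0.1**1000 underflows an IEEE double without the exponent bookkeeping.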
From tim_one@users.sourceforge.net Sat Sep 21 04:43:15 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 20 Sep 2002 20:43:15 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.15,1.16
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30766
Modified Files:
classifier.py
Log Message:
Fixed two egregious typos in the code (one a cut 'n paste screwup, the
other a word-completion snafu). Curiously, I don't think that repairing
the math is actually going to make much difference!
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** classifier.py 21 Sep 2002 02:46:20 -0000 1.15
--- classifier.py 21 Sep 2002 03:43:13 -0000 1.16
***************
*** 504,508 ****
# The real P = this P times 2**Pexp. Likewise for Q. We're
! # simulating unbounding dynamic float range by hand. If this pans
# out, *maybe* we should store logarithms in the database instead
# and just add them here.
--- 504,508 ----
# The real P = this P times 2**Pexp. Likewise for Q. We're
! # simulating unbounded dynamic float range by hand. If this pans
# out, *maybe* we should store logarithms in the database instead
# and just add them here.
***************
*** 539,543 ****
n = 1.0 / num_clues
P = 1.0 - P**n * 2.0**(Pexp * n)
! Q = 1.0 - P**n * 2.0**(Qexp * n)
prob = (P-Q)/(P+Q) # in -1 .. 1
--- 539,543 ----
n = 1.0 / num_clues
P = 1.0 - P**n * 2.0**(Pexp * n)
! Q = 1.0 - Q**n * 2.0**(Qexp * n)
prob = (P-Q)/(P+Q) # in -1 .. 1
***************
*** 588,592 ****
# f(w) = ---------------
# (a / x) + n
! n = hamcount + spamratio
prob = (A + n * prob) / (AoverX + n)
--- 588,593 ----
# f(w) = ---------------
# (a / x) + n
!
! n = hamcount + spamcount
prob = (A + n * prob) / (AoverX + n)
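The corrected word-probability adjustment can be sketched on its own. In the sketch below, the values a=0.1 and x=0.5 are illustrative only (the real knobs are robinson_probability_a and robinson_probability_x in Options.py), and it assumes the word was seen at least once; n uses raw counts, per the fix above.

```python
def robinson_adjust(hamcount, spamcount, nham, nspam, a=0.1, x=0.5):
    # Graham-style conditional probability, stripped of biases and of
    # clamping into 0.01 .. 0.99.
    nham = float(nham or 1)
    nspam = float(nspam or 1)
    hamratio = min(hamcount, nham) / nham
    spamratio = min(spamcount, nspam) / nspam
    p = spamratio / (hamratio + spamratio)
    # Robinson's Bayesian adjustment toward the prior x with strength a:
    #
    #        a + (n * p(w))
    # f(w) = ---------------
    #        (a / x) + n
    #
    # n is the raw ham+spam count (the 1.16 fix), so rarely seen words
    # are pulled toward x while heavily seen words keep p almost intact.
    n = hamcount + spamcount
    return (a + n * p) / (a / x + n)
```

A word seen once in spam scores about 0.92 rather than a flat 1.0, while a word seen 100 times in spam stays above 0.99: the adjustment only matters when the evidence is thin.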
From montanaro@users.sourceforge.net Sat Sep 21 15:15:34 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Sat, 21 Sep 2002 07:15:34 -0700
Subject: [Spambayes-checkins] spambayes rebal.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv27918
Modified Files:
rebal.py
Log Message:
provide a weak check against mixing ham and spam
Index: rebal.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rebal.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** rebal.py 14 Sep 2002 03:32:47 -0000 1.3
--- rebal.py 21 Sep 2002 14:15:30 -0000 1.4
***************
*** 127,130 ****
--- 127,138 ----
return 1
+ # weak check against mixing ham and spam
+ if ("Ham" in setpfx and "Spam" in resdir or
+ "Spam" in setpfx and "Ham" in resdir):
+ yn = raw_input("Reservoir and Set dirs appear not to match. "
+ "Continue? (y/n) ")
+ if yn.lower()[0:1] != 'y':
+ return 1
+
# if necessary, migrate random files to the reservoir
for (dir, fs) in stuff:
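The "weak check" reduces to one boolean predicate. Extracted as a standalone function (the name looks_mixed is illustrative, not from the checkin), it behaves like:

```python
def looks_mixed(setpfx, resdir):
    # rebal.py's guard: flag a run where the Set prefix and the
    # reservoir directory appear to disagree about ham vs. spam.
    return ("Ham" in setpfx and "Spam" in resdir or
            "Spam" in setpfx and "Ham" in resdir)
```

It is "weak" by design: it only substring-matches "Ham"/"Spam" in the two paths, so unconventional directory names slip through silently.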
From tim_one@users.sourceforge.net Sat Sep 21 21:25:52 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 21 Sep 2002 13:25:52 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.21,1.22 classifier.py,1.16,1.17
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18775
Modified Files:
Options.py classifier.py
Log Message:
New option robinson_minimum_prob_strength. On my large test, and on
small random-subset tests, setting this to 0.1 (and max_discriminators
to 1500) yields a remarkable improvement in the f-n rate, even over
what the all-default (Graham-like) scheme does.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** Options.py 21 Sep 2002 02:46:20 -0000 1.21
--- Options.py 21 Sep 2002 20:25:49 -0000 1.22
***************
*** 96,102 ****
# yet any bound in sight for how low this can go (0.075 would work as
# well as 0.90 on Tim's large c.l.py data).
! # For Gary Robinson's scheme, 0.50 works best for *us*. Other people
! # who have implemented Graham's scheme, and stuck to it in most respects,
! # report values closer to 0.70 working best for them.
spam_cutoff: 0.90
--- 96,103 ----
# yet any bound in sight for how low this can go (0.075 would work as
# well as 0.90 on Tim's large c.l.py data).
! # For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far. Note that you can easily deduce the effect
! # of setting spam_cutoff to any particular value by studying the score
! # histograms -- there's no need to run a test again to see what would happen.
spam_cutoff: 0.90
***************
*** 153,156 ****
--- 154,158 ----
# Speculative options for Gary Robinson's ideas. These may go away, or
# a bunch of incompatible stuff above may go away.
+ # CAUTION: evidence to date suggest setting spam_cutoff
# Use Gary's scheme for combining probabilities.
***************
*** 165,168 ****
--- 167,184 ----
# Use Gary's scheme for ranking probabilities.
use_robinson_ranking: False
+
+ # When scoring a message, ignore all words with
+ # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
+ # By default (0.0), nothing is ignored.
+ # Tim got a pretty clear improvement in f-n rate on his hasn't-improved-in-
+ # a-long-time large c.l.py test by using 0.1. No other values have been
+ # tried yet.
+ # Neil Schemenauer also reported good results from 0.1, making the all-
+ # Robinson scheme match the all-default Graham-like scheme on a smaller
+ # and different corpus.
+ # NOTE: Changing this may change the best spam_cutoff value for your
+ # corpus. Since one effect is to separate the means more, you'll probably
+ # want a higher spam_cutoff.
+ robinson_minimum_prob_strength: 0.0
"""
***************
*** 207,210 ****
--- 223,227 ----
'robinson_probability_x': float_cracker,
'use_robinson_ranking': boolean_cracker,
+ 'robinson_minimum_prob_strength': float_cracker,
},
}
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** classifier.py 21 Sep 2002 03:43:13 -0000 1.16
--- classifier.py 21 Sep 2002 20:25:49 -0000 1.17
***************
*** 471,474 ****
--- 471,475 ----
from math import frexp
+ mindist = options.robinson_minimum_prob_strength
# A priority queue to remember the MAX_DISCRIMINATORS best
***************
*** 489,493 ****
distance = abs(prob - 0.5)
! if distance > smallest_best:
heapreplace(nbest, (distance, prob, word, record))
smallest_best = nbest[0][0]
--- 490,494 ----
distance = abs(prob - 0.5)
! if distance >= mindist and distance > smallest_best:
heapreplace(nbest, (distance, prob, word, record))
smallest_best = nbest[0][0]
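The effect of robinson_minimum_prob_strength is just an extra filter in front of the fixed-size nbest heap. A minimal sketch (MAX_DISCRIMINATORS shrunk to 3 for illustration, and plain floats standing in for the classifier's (distance, prob, word, record) tuples):

```python
from heapq import heapreplace

MAX_DISCRIMINATORS = 3   # tiny for illustration; the real value is far larger

def select_clues(word_probs, mindist=0.0):
    # Keep the MAX_DISCRIMINATORS probabilities farthest from 0.5,
    # ignoring any clue weaker than mindist
    # (options.robinson_minimum_prob_strength).
    nbest = [(-1.0, -1.0)] * MAX_DISCRIMINATORS   # dummy sentinels
    smallest_best = -1.0
    for prob in word_probs:
        distance = abs(prob - 0.5)
        if distance >= mindist and distance > smallest_best:
            heapreplace(nbest, (distance, prob))
            smallest_best = nbest[0][0]
    return sorted(p for d, p in nbest if d >= 0.0)
```

With mindist=0.1, near-neutral clues like 0.52 or 0.48 never reach the heap at all, which is why the scheme separates the ham and spam score means more sharply.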
From tim_one@users.sourceforge.net Sat Sep 21 22:11:52 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 21 Sep 2002 14:11:52 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.22,1.23
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30007
Modified Files:
Options.py
Log Message:
Nuked a stray sentence fragment in a comment.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** Options.py 21 Sep 2002 20:25:49 -0000 1.22
--- Options.py 21 Sep 2002 21:11:50 -0000 1.23
***************
*** 154,158 ****
# Speculative options for Gary Robinson's ideas. These may go away, or
# a bunch of incompatible stuff above may go away.
- # CAUTION: evidence to date suggest setting spam_cutoff
# Use Gary's scheme for combining probabilities.
--- 154,157 ----
From tim_one@users.sourceforge.net Sat Sep 21 22:19:43 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 21 Sep 2002 14:19:43 -0700
Subject: [Spambayes-checkins] spambayes rebal.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv31474
Modified Files:
rebal.py
Log Message:
Stopped making -Q imply -q: these options control very different kinds of
messages (progress output vs. move confirmations), and it wasn't at all
clear from the docs that -Q would imply -q.
Index: rebal.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rebal.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** rebal.py 21 Sep 2002 14:15:30 -0000 1.4
--- rebal.py 21 Sep 2002 21:19:40 -0000 1.5
***************
*** 9,17 ****
-r res - specify an alternate reservoir [%(RESDIR)s]
-s set - specify an alternate Set pfx [%(SETPFX)s]
! -n num - specify number of files per dir [%(NPERDIR)s]
-v - tell user what's happening [%(VERBOSE)s]
-q - be quiet about what's happening [not %(VERBOSE)s]
-c - confirm file moves into Set directory [%(CONFIRM)s]
! -Q - be quiet and don't confirm moves
The script will work with a variable number of Set directories, but they
--- 9,17 ----
-r res - specify an alternate reservoir [%(RESDIR)s]
-s set - specify an alternate Set pfx [%(SETPFX)s]
! -n num - specify number of files per Set dir desired [%(NPERDIR)s]
-v - tell user what's happening [%(VERBOSE)s]
-q - be quiet about what's happening [not %(VERBOSE)s]
-c - confirm file moves into Set directory [%(CONFIRM)s]
! -Q - don't confirm moves; this is independent of -v/-q
The script will work with a variable number of Set directories, but they
***************
*** 104,108 ****
verbose = False
elif opt == "-Q":
! verbose = confirm = False
elif opt == "-h":
usage()
--- 104,108 ----
verbose = False
elif opt == "-Q":
! confirm = False
elif opt == "-h":
usage()
From gvanrossum@users.sourceforge.net Sun Sep 22 01:19:01 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Sat, 21 Sep 2002 17:19:01 -0700
Subject: [Spambayes-checkins] spambayes rates.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2806
Modified Files:
rates.py
Log Message:
When basename.txt doesn't exist, try basename.
Index: rates.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rates.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** rates.py 14 Sep 2002 00:03:51 -0000 1.4
--- rates.py 22 Sep 2002 00:18:58 -0000 1.5
***************
*** 31,35 ****
def doit(basename):
! ifile = file(basename + '.txt')
interesting = filter(lambda line: line.startswith('-> '), ifile)
ifile.close()
--- 31,38 ----
def doit(basename):
! try:
! ifile = file(basename + '.txt')
! except IOError:
! ifile = file(basename)
interesting = filter(lambda line: line.startswith('-> '), ifile)
ifile.close()
From tim_one@users.sourceforge.net Sun Sep 22 05:19:10 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 21 Sep 2002 21:19:10 -0700
Subject: [Spambayes-checkins] spambayes rates.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9977
Modified Files:
rates.py
Log Message:
Brought the module docstring back into line with the truth.
Got rid of some unused computations.
Index: rates.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rates.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** rates.py 22 Sep 2002 00:18:58 -0000 1.5
--- rates.py 22 Sep 2002 04:19:08 -0000 1.6
***************
*** 7,18 ****
basename + '.txt'
! contains output from timtest.py, scans that file for summary statistics,
! displays them to stdout, and also writes them to file
basename + 's.txt'
! (where the 's' means 'summary'). This doesn't need a full output file, and
! will display stuff for as far as the output file has gotten so far.
Two of these summary files can later be fed to cmp.py.
--- 7,22 ----
basename + '.txt'
+ or
+ basename
! contains output from one of the test drivers (timcv, mboxtest, timtest),
! scans that file for summary statistics, displays them to stdout, and also
! writes them to file
basename + 's.txt'
! (where the 's' means 'summary'). This doesn't need a full output file
! from a test run, and will display stuff for as far as the output file
! has gotten so far.
Two of these summary files can later be fed to cmp.py.
***************
*** 49,53 ****
ntests = nfn = nfp = 0
sumfnrate = sumfprate = 0.0
- ntrainedham = ntrainedspam = 0
for line in interesting:
--- 53,56 ----
***************
*** 58,63 ****
#-> tested 4000 hams & 2750 spams against 8000 hams & 5500 spams
if line.startswith('-> tested '):
- ntrainedham += int(fields[-5])
- ntrainedspam += int(fields[-2])
ntests += 1
continue
--- 61,64 ----
From tim_one@users.sourceforge.net Sun Sep 22 05:59:56 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 21 Sep 2002 21:59:56 -0700
Subject: [Spambayes-checkins] spambayes LICENSE.txt,NONE,1.1
README.txt,1.21,1.22
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16347
Modified Files:
README.txt
Added Files:
LICENSE.txt
Log Message:
Added a simplified version of the PSF license to the project, and asserted
copyright for the PSF. "Simplified" means got rid of references to Python,
and dropped the stack of BeOpen/CNRI/CWI licenses (they clearly have no
claim on *this* software).
--- NEW FILE: LICENSE.txt ---
Copyright (C) 2002 Python Software Foundation; All Rights Reserved
The Python Software Foundation (PSF) holds copyright on all material
in this project. You may use it under the terms of the PSF license:
PSF LICENSE AGREEMENT FOR THE SPAMBAYES PROJECT
-----------------------------------------------
1. This LICENSE AGREEMENT is between the Python Software Foundation
("PSF"), and the Individual or Organization ("Licensee") accessing and
otherwise using the spambayes software ("Software") in source or binary
form and its associated documentation.
2. Subject to the terms and conditions of this License Agreement, PSF
hereby grants Licensee a nonexclusive, royalty-free, world-wide
license to reproduce, analyze, test, perform and/or display publicly,
prepare derivative works, distribute, and otherwise use the Software
alone or in any derivative version, provided, however, that PSF's
License Agreement and PSF's notice of copyright, i.e., "Copyright (c)
2002 Python Software Foundation; All Rights Reserved" are retained
in the Software alone or in any derivative version prepared by Licensee.
3. In the event Licensee prepares a derivative work that is based on
or incorporates the Software or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to the Software.
4. PSF is making the Software available to Licensee on an "AS IS"
basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE
SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING THE SOFTWARE,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. Nothing in this License Agreement shall be deemed to create any
relationship of agency, partnership, or joint venture between PSF and
Licensee. This License Agreement does not grant permission to use PSF
trademarks or trade name in a trademark sense to endorse or promote
products or services of Licensee, or any third party.
8. By copying, installing or otherwise using the Software, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** README.txt 20 Sep 2002 03:15:13 -0000 1.21
--- README.txt 22 Sep 2002 04:59:54 -0000 1.22
***************
*** 1,2 ****
--- 1,9 ----
+ Copyright (C) 2002 Python Software Foundation; All Rights Reserved
+
+ The Python Software Foundation (PSF) holds copyright on all material
+ in this project. You may use it under the terms of the PSF license;
+ see LICENSE.txt.
+
+
Assorted clues.
***************
*** 70,74 ****
spam and non-spam mail. The database is intended for use with
neilfilter.py.
!
neilfilter.py
A delivery agent that uses the CDB created by neiltrain.py and
--- 77,81 ----
spam and non-spam mail. The database is intended for use with
neilfilter.py.
!
neilfilter.py
A delivery agent that uses the CDB created by neiltrain.py and
From gvanrossum@users.sourceforge.net Sun Sep 22 07:58:38 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Sat, 21 Sep 2002 23:58:38 -0700
Subject: [Spambayes-checkins]
spambayes heapq.py,NONE,1.1 sets.py,NONE,1.1 TestDriver.py,1.4,1.5
cdb.py,1.2,1.3 hammie.py,1.18,1.19 mboxtest.py,1.7,1.8
timcv.py,1.6,1.7 timtest.py,1.26,1.27 tokenizer.py,1.29,1.30
unheader.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32550
Modified Files:
TestDriver.py cdb.py hammie.py mboxtest.py timcv.py timtest.py
tokenizer.py unheader.py
Added Files:
heapq.py sets.py
Log Message:
Make this Python 2.2.1-compatible[*], by:
- adding "from __future__ import generators" to all files using
'yield' in 6 files;
- spelling out "for i, x in enumerate(s)" using range(len(s)) etc. in
2 files;
- in tokenizer.py, changing get_content_maintype() and
get_content_type() into get_main_type('text') and
get_type('text/plain'), respectively (the defaults are necessary
because these older APIs default to None rather than to text/plain
as they should in most contexts). [**]
I haven't tried to run all tools, but I've tried timcv.py, rates.py
and cmp.py. This invokes most code that Tim wrote. I grepped for
enumerate() and yield.
[*] But not Python 2.2-compatible. There are too many places using
True or False (none using bool() though).
[**] XXX should a 'text/plain' default be added to other uses of
get_type() in tokenizer.py? The default is None, and I see one place
that asks "if part.get_type() == 'text/plain'".
--- NEW FILE: heapq.py ---
# -*- coding: Latin-1 -*-
"""Heap queue algorithm (a.k.a. priority queue).
Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for
all k, counting elements from 0. For the sake of comparison,
non-existing elements are considered to be infinite. The interesting
property of a heap is that a[0] is always its smallest element.
Usage:
heap = [] # creates an empty heap
heappush(heap, item) # pushes a new item on the heap
item = heappop(heap) # pops the smallest item from the heap
item = heap[0] # smallest item on the heap without popping it
heapify(x) # transforms list into a heap, in-place, in linear time
item = heapreplace(heap, item) # pops and returns smallest item, and adds
# new item; the heap size is unchanged
Our API differs from textbook heap algorithms as follows:
- We use 0-based indexing. This makes the relationship between the
index for a node and the indexes for its children slightly less
obvious, but is more suitable since Python uses 0-based indexing.
- Our heappop() method returns the smallest item, not the largest.
These two make it possible to view the heap as a regular Python list
without surprises: heap[0] is the smallest item, and heap.sort()
maintains the heap invariant!
"""
# Original code by Kevin O'Connor, augmented by Tim Peters
__about__ = """Heap queues
[explanation by François Pinard]
Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for
all k, counting elements from 0. For the sake of comparison,
non-existing elements are considered to be infinite. The interesting
property of a heap is that a[0] is always its smallest element.
The strange invariant above is meant to be an efficient memory
representation for a tournament. The numbers below are `k', not a[k]:
0
1 2
3 4 5 6
7 8 9 10 11 12 13 14
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
In the tree above, each cell `k' is topping `2*k+1' and `2*k+2'. In
a usual binary tournament we see in sports, each cell is the winner
over the two cells it tops, and we can trace the winner down the tree
to see all opponents s/he had. However, in many computer applications
of such tournaments, we do not need to trace the history of a winner.
To be more memory efficient, when a winner is promoted, we try to
replace it by something else at a lower level, and the rule becomes
that a cell and the two cells it tops contain three different items,
but the top cell "wins" over the two topped cells.
If this heap invariant is protected at all times, index 0 is clearly
the overall winner. The simplest algorithmic way to remove it and
find the "next" winner is to move some loser (let's say cell 30 in the
diagram above) into the 0 position, and then percolate this new 0 down
the tree, exchanging values, until the invariant is re-established.
This is clearly logarithmic on the total number of items in the tree.
By iterating over all items, you get an O(n ln n) sort.
A nice feature of this sort is that you can efficiently insert new
items while the sort is going on, provided that the inserted items are
not "better" than the last 0'th element you extracted. This is
especially useful in simulation contexts, where the tree holds all
incoming events, and the "win" condition means the smallest scheduled
time. When an event schedules other events for execution, they are
scheduled into the future, so they can easily go into the heap. So, a
heap is a good structure for implementing schedulers (this is what I
used for my MIDI sequencer :-).
Various structures for implementing schedulers have been extensively
studied, and heaps are good for this, as they are reasonably speedy,
the speed is almost constant, and the worst case is not much different
than the average case. However, there are other representations which
are more efficient overall, yet the worst cases might be terrible.
Heaps are also very useful in big disk sorts. You most probably all
know that a big sort implies producing "runs" (which are pre-sorted
sequences, whose size is usually related to the amount of CPU memory),
followed by merging passes for these runs, which merging is often
very cleverly organised[1]. It is very important that the initial
sort produces the longest runs possible. Tournaments are a good way
to achieve that. If, using all the memory available to hold a tournament, you
replace and percolate items that happen to fit the current run, you'll
produce runs which are twice the size of the memory for random input,
and much better for input fuzzily ordered.
Moreover, if you output the 0'th item on disk and get an input which
may not fit in the current tournament (because the value "wins" over
the last output value), it cannot fit in the heap, so the size of the
heap decreases. The freed memory could be cleverly reused immediately
for progressively building a second heap, which grows at exactly the
same rate the first heap is melting. When the first heap completely
vanishes, you switch heaps and start a new run. Clever and quite
effective!
In a word, heaps are useful memory structures to know. I use them in
a few applications, and I think it is good to keep a `heap' module
around. :-)
--------------------
[1] The disk balancing algorithms which are current, nowadays, are
more annoying than clever, and this is a consequence of the seeking
capabilities of the disks. On devices which cannot seek, like big
tape drives, the story was quite different, and one had to be very
clever to ensure (far in advance) that each tape movement will be the
most effective possible (that is, will best participate at
"progressing" the merge). Some tapes were even able to read
backwards, and this was also used to avoid the rewinding time.
Believe me, real good tape sorts were quite spectacular to watch!
From all times, sorting has always been a Great Art! :-)
"""
def heappush(heap, item):
"""Push item onto heap, maintaining the heap invariant."""
heap.append(item)
_siftdown(heap, 0, len(heap)-1)
def heappop(heap):
"""Pop the smallest item off the heap, maintaining the heap invariant."""
lastelt = heap.pop() # raises appropriate IndexError if heap is empty
if heap:
returnitem = heap[0]
heap[0] = lastelt
_siftup(heap, 0)
else:
returnitem = lastelt
return returnitem
def heapreplace(heap, item):
"""Pop and return the current smallest value, and add the new item.
This is more efficient than heappop() followed by heappush(), and can be
more appropriate when using a fixed-size heap. Note that the value
returned may be larger than item! That constrains reasonable uses of
this routine.
"""
returnitem = heap[0] # raises appropriate IndexError if heap is empty
heap[0] = item
_siftup(heap, 0)
return returnitem
def heapify(x):
"""Transform list into a heap, in-place, in O(len(heap)) time."""
n = len(x)
# Transform bottom-up. The largest index there's any point to looking at
# is the largest with a child index in-range, so must have 2*i + 1 < n,
# or i < (n-1)/2. If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
# j-1 is the largest, which is n//2 - 1. If n is odd = 2*j+1, this is
# (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
for i in xrange(n//2 - 1, -1, -1):
_siftup(x, i)
# 'heap' is a heap at all indices >= startpos, except possibly for pos. pos
# is the index of a leaf with a possibly out-of-order value. Restore the
# heap invariant.
def _siftdown(heap, startpos, pos):
newitem = heap[pos]
# Follow the path to the root, moving parents down until finding a place
# newitem fits.
while pos > startpos:
parentpos = (pos - 1) >> 1
parent = heap[parentpos]
if parent <= newitem:
break
heap[pos] = parent
pos = parentpos
heap[pos] = newitem
# The child indices of heap index pos are already heaps, and we want to make
# a heap at index pos too. We do this by bubbling the smaller child of
# pos up (and so on with that child's children, etc) until hitting a leaf,
# then using _siftdown to move the oddball originally at index pos into place.
#
# We *could* break out of the loop as soon as we find a pos where newitem <=
# both its children, but turns out that's not a good idea, and despite that
# many books write the algorithm that way. During a heap pop, the last array
# element is sifted in, and that tends to be large, so that comparing it
# against values starting from the root usually doesn't pay (= usually doesn't
# get us out of the loop early). See Knuth, Volume 3, where this is
# explained and quantified in an exercise.
#
# Cutting the # of comparisons is important, since these routines have no
# way to extract "the priority" from an array element, so that intelligence
# is likely to be hiding in custom __cmp__ methods, or in array elements
# storing (priority, record) tuples. Comparisons are thus potentially
# expensive.
#
# On random arrays of length 1000, making this change cut the number of
# comparisons made by heapify() a little, and those made by exhaustive
# heappop() a lot, in accord with theory. Here are typical results from 3
# runs (3 just to demonstrate how small the variance is):
#
# Compares needed by heapify   Compares needed by 1000 heappops
# --------------------------   --------------------------------
#        1837 cut to 1663             14996 cut to 8680
#        1855 cut to 1659             14966 cut to 8678
#        1847 cut to 1660             15024 cut to 8703
#
# Building the heap by using heappush() 1000 times instead required
# 2198, 2148, and 2219 compares: heapify() is more efficient, when
# you can use it.
#
# The total compares needed by list.sort() on the same lists were 8627,
# 8627, and 8632 (this should be compared to the sum of heapify() and
# heappop() compares): list.sort() is (unsurprisingly!) more efficient
# for sorting.
def _siftup(heap, pos):
endpos = len(heap)
startpos = pos
newitem = heap[pos]
# Bubble up the smaller child until hitting a leaf.
childpos = 2*pos + 1 # leftmost child position
while childpos < endpos:
# Set childpos to index of smaller child.
rightpos = childpos + 1
if rightpos < endpos and heap[rightpos] <= heap[childpos]:
childpos = rightpos
# Move the smaller child up.
heap[pos] = heap[childpos]
pos = childpos
childpos = 2*pos + 1
# The leaf at pos is empty now. Put newitem there, and bubble it up
# to its final resting place (by sifting its parents down).
heap[pos] = newitem
_siftdown(heap, startpos, pos)
if __name__ == "__main__":
# Simple sanity test
heap = []
data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
for item in data:
heappush(heap, item)
sort = []
while heap:
sort.append(heappop(heap))
print sort
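The comment block above claims heapify() builds a heap in fewer comparisons than repeated heappush() calls. A quick sketch with the standard-library heapq module (which this code became) shows both construction routes producing equivalent heaps:

```python
import heapq
import random

data = [random.randrange(1000) for _ in range(1000)]

# Build a heap in O(n) with heapify...
h1 = list(data)
heapq.heapify(h1)

# ...or in O(n log n) with n pushes; both yield a valid heap.
h2 = []
for item in data:
    heapq.heappush(h2, item)

# Popping every element off a heap produces sorted order.
out = [heapq.heappop(h1) for _ in range(len(h1))]
assert out == sorted(data)
assert h2[0] == min(data)  # heap invariant: smallest element at index 0
```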
--- NEW FILE: sets.py ---
"""Classes to represent arbitrary sets (including sets of sets).
This module implements sets using dictionaries whose values are
ignored. The usual operations (union, intersection, deletion, etc.)
are provided as both methods and operators.
Important: sets are not sequences! While they support 'x in s',
'len(s)', and 'for x in s', none of those operations are unique for
sequences; for example, mappings support all three as well. The
characteristic operation for sequences is subscripting with small
integers: s[i], for i in range(len(s)). Sets don't support
subscripting at all. Also, sequences allow multiple occurrences and
their elements have a definite order; sets on the other hand don't
record multiple occurrences and don't remember the order of element
insertion (which is why they don't support s[i]).
The following classes are provided:
BaseSet -- All the operations common to both mutable and immutable
sets. This is an abstract class, not meant to be directly
instantiated.
Set -- Mutable sets, subclass of BaseSet; not hashable.
ImmutableSet -- Immutable sets, subclass of BaseSet; hashable.
An iterable argument is mandatory to create an ImmutableSet.
_TemporarilyImmutableSet -- Not a subclass of BaseSet: just a wrapper
around a Set, hashable, giving the same hash value as the
immutable set equivalent would have. Do not use this class
directly.
Only hashable objects can be added to a Set. In particular, you cannot
really add a Set as an element to another Set; if you try, what is
actually added is an ImmutableSet built from it (it compares equal to
the one you tried adding).
When you ask if `x in y' where x is a Set and y is a Set or
ImmutableSet, x is wrapped into a _TemporarilyImmutableSet z, and
what's tested is actually `z in y'.
"""
# Code history:
#
# - Greg V. Wilson wrote the first version, using a different approach
# to the mutable/immutable problem, and inheriting from dict.
#
# - Alex Martelli modified Greg's version to implement the current
# Set/ImmutableSet approach, and make the data an attribute.
#
# - Guido van Rossum rewrote much of the code, made some API changes,
# and cleaned up the docstrings.
#
# - Raymond Hettinger added a number of speedups and other
# improvements.
__all__ = ['BaseSet', 'Set', 'ImmutableSet']
class BaseSet(object):
"""Common base class for mutable and immutable sets."""
__slots__ = ['_data']
# Constructor
def __init__(self):
"""This is an abstract class."""
# Don't call this from a concrete subclass!
if self.__class__ is BaseSet:
raise TypeError, ("BaseSet is an abstract class. "
"Use Set or ImmutableSet.")
# Standard protocols: __len__, __repr__, __str__, __iter__
def __len__(self):
"""Return the number of elements of a set."""
return len(self._data)
def __repr__(self):
"""Return string representation of a set.
This looks like 'Set([<list of elements>])'.
"""
return self._repr()
# __str__ is the same as __repr__
__str__ = __repr__
def _repr(self, sorted=False):
elements = self._data.keys()
if sorted:
elements.sort()
return '%s(%r)' % (self.__class__.__name__, elements)
def __iter__(self):
"""Return an iterator over the elements or a set.
This is the keys iterator for the underlying dict.
"""
return self._data.iterkeys()
# Equality comparisons using the underlying dicts
def __eq__(self, other):
self._binary_sanity_check(other)
return self._data == other._data
def __ne__(self, other):
self._binary_sanity_check(other)
return self._data != other._data
# Copying operations
def copy(self):
"""Return a shallow copy of a set."""
result = self.__class__()
result._data.update(self._data)
return result
__copy__ = copy # For the copy module
def __deepcopy__(self, memo):
"""Return a deep copy of a set; used by copy module."""
# This pre-creates the result and inserts it in the memo
# early, in case the deep copy recurses into another reference
# to this same set. A set can't be an element of itself, but
# it can certainly contain an object that has a reference to
# itself.
from copy import deepcopy
result = self.__class__()
memo[id(self)] = result
data = result._data
value = True
for elt in self:
data[deepcopy(elt, memo)] = value
return result
# Standard set operations: union, intersection, both differences.
# Each has an operator version (e.g. __or__, invoked with |) and a
# method version (e.g. union).
# Subtle: Each pair requires distinct code so that the outcome is
# correct when the type of other isn't suitable. For example, if
# we did "union = __or__" instead, then Set().union(3) would return
# NotImplemented instead of raising TypeError (albeit that *why* it
# raises TypeError as-is is also a bit subtle).
def __or__(self, other):
"""Return the union of two sets as a new set.
(I.e. all elements that are in either set.)
"""
if not isinstance(other, BaseSet):
return NotImplemented
result = self.__class__()
result._data = self._data.copy()
result._data.update(other._data)
return result
def union(self, other):
"""Return the union of two sets as a new set.
(I.e. all elements that are in either set.)
"""
return self | other
def __and__(self, other):
"""Return the intersection of two sets as a new set.
(I.e. all elements that are in both sets.)
"""
if not isinstance(other, BaseSet):
return NotImplemented
if len(self) <= len(other):
little, big = self, other
else:
little, big = other, self
common = filter(big._data.has_key, little._data.iterkeys())
return self.__class__(common)
def intersection(self, other):
"""Return the intersection of two sets as a new set.
(I.e. all elements that are in both sets.)
"""
return self & other
def __xor__(self, other):
"""Return the symmetric difference of two sets as a new set.
(I.e. all elements that are in exactly one of the sets.)
"""
if not isinstance(other, BaseSet):
return NotImplemented
result = self.__class__()
data = result._data
value = True
selfdata = self._data
otherdata = other._data
for elt in selfdata:
if elt not in otherdata:
data[elt] = value
for elt in otherdata:
if elt not in selfdata:
data[elt] = value
return result
def symmetric_difference(self, other):
"""Return the symmetric difference of two sets as a new set.
(I.e. all elements that are in exactly one of the sets.)
"""
return self ^ other
def __sub__(self, other):
"""Return the difference of two sets as a new Set.
(I.e. all elements that are in this set and not in the other.)
"""
if not isinstance(other, BaseSet):
return NotImplemented
result = self.__class__()
data = result._data
otherdata = other._data
value = True
for elt in self:
if elt not in otherdata:
data[elt] = value
return result
def difference(self, other):
"""Return the difference of two sets as a new Set.
(I.e. all elements that are in this set and not in the other.)
"""
return self - other
# Membership test
def __contains__(self, element):
"""Report whether an element is a member of a set.
(Called in response to the expression `element in self'.)
"""
try:
return element in self._data
except TypeError:
transform = getattr(element, "_as_temporarily_immutable", None)
if transform is None:
raise # re-raise the TypeError exception we caught
return transform() in self._data
# Subset and superset test
def issubset(self, other):
"""Report whether another set contains this set."""
self._binary_sanity_check(other)
if len(self) > len(other): # Fast check for obvious cases
return False
otherdata = other._data
for elt in self:
if elt not in otherdata:
return False
return True
def issuperset(self, other):
"""Report whether this set contains another set."""
self._binary_sanity_check(other)
if len(self) < len(other): # Fast check for obvious cases
return False
selfdata = self._data
for elt in other:
if elt not in selfdata:
return False
return True
# Inequality comparisons using the is-subset relation.
__le__ = issubset
__ge__ = issuperset
def __lt__(self, other):
self._binary_sanity_check(other)
return len(self) < len(other) and self.issubset(other)
def __gt__(self, other):
self._binary_sanity_check(other)
return len(self) > len(other) and self.issuperset(other)
# Assorted helpers
def _binary_sanity_check(self, other):
# Check that the other argument to a binary operation is also
# a set, raising a TypeError otherwise.
if not isinstance(other, BaseSet):
raise TypeError, "Binary operation only permitted between sets"
def _compute_hash(self):
# Calculate hash code for a set by xor'ing the hash codes of
# the elements. This ensures that the hash code does not depend
# on the order in which elements are added to the set. This is
# not called __hash__ because a BaseSet should not be hashable;
# only an ImmutableSet is hashable.
result = 0
for elt in self:
result ^= hash(elt)
return result
def _update(self, iterable):
# The main loop for update() and the subclass __init__() methods.
data = self._data
# Use the fast update() method when a dictionary is available.
if isinstance(iterable, BaseSet):
data.update(iterable._data)
return
if isinstance(iterable, dict):
data.update(iterable)
return
value = True
it = iter(iterable)
while True:
try:
for element in it:
data[element] = value
return
except TypeError:
transform = getattr(element, "_as_immutable", None)
if transform is None:
raise # re-raise the TypeError exception we caught
data[transform()] = value
class ImmutableSet(BaseSet):
"""Immutable set class."""
__slots__ = ['_hashcode']
# BaseSet + hashing
def __init__(self, iterable=None):
"""Construct an immutable set from an optional iterable."""
self._hashcode = None
self._data = {}
if iterable is not None:
self._update(iterable)
def __hash__(self):
if self._hashcode is None:
self._hashcode = self._compute_hash()
return self._hashcode
class Set(BaseSet):
""" Mutable set class."""
__slots__ = []
# BaseSet + operations requiring mutability; no hashing
def __init__(self, iterable=None):
"""Construct a set from an optional iterable."""
self._data = {}
if iterable is not None:
self._update(iterable)
def __hash__(self):
"""A Set cannot be hashed."""
# We inherit object.__hash__, so we must deny this explicitly
raise TypeError, "Can't hash a Set, only an ImmutableSet."
# In-place union, intersection, differences.
# Subtle: The xyz_update() functions deliberately return None,
# as do all mutating operations on built-in container types.
# The __xyz__ spellings have to return self, though.
def __ior__(self, other):
"""Update a set with the union of itself and another."""
self._binary_sanity_check(other)
self._data.update(other._data)
return self
def union_update(self, other):
"""Update a set with the union of itself and another."""
self |= other
def __iand__(self, other):
"""Update a set with the intersection of itself and another."""
self._binary_sanity_check(other)
self._data = (self & other)._data
return self
def intersection_update(self, other):
"""Update a set with the intersection of itself and another."""
self &= other
def __ixor__(self, other):
"""Update a set with the symmetric difference of itself and another."""
self._binary_sanity_check(other)
data = self._data
value = True
for elt in other:
if elt in data:
del data[elt]
else:
data[elt] = value
return self
def symmetric_difference_update(self, other):
"""Update a set with the symmetric difference of itself and another."""
self ^= other
def __isub__(self, other):
"""Remove all elements of another set from this set."""
self._binary_sanity_check(other)
data = self._data
for elt in other:
if elt in data:
del data[elt]
return self
def difference_update(self, other):
"""Remove all elements of another set from this set."""
self -= other
# Python dict-like mass mutations: update, clear
def update(self, iterable):
"""Add all values from an iterable (such as a list or file)."""
self._update(iterable)
def clear(self):
"""Remove all elements from this set."""
self._data.clear()
# Single-element mutations: add, remove, discard
def add(self, element):
"""Add an element to a set.
This has no effect if the element is already present.
"""
try:
self._data[element] = True
except TypeError:
transform = getattr(element, "_as_immutable", None)
if transform is None:
raise # re-raise the TypeError exception we caught
self._data[transform()] = True
def remove(self, element):
"""Remove an element from a set; it must be a member.
If the element is not a member, raise a KeyError.
"""
try:
del self._data[element]
except TypeError:
transform = getattr(element, "_as_temporarily_immutable", None)
if transform is None:
raise # re-raise the TypeError exception we caught
del self._data[transform()]
def discard(self, element):
"""Remove an element from a set if it is a member.
If the element is not a member, do nothing.
"""
try:
self.remove(element)
except KeyError:
pass
def pop(self):
"""Remove and return an arbitrary set element."""
return self._data.popitem()[0]
def _as_immutable(self):
# Return a copy of self as an immutable set
return ImmutableSet(self)
def _as_temporarily_immutable(self):
# Return self wrapped in a temporarily immutable set
return _TemporarilyImmutableSet(self)
class _TemporarilyImmutableSet(BaseSet):
# Wrap a mutable set as if it were temporarily immutable.
# This only supplies hashing and equality comparisons.
def __init__(self, set):
self._set = set
self._data = set._data # Needed by ImmutableSet.__eq__()
def __hash__(self):
return self._set._compute_hash()
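The operations sets.py defines survive today in the built-in set and frozenset types (Set -> set, ImmutableSet -> frozenset), so a sketch with the built-ins shows the semantics described above. One visible difference, noted here as an assumption about the built-ins rather than about this module: sets.Set auto-wrapped a mutable Set into an ImmutableSet when added to another Set, while the built-ins raise TypeError, so the frozen copy is made explicitly below.

```python
a = set([1, 2, 3])
b = set([3, 4])

assert a | b == set([1, 2, 3, 4])   # union
assert a & b == set([3])            # intersection
assert a ^ b == set([1, 2, 4])      # symmetric difference
assert a - b == set([1, 2])        # difference

# A mutable set is not hashable, so it cannot be an element of
# another set; an immutable (frozen) copy can.
outer = set()
outer.add(frozenset(a))

# Hashing xors element hashes, so insertion order is irrelevant.
assert frozenset([3, 2, 1]) in outer
```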
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** TestDriver.py 14 Sep 2002 00:03:51 -0000 1.4
--- TestDriver.py 22 Sep 2002 06:58:36 -0000 1.5
***************
*** 61,65 ****
format = "%6.2f %" + str(ndigits) + "d"
! for i, n in enumerate(self.buckets):
print format % (100.0 * i / self.nbuckets, n),
print '*' * ((n + hunit - 1) // hunit)
--- 61,66 ----
format = "%6.2f %" + str(ndigits) + "d"
! for i in range(len(self.buckets)):
! n = self.buckets[i]
print format % (100.0 * i / self.nbuckets, n),
print '*' * ((n + hunit - 1) // hunit)
Index: cdb.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cdb.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** cdb.py 11 Sep 2002 06:21:22 -0000 1.2
--- cdb.py 22 Sep 2002 06:58:36 -0000 1.3
***************
*** 6,9 ****
--- 6,12 ----
"""
+
+ from __future__ import generators
+
import os
import struct
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** hammie.py 20 Sep 2002 19:30:52 -0000 1.18
--- hammie.py 22 Sep 2002 06:58:36 -0000 1.19
***************
*** 28,31 ****
--- 28,33 ----
"""
+ from __future__ import generators
+
import sys
import os
Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** mboxtest.py 17 Sep 2002 17:57:39 -0000 1.7
--- mboxtest.py 22 Sep 2002 06:58:36 -0000 1.8
***************
*** 19,22 ****
--- 19,24 ----
"""
+ from __future__ import generators
+
import getopt
import mailbox
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** timcv.py 14 Sep 2002 22:01:42 -0000 1.6
--- timcv.py 22 Sep 2002 06:58:36 -0000 1.7
***************
*** 32,35 ****
--- 32,37 ----
"""
+ from __future__ import generators
+
import os
import sys
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** timtest.py 14 Sep 2002 00:03:51 -0000 1.26
--- timtest.py 22 Sep 2002 06:58:36 -0000 1.27
***************
*** 18,21 ****
--- 18,23 ----
"""
+ from __future__ import generators
+
import os
import sys
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** tokenizer.py 20 Sep 2002 06:18:24 -0000 1.29
--- tokenizer.py 22 Sep 2002 06:58:36 -0000 1.30
***************
*** 2,5 ****
--- 2,7 ----
"""Module to tokenize email messages for spam filtering."""
+ from __future__ import generators
+
import email
import re
***************
*** 513,517 ****
redundant_html = Set()
for part in msg.walk():
! if part.get_content_type() == 'multipart/alternative':
# Descend this part of the tree, adding any redundant HTML text
# part to redundant_html.
--- 515,519 ----
redundant_html = Set()
for part in msg.walk():
! if part.get_type() == 'multipart/alternative':
# Descend this part of the tree, adding any redundant HTML text
# part to redundant_html.
***************
*** 520,524 ****
while stack:
subpart = stack.pop()
! ctype = subpart.get_content_type()
if ctype == 'text/plain':
textpart = subpart
--- 522,526 ----
while stack:
subpart = stack.pop()
! ctype = subpart.get_type('text/plain')
if ctype == 'text/plain':
textpart = subpart
***************
*** 535,539 ****
text.add(htmlpart)
! elif part.get_content_maintype() == 'text':
text.add(part)
--- 537,541 ----
text.add(htmlpart)
! elif part.get_main_type('text') == 'text':
text.add(part)
***************
*** 544,548 ****
# have redundant content, so it goes.
def textparts(msg):
! return Set(filter(lambda part: part.get_content_maintype() == 'text',
msg.walk()))
--- 546,550 ----
# have redundant content, so it goes.
def textparts(msg):
! return Set(filter(lambda part: part.get_main_type('text') == 'text',
msg.walk()))
***************
*** 1019,1023 ****
# Remove HTML/XML tags.
! if (part.get_content_type() == "text/plain" or
not options.retain_pure_html_tags):
text = html_re.sub(' ', text)
--- 1021,1025 ----
# Remove HTML/XML tags.
! if (part.get_type() == "text/plain" or
not options.retain_pure_html_tags):
text = html_re.sub(' ', text)
Index: unheader.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/unheader.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** unheader.py 7 Sep 2002 05:50:42 -0000 1.1
--- unheader.py 22 Sep 2002 06:58:36 -0000 1.2
***************
*** 18,22 ****
"""replace first value for hdr with newval"""
hdr = hdr.lower()
! for (i, (k, v)) in enumerate(self._headers):
if k.lower() == hdr:
self._headers[i] = (k, newval)
--- 18,23 ----
"""replace first value for hdr with newval"""
hdr = hdr.lower()
! for i in range(len(self._headers)):
! k, v = self._headers[i]
if k.lower() == hdr:
self._headers[i] = (k, newval)
From tim_one@users.sourceforge.net Sun Sep 22 08:45:30 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 22 Sep 2002 00:45:30 -0700
Subject: [Spambayes-checkins] spambayes rebal.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7977
Modified Files:
rebal.py
Log Message:
Removed use of 2.3 "string in string"-ism.
Index: rebal.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rebal.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** rebal.py 21 Sep 2002 21:19:40 -0000 1.5
--- rebal.py 22 Sep 2002 07:45:27 -0000 1.6
***************
*** 128,133 ****
# weak check against mixing ham and spam
! if ("Ham" in setpfx and "Spam" in resdir or
! "Spam" in setpfx and "Ham" in resdir):
yn = raw_input("Reservoir and Set dirs appear not to match. "
"Continue? (y/n) ")
--- 128,133 ----
# weak check against mixing ham and spam
! if (setpfx.find("Ham") >= 0 and resdir.find("Spam") >= 0 or
! setpfx.find("Spam") >= 0 and resdir.find("Ham") >= 0):
yn = raw_input("Reservoir and Set dirs appear not to match. "
"Continue? (y/n) ")
From anthonybaxter@users.sourceforge.net Sun Sep 22 08:48:05 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Sun, 22 Sep 2002 00:48:05 -0700
Subject: [Spambayes-checkins] website developer.ht,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv8235
Modified Files:
developer.ht
Log Message:
2.2.1 now supported.
Index: developer.ht
===================================================================
RCS file: /cvsroot/spambayes/website/developer.ht,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** developer.ht 19 Sep 2002 08:57:58 -0000 1.2
--- developer.ht 22 Sep 2002 07:48:03 -0000 1.3
***************
*** 12,16 ****
come crying <wink>.
! This project uses the absolute bleeding edge of python code, available from CVS on sourceforge.
The spambayes code itself is also available via CVS
--- 12,16 ----
come crying <wink>.
! This project works with either the absolute bleeding edge of python code, available from CVS on sourceforge, or with Python 2.2.1 (not 2.2, or 2.1.3).
The spambayes code itself is also available via CVS
From tim_one@users.sourceforge.net Sun Sep 22 09:31:50 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 22 Sep 2002 01:31:50 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15892
Modified Files:
TestDriver.py
Log Message:
Augmented the Hist class to compute and display mean and (sample) sdev.
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** TestDriver.py 22 Sep 2002 06:58:36 -0000 1.5
--- TestDriver.py 22 Sep 2002 08:31:48 -0000 1.6
***************
*** 36,39 ****
--- 36,42 ----
self.buckets = [0] * nbuckets
self.nbuckets = nbuckets
+ self.n = 0 # number of data points
+ self.sum = 0.0 # sum of their values
+ self.sumsq = 0.0 # sum of their squares
def add(self, x):
***************
*** 44,47 ****
--- 47,55 ----
self.buckets[i] += 1
+ self.n += 1
+ x *= 100.0
+ self.sum += x
+ self.sumsq += x*x
+
def __iadd__(self, other):
if self.nbuckets != other.nbuckets:
***************
*** 49,55 ****
--- 57,77 ----
for i in range(self.nbuckets):
self.buckets[i] += other.buckets[i]
+ self.n += other.n
+ self.sum += other.sum
+ self.sumsq += other.sumsq
return self
def display(self, WIDTH=60):
+ from math import sqrt
+ if self.n > 1:
+ mean = self.sum / self.n
+ # sum (x_i - mean)**2 = sum (x_i**2 - 2*x_i*mean + mean**2) =
+ # sum x_i**2 - 2*mean*sum x_i + sum mean**2 =
+ # sum x_i**2 - 2*mean*mean*n + n*mean**2 =
+ # sum x_i**2 - n*mean**2
+ samplevar = (self.sumsq - self.n * mean**2) / (self.n - 1)
+ print "%d items; mean %.2f; sample sdev %.2f" % (self.n,
+ mean, sqrt(samplevar))
+
biggest = max(self.buckets)
hunit, r = divmod(biggest, WIDTH)
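The derivation in the diff above lets Hist compute the sample standard deviation from running sums alone, without storing the data points: sum (x_i - mean)**2 reduces to sum x_i**2 - n*mean**2. A minimal sketch checking that identity against the direct two-pass formula (the sample values are made up for illustration; Hist itself scales its inputs by 100 first):

```python
from math import sqrt

data = [0.1, 0.2, 0.2, 0.4, 0.6]

# Running sums, as the Hist class keeps them.
n = len(data)
total = sum(data)
sumsq = sum(x * x for x in data)

mean = total / n
# sum (x_i - mean)**2 == sum x_i**2 - n*mean**2, per the derivation above
samplevar = (sumsq - n * mean ** 2) / (n - 1)
sdev = sqrt(samplevar)

# Direct two-pass computation agrees.
direct = sum((x - mean) ** 2 for x in data) / (n - 1)
assert abs(samplevar - direct) < 1e-12
```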
From montanaro@users.sourceforge.net Mon Sep 23 04:13:33 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Sun, 22 Sep 2002 20:13:33 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.23,1.24
tokenizer.py,1.30,1.31
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1989
Modified Files:
Options.py tokenizer.py
Log Message:
Added two new options: check_octets and octet_prefix_size. If check_octets
is True, any application/octet-stream parts will be tokenized simply by
returning octet_prefix_size bytes of the first line of the base64-encoded
stuff. For example, DOS/Windows executables seem to begin with the string
"TVqQA". If enabled, the token "octet:TVqQA" would be returned for such
sections, providing they had the appropriate content type and transfer
encoding.
By default, check_octets is False, preserving preexisting behavior. I can't
test this very well since I've pretty ruthlessly purged viruses from my Spam
corpus.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** Options.py 21 Sep 2002 21:11:50 -0000 1.23
--- Options.py 23 Sep 2002 03:13:30 -0000 1.24
***************
*** 50,53 ****
--- 50,58 ----
ignore_redundant_html: False
+ # If true, the first few characters of application/octet-stream sections
+ # are used, undecoded. What 'few' means is decided by octet_prefix_size.
+ check_octets: False
+ octet_prefix_size: 5
+
# Generate tokens just counting the number of instances of each kind of
# header line, in a case-sensitive way.
***************
*** 193,196 ****
--- 198,203 ----
'count_all_header_lines': boolean_cracker,
'mine_received_headers': boolean_cracker,
+ 'check_octets': boolean_cracker,
+ 'octet_prefix_size': int_cracker,
'basic_header_tokenize': boolean_cracker,
'basic_header_tokenize_only': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** tokenizer.py 22 Sep 2002 06:58:36 -0000 1.30
--- tokenizer.py 23 Sep 2002 03:13:31 -0000 1.31
***************
*** 549,552 ****
--- 549,557 ----
msg.walk()))
+ def octetparts(msg):
+ return Set(filter(lambda part:
+ part.get_content_type() == 'application/octet-stream',
+ msg.walk()))
+
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
***************
*** 992,996 ****
--- 997,1011 ----
part is ignored. Except in special cases, it's recommended to
leave that at its default of false.
+
+ If options.check_octets is True, the first few undecoded characters
+ of application/octet-stream parts of the message body become tokens.
"""
+
+ if options.check_octets:
+ # Find, decode application/octet-stream parts of the body,
+ # tokenizing the first few characters of each chunk
+ for part in octetparts(msg):
+ text = part.get_payload(decode=False)
+ yield "octet:%s" % text[:options.octet_prefix_size]
# Find, decode (base64, qp), and tokenize textual parts of the body.
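The log message notes that DOS/Windows executables appear to begin with "TVqQA" once base64-encoded; that follows from the "MZ" executable header. A sketch of how the octet token is formed from an undecoded payload (the payload here is constructed for illustration, not taken from a real message):

```python
import base64

# Hypothetical application/octet-stream payload: a DOS/Windows
# executable starts with the bytes "MZ\x90\x00".
payload = base64.b64encode(b"MZ\x90\x00" + b"\x00" * 60).decode("ascii")

octet_prefix_size = 5  # the new option's default
token = "octet:%s" % payload[:octet_prefix_size]
assert token == "octet:TVqQA"
```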
From tim.one@comcast.net Mon Sep 23 05:33:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 23 Sep 2002 00:33:18 -0400
Subject: [Spambayes-checkins] spambayes
Options.py,1.23,1.24tokenizer.py,1.30,1.31
In-Reply-To:
Message-ID:
> + def octetparts(msg):
> + return Set(filter(lambda part:
> + part.get_content_type() ==
> 'application/octet-stream',
> + msg.walk()))
I think Guido got rid of all uses of get_content_type, so that this code
could be used with an older email pkg.
From skip@pobox.com Mon Sep 23 13:57:47 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 23 Sep 2002 07:57:47 -0500
Subject: [Spambayes-checkins] spambayes
Options.py,1.23,1.24tokenizer.py,1.30,1.31
In-Reply-To:
References:
Message-ID: <15759.4043.426296.579486@12-248-11-90.client.attbi.com>
>>>>> "Tim" == Tim Peters writes:
>> + def octetparts(msg):
>> + return Set(filter(lambda part:
>> + part.get_content_type() ==
>> 'application/octet-stream',
>> + msg.walk()))
Tim> I think Guido got rid of all uses of get_content_type, so that this
Tim> code could be used with an older email pkg.
What is the correct replacement, part.get_type()?
Skip
From guido@python.org Mon Sep 23 14:18:36 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 23 Sep 2002 09:18:36 -0400
Subject: [Spambayes-checkins] spambayes
Options.py,1.23,1.24tokenizer.py,1.30,1.31
In-Reply-To: Your message of "Mon, 23 Sep 2002 07:57:47 CDT."
<15759.4043.426296.579486@12-248-11-90.client.attbi.com>
References:
<15759.4043.426296.579486@12-248-11-90.client.attbi.com>
Message-ID: <200209231318.g8NDIaQ06599@pcp02138704pcs.reston01.va.comcast.net>
> >>>>> "Tim" == Tim Peters writes:
>
> >> + def octetparts(msg):
> >> + return Set(filter(lambda part:
> >> + part.get_content_type() ==
> >> 'application/octet-stream',
> >> + msg.walk()))
>
>
> Tim> I think Guido got rid of all uses of get_content_type, so that this
> Tim> code could be used with an older email pkg.
>
> What is the correct replacement, part.get_type()?
Since you're only comparing it with app/oct-str, yes.
--Guido van Rossum (home page: http://www.python.org/~guido/)
From barry@users.sourceforge.net Mon Sep 23 14:30:45 2002
From: barry@users.sourceforge.net (Barry Warsaw)
Date: Mon, 23 Sep 2002 06:30:45 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.31,1.32
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25327
Modified Files:
tokenizer.py
Log Message:
Use the email 2.3 API, get_type() and friends -> get_content_type()
and friends. The latter always returns a content type string, never
None.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.31
retrieving revision 1.32
diff -C2 -d -r1.31 -r1.32
*** tokenizer.py 23 Sep 2002 03:13:31 -0000 1.31
--- tokenizer.py 23 Sep 2002 13:30:42 -0000 1.32
***************
*** 537,541 ****
text.add(htmlpart)
! elif part.get_main_type('text') == 'text':
text.add(part)
--- 537,541 ----
text.add(htmlpart)
! elif part.get_content_maintype() == 'text':
text.add(part)
***************
*** 546,550 ****
# have redundant content, so it goes.
def textparts(msg):
! return Set(filter(lambda part: part.get_main_type('text') == 'text',
msg.walk()))
--- 546,550 ----
# have redundant content, so it goes.
def textparts(msg):
! return Set(filter(lambda part: part.get_content_maintype() == 'text',
msg.walk()))
***************
*** 716,722 ****
def crack_content_xyz(msg):
! x = msg.get_type()
! if x is not None:
! yield 'content-type:' + x.lower()
x = msg.get_param('type')
--- 716,720 ----
def crack_content_xyz(msg):
! yield 'content-type:' + msg.get_content_type()
x = msg.get_param('type')
***************
*** 1036,1040 ****
# Remove HTML/XML tags.
! if (part.get_type() == "text/plain" or
not options.retain_pure_html_tags):
text = html_re.sub(' ', text)
--- 1034,1038 ----
# Remove HTML/XML tags.
! if (part.get_content_type() == "text/plain" or
not options.retain_pure_html_tags):
text = html_re.sub(' ', text)
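The checkin above moves tokenizer.py to the email 2.3 API, where get_content_type() always returns a string. A minimal sketch of the difference, using the modern email package (where only the get_content_*() family survives):

```python
from email.message import Message

msg = Message()
# get_content_type() never returns None: with no Content-Type header
# it falls back to the RFC 2045 default, text/plain.
assert msg.get_content_type() == "text/plain"

msg["Content-Type"] = "TEXT/HTML; charset=us-ascii"
# The value is lowercased and stripped of its parameters.
assert msg.get_content_type() == "text/html"
assert msg.get_content_maintype() == "text"
```

That guarantee is what lets the filter/lambda calls below compare directly against 'text' without a None check.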
From bkc@users.sourceforge.net Mon Sep 23 14:55:21 2002
From: bkc@users.sourceforge.net (Brad Clements)
Date: Mon, 23 Sep 2002 06:55:21 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.9,1.10
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12279
Modified Files:
cmp.py
Log Message:
added mean and sdev reporting, and delta reporting
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** cmp.py 19 Sep 2002 10:25:31 -0000 1.9
--- cmp.py 23 Sep 2002 13:55:18 -0000 1.10
***************
*** 22,35 ****
def suck(f):
fns = []
! fps = []
get = f.readline
while 1:
line = get()
! if line.startswith('-> tested'):
print line,
if line.startswith('-> '):
continue
if line.startswith('total'):
! break
# A line with an f-p rate and an f-n rate.
p, n = map(float, line.split())
--- 22,55 ----
def suck(f):
fns = []
! fps = []
! hamdev = []
! spamdev = []
!
get = f.readline
while 1:
line = get()
! if line.startswith('-> tested'):
print line,
+ if line.find('sample sdev') != -1:
+ vals = line.split(';')
+ mean = float(vals[1].split(' ')[-1])
+ sdev = float(vals[2].split(' ')[-1])
+ val = (mean,sdev)
+ typ = vals[0].split(' ')[2]
+ if line.find('for all runs') != -1:
+ if typ == 'Ham':
+ hamdevall = val
+ else:
+ spamdevall = val
+ elif line.find('all in this') != -1:
+ if typ == 'Ham':
+ hamdev.append(val)
+ else:
+ spamdev.append(val)
+ continue
if line.startswith('-> '):
continue
if line.startswith('total'):
! break
# A line with an f-p rate and an f-n rate.
p, n = map(float, line.split())
***************
*** 45,53 ****
fpmean = float(get().split()[-1])
fnmean = float(get().split()[-1])
! return fps, fns, fptot, fntot, fpmean, fnmean
def tag(p1, p2):
if p1 == p2:
! t = "tied"
else:
t = p1 < p2 and "lost " or "won "
--- 65,73 ----
fpmean = float(get().split()[-1])
fnmean = float(get().split()[-1])
! return fps, fns, fptot, fntot, fpmean, fnmean, hamdev, spamdev,hamdevall,spamdevall
def tag(p1, p2):
if p1 == p2:
! t = "tied "
else:
t = p1 < p2 and "lost " or "won "
***************
*** 58,62 ****
t += " +(was 0)"
return t
!
def dump(p1s, p2s):
alltags = ""
--- 78,93 ----
t += " +(was 0)"
return t
!
! def mtag(m1,m2):
! mean1,dev1 = m1
! mean2,dev2 = m2
! mp = (mean2 - mean1) * 100.0 / mean1
! dp = (dev2 - dev1) * 100.0 / dev1
!
! return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % (
! mean1,mean2,mp,
! dev1,dev2,dp
! )
!
def dump(p1s, p2s):
alltags = ""
***************
*** 69,72 ****
--- 100,107 ----
print "%-4s %2d times" % (t, alltags.count(t))
print
+
+ def dumpdev(meandev1,meandev2):
+ for m1,m2 in zip(meandev1,meandev2):
+ print mtag(m1, m2)
def windowsfy(fn):
***************
*** 83,88 ****
f2n = windowsfy(f2n)
! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n))
! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n))
print
--- 118,123 ----
f2n = windowsfy(f2n)
! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1,hamdev1,spamdev1,hamdevall1,spamdevall1 = suck(file(f1n))
! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2,hamdev2,spamdev2,hamdevall2,spamdevall2 = suck(file(f2n))
print
***************
*** 97,98 ****
--- 132,151 ----
print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2)
print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2)
+
+ print
+ print "ham mean ham sdev"
+ dumpdev(hamdev1,hamdev2)
+ print
+ print "ham mean and sdev for all runs"
+ dumpdev([hamdevall1],[hamdevall2])
+
+ print
+ print "spam mean spam sdev"
+ dumpdev(spamdev1,spamdev2)
+ print
+ print "spam mean and sdev for all runs"
+ dumpdev([spamdevall1],[spamdevall2])
+ print
+ diff1 = spamdevall1[0] - hamdevall1[0]
+ diff2 = spamdevall2[0] - hamdevall2[0]
+ print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,diff2,(diff2-diff1))
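The new mtag() prints each mean/sdev pair together with its relative change. A standalone sketch of that formatting (delta_line is a hypothetical name, mirroring mtag()'s "%2.2f ... (%+2.2f%%)" layout):

```python
def delta_line(m1, m2):
    # m1 and m2 are (mean, sdev) pairs from two test runs; report each
    # value plus its percent change relative to the first run.
    mean1, dev1 = m1
    mean2, dev2 = m2
    mp = (mean2 - mean1) * 100.0 / mean1
    dp = (dev2 - dev1) * 100.0 / dev1
    return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % (
        mean1, mean2, mp, dev1, dev2, dp)

print(delta_line((2.0, 0.5), (2.5, 0.4)))
# → 2.00 2.50 (+25.00%) 0.50 0.40 (-20.00%)
```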
From bkc@users.sourceforge.net Mon Sep 23 14:56:16 2002
From: bkc@users.sourceforge.net (Brad Clements)
Date: Mon, 23 Sep 2002 06:56:16 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13218
Modified Files:
TestDriver.py
Log Message:
changed mean and sdev output, added -> prefix for capture by rates.py and cmp.py
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** TestDriver.py 22 Sep 2002 08:31:48 -0000 1.6
--- TestDriver.py 23 Sep 2002 13:56:14 -0000 1.7
***************
*** 90,98 ****
def printhist(tag, ham, spam):
print
! print "Ham distribution for", tag
ham.display()
print
! print "Spam distribution for", tag
spam.display()
--- 90,98 ----
def printhist(tag, ham, spam):
print
! print "-> Ham distribution for", tag,
ham.display()
print
! print "-> Spam distribution for", tag,
spam.display()
From montanaro@users.sourceforge.net Mon Sep 23 15:38:44 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Mon, 23 Sep 2002 07:38:44 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.32,1.33
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14069
Modified Files:
tokenizer.py
Log Message:
replace get_content_type() with get_type() to allow running under 2.2.x
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** tokenizer.py 23 Sep 2002 13:30:42 -0000 1.32
--- tokenizer.py 23 Sep 2002 14:38:41 -0000 1.33
***************
*** 551,555 ****
def octetparts(msg):
return Set(filter(lambda part:
! part.get_content_type() == 'application/octet-stream',
msg.walk()))
--- 551,555 ----
def octetparts(msg):
return Set(filter(lambda part:
! part.get_type() == 'application/octet-stream',
msg.walk()))
From richiehindle@users.sourceforge.net Mon Sep 23 20:41:01 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Mon, 23 Sep 2002 12:41:01 -0700
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4313
Modified Files:
pop3proxy.py
Log Message:
Fixed a bug whereby your email client would see no traffic for ages, and hence potentially time out, when huge emails were proxied. It now reads for 30 seconds, and if the message is still arriving it classifies it based on what it's seen so far and starts returning it to the email client.
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** pop3proxy.py 18 Sep 2002 22:01:39 -0000 1.2
--- pop3proxy.py 23 Sep 2002 19:40:58 -0000 1.3
***************
*** 26,30 ****
"""
! import sys, re, operator, errno, getopt, cPickle, socket, asyncore, asynchat
import classifier, tokenizer, hammie
--- 26,39 ----
"""
! # This module is part of the spambayes project, which is Copyright 2002
! # The Python Software Foundation and is covered by the Python Software
! # Foundation license.
!
! __author__ = "Richie Hindle "
! __credits__ = "Tim Peters, Neale Pickett, all the spambayes contributors."
!
!
! import sys, re, operator, errno, getopt, cPickle, time
! import socket, asyncore, asynchat
import classifier, tokenizer, hammie
***************
*** 76,80 ****
asynchat.async_chat.__init__(self, clientSocket)
self.request = ''
- self.isClosing = False
self.set_terminator('\r\n')
serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
--- 85,88 ----
***************
*** 110,117 ****
def readResponse(self, command, args):
! """Reads the POP3 server's response. Also sets self.isClosing
! to True if the server closes the socket, which tells
! found_terminator() to close when the response has been sent.
"""
isMulti = self.isMultiline(command, args)
responseLines = []
--- 118,131 ----
def readResponse(self, command, args):
! """Reads the POP3 server's response and returns a tuple of
! (response, isClosing, timedOut). isClosing is True if the
! server closes the socket, which tells found_terminator() to
! close when the response has been sent. timedOut is set if the
! request was still arriving after 30 seconds, and tells
! found_terminator() to proxy the remainder of the response.
"""
+ isClosing = False
+ timedOut = False
+ startTime = time.time()
isMulti = self.isMultiline(command, args)
responseLines = []
***************
*** 121,125 ****
if not line:
# The socket's been closed by the server, probably by QUIT.
! self.isClosing = True
break
elif not isMulti or (isFirstLine and line.startswith('-ERR')):
--- 135,139 ----
if not line:
# The socket's been closed by the server, probably by QUIT.
! isClosing = True
break
elif not isMulti or (isFirstLine and line.startswith('-ERR')):
***************
*** 135,141 ****
responseLines.append(line)
isFirstLine = False
! return ''.join(responseLines)
def collect_incoming_data(self, data):
--- 149,161 ----
responseLines.append(line)
+ # Time out after 30 seconds - found_terminator() knows how
+ # to deal with this.
+ if time.time() > startTime + 30:
+ timedOut = True
+ break
+
isFirstLine = False
! return ''.join(responseLines), isClosing, timedOut
def collect_incoming_data(self, data):
***************
*** 146,155 ****
"""Asynchat override."""
# Send the request to the server and read the reply.
- # XXX When the response is huge, the email client can time out.
- # It should read as much as it can from the server, then if the
- # response is still coming after say 30 seconds, it should
- # classify the message based on that and send back the headers
- # and the body so far. Then it should become a simple
- # one-packet-at-a-time proxy for the rest of the response.
if self.request.strip().upper() == 'KILL':
self.serverFile.write('QUIT\r\n')
--- 166,169 ----
***************
*** 168,172 ****
command = splitCommand[0].upper()
args = splitCommand[1:]
! rawResponse = self.readResponse(command, args)
# Pass the request and the raw response to the subclass and
--- 182,186 ----
command = splitCommand[0].upper()
args = splitCommand[1:]
! rawResponse, isClosing, timedOut = self.readResponse(command, args)
# Pass the request and the raw response to the subclass and
***************
*** 176,184 ****
self.request = ''
! # If readResponse() decided that the server had closed its
! # socket, close this one when the response has been sent.
! if self.isClosing:
! self.close_when_done()
def handle_error(self):
"""Let SystemExit cause an exit."""
--- 190,216 ----
self.request = ''
! # If readResponse() timed out, we still need to read and proxy
! # the rest of the message.
! if timedOut:
! while True:
! line = self.serverFile.readline()
! if not line:
! # The socket's been closed by the server.
! isClosing = True
! break
! elif line == '.\r\n':
! # The termination line.
! self.push(line)
! break
! else:
! # A normal line.
! self.push(line)
+ # If readResponse() or the loop above decided that the server
+ # has closed its socket, close this one when the response has
+ # been sent.
+ if isClosing:
+ self.close_when_done()
+
def handle_error(self):
"""Let SystemExit cause an exit."""
***************
*** 492,496 ****
def runProxy():
! bayes = hammie.createbayes()
BayesProxyListener('localhost', 8110, 8111, bayes)
bayes.learn(tokenizer.tokenize(spam1), True)
--- 524,529 ----
def runProxy():
! # Name the database in case it ever gets auto-flushed to disk.
! bayes = hammie.createbayes('_pop3proxy.db')
BayesProxyListener('localhost', 8110, 8111, bayes)
bayes.learn(tokenizer.tokenize(spam1), True)
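The fix replaces the unbounded read with a deadline-bounded one that reports how the read ended. A minimal sketch of the pattern (read_with_deadline is a hypothetical helper, not the actual readResponse()):

```python
import time

def read_with_deadline(readline, deadline=30.0):
    # Collect response lines until the terminator, the peer closing,
    # or the deadline passing; report which condition ended the read
    # so the caller knows whether to keep proxying the remainder.
    start = time.time()
    lines, closing, timed_out = [], False, False
    while True:
        line = readline()
        if not line:
            closing = True          # peer closed the socket
            break
        lines.append(line)
        if line == '.\r\n':
            break                   # POP3 multiline terminator
        if time.time() > start + deadline:
            timed_out = True        # response still arriving; bail out
            break
    return ''.join(lines), closing, timed_out
```

On timeout the caller classifies what it has so far and then falls back to line-at-a-time proxying, exactly as found_terminator() does above.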
From tim_one@users.sourceforge.net Mon Sep 23 21:03:09 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 23 Sep 2002 13:03:09 -0700
Subject: [Spambayes-checkins] spambayes msgs.py,NONE,1.1 README.txt,1.22,1.23
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12319
Modified Files:
README.txt
Added Files:
msgs.py
Log Message:
Preparing to refactor my test drivers.
--- NEW FILE: msgs.py ---
import os
import random
HAMKEEP = None
SPAMKEEP = None
SEED = random.randrange(2000000000)
class Msg(object):
__slots__ = 'tag', 'guts'
def __init__(self, dir, name):
path = dir + "/" + name
self.tag = path
f = open(path, 'rb')
self.guts = f.read()
f.close()
def __iter__(self):
return tokenize(self.guts)
# Compare msgs by their paths; this is appropriate for sets of msgs.
def __hash__(self):
return hash(self.tag)
def __eq__(self, other):
return self.tag == other.tag
def __str__(self):
return self.guts
# The iterator yields a stream of Msg objects, taken from a list of directories.
class MsgStream(object):
__slots__ = 'tag', 'directories', 'keep'
def __init__(self, tag, directories, keep=None):
self.tag = tag
self.directories = directories
self.keep = keep
def __str__(self):
return self.tag
def produce(self):
if self.keep is None:
for directory in self.directories:
for fname in os.listdir(directory):
yield Msg(directory, fname)
return
# We only want part of the msgs. Shuffle each directory list, but
# in such a way that we'll get the same result each time this is
# called on the same directory list.
for directory in self.directories:
all = os.listdir(directory)
random.seed(hash(max(all)) ^ SEED) # reproducible across calls
random.shuffle(all)
del all[self.keep:]
all.sort() # seems to speed access on Win98!
for fname in all:
yield Msg(directory, fname)
def __iter__(self):
return self.produce()
class HamStream(MsgStream):
def __init__(self, tag, directories):
MsgStream.__init__(self, tag, directories, HAMKEEP)
class SpamStream(MsgStream):
def __init__(self, tag, directories):
MsgStream.__init__(self, tag, directories, SPAMKEEP)
def setparms(hamkeep, spamkeep, seed=None):
"""Set HAMKEEP and SPAMKEEP. If seed is not None, also set SEED."""
global HAMKEEP, SPAMKEEP, SEED
HAMKEEP, SPAMKEEP = hamkeep, spamkeep
if seed is not None:
SEED = seed
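MsgStream.produce() relies on seeding the shuffle from the directory contents, so the same random subset comes back on every call within a run. A standalone sketch of that trick (pick_subset is a hypothetical name; the seed value here is made up):

```python
import os
import random

SEED = 12345  # illustrative fixed seed; msgs.py draws one at import time

def pick_subset(directory, keep):
    # Shuffle deterministically: seeding from the (stable) maximum
    # filename XOR a per-run seed means repeated calls on the same
    # directory yield the same subset.
    names = os.listdir(directory)
    random.seed(hash(max(names)) ^ SEED)
    random.shuffle(names)
    del names[keep:]
    names.sort()  # the "seems to speed access on Win98!" step
    return names
```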
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** README.txt 22 Sep 2002 04:59:54 -0000 1.22
--- README.txt 23 Sep 2002 20:03:06 -0000 1.23
***************
*** 60,63 ****
--- 60,67 ----
cmp.py below.
+ msgs.py
+ Some simple classes to wrap raw msgs, and to produce streams of
+ msgs. The test drivers use these.
+
Apps
From tim_one@users.sourceforge.net Mon Sep 23 21:18:36 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 23 Sep 2002 13:18:36 -0700
Subject: [Spambayes-checkins]
spambayes msgs.py,1.1,1.2 timtest.py,1.27,1.28 timcv.py,1.7,1.8
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16552
Modified Files:
msgs.py timtest.py timcv.py
Log Message:
Refactored my c-v and grid test drivers to cut code duplication. Of
course this created more duplication too.
The particular reason for upgrading the grid driver is that the c-v
driver really can't be used with Gary Robinson's central-limit approach:
incrementally updating a classifier given the three-pass training
procedure needed looks *hard*. The grid driver doesn't try to
incrementally change the classifiers it builds.
Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** msgs.py 23 Sep 2002 20:03:05 -0000 1.1
--- msgs.py 23 Sep 2002 20:18:34 -0000 1.2
***************
*** 2,5 ****
--- 2,7 ----
import random
+ from tokenizer import tokenize
+
HAMKEEP = None
SPAMKEEP = None
***************
*** 29,33 ****
return self.guts
! # The iterator yields a stream of Msg objects, taken from a list of directories.
class MsgStream(object):
__slots__ = 'tag', 'directories', 'keep'
--- 31,36 ----
return self.guts
! # The iterator yields a stream of Msg objects, taken from a list of
! # directories.
class MsgStream(object):
__slots__ = 'tag', 'directories', 'keep'
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** timtest.py 22 Sep 2002 06:58:36 -0000 1.27
--- timtest.py 23 Sep 2002 20:18:34 -0000 1.28
***************
*** 1,9 ****
#! /usr/bin/env python
- # At the moment, this requires Python 2.3 from CVS (heapq, Set, enumerate).
# A test driver using "the standard" test directory structure. See also
! # rates.py and cmp.py for summarizing results.
! """Usage: %(program)s [-h] -n nsets
Where:
--- 1,9 ----
#! /usr/bin/env python
# A test driver using "the standard" test directory structure. See also
! # rates.py and cmp.py for summarizing results. This runs an NxN test grid,
! # skipping the diagonal.
! """Usage: %(program)s [options] -n nsets
Where:
***************
*** 14,17 ****
--- 14,32 ----
This is required.
+ If you only want to use some of the messages in each set,
+
+ --ham-keep int
+ The maximum number of msgs to use from each Ham set. The msgs are
+ chosen randomly. See also the -s option.
+
+ --spam-keep int
+ The maximum number of msgs to use from each Spam set. The msgs are
+ chosen randomly. See also the -s option.
+
+ -s int
+ A seed for the random number generator. Has no effect unless
+ at least one of {--ham-keep, --spam-keep} is specified. If -s
+ isn't specified, the seed is taken from the current time.
+
In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
***************
*** 20,29 ****
from __future__ import generators
- import os
import sys
from Options import options
! from tokenizer import tokenize
! from TestDriver import Driver
program = sys.argv[0]
--- 35,43 ----
from __future__ import generators
import sys
from Options import options
! import TestDriver
! import msgs
program = sys.argv[0]
***************
*** 37,85 ****
sys.exit(code)
- class Msg(object):
- def __init__(self, dir, name):
- path = dir + "/" + name
- self.tag = path
- f = open(path, 'rb')
- guts = f.read()
- f.close()
- self.guts = guts
-
- def __iter__(self):
- return tokenize(self.guts)
-
- def __hash__(self):
- return hash(self.tag)
-
- def __eq__(self, other):
- return self.tag == other.tag
-
- def __str__(self):
- return self.guts
-
- class MsgStream(object):
- def __init__(self, directory):
- self.directory = directory
-
- def __str__(self):
- return self.directory
-
- def produce(self):
- directory = self.directory
- for fname in os.listdir(directory):
- yield Msg(directory, fname)
-
- def xproduce(self):
- import random
- directory = self.directory
- all = os.listdir(directory)
- random.seed(hash(directory))
- random.shuffle(all)
- for fname in all[-1500:-1300:]:
- yield Msg(directory, fname)
-
- def __iter__(self):
- return self.produce()
-
def drive(nsets):
print options.display()
--- 51,54 ----
***************
*** 89,112 ****
spamhamdirs = zip(spamdirs, hamdirs)
! d = Driver()
for spamdir, hamdir in spamhamdirs:
d.new_classifier()
! d.train(MsgStream(hamdir), MsgStream(spamdir))
for sd2, hd2 in spamhamdirs:
if (sd2, hd2) == (spamdir, hamdir):
continue
! d.test(MsgStream(hd2), MsgStream(sd2))
d.finishtest()
d.alldone()
! if __name__ == "__main__":
import getopt
try:
! opts, args = getopt.getopt(sys.argv[1:], 'hn:')
except getopt.error, msg:
usage(1, msg)
! nsets = None
for opt, arg in opts:
if opt == '-h':
--- 58,84 ----
spamhamdirs = zip(spamdirs, hamdirs)
! d = TestDriver.Driver()
for spamdir, hamdir in spamhamdirs:
d.new_classifier()
! d.train(msgs.HamStream(hamdir, [hamdir]),
! msgs.SpamStream(spamdir, [spamdir]))
for sd2, hd2 in spamhamdirs:
if (sd2, hd2) == (spamdir, hamdir):
continue
! d.test(msgs.HamStream(hd2, [hd2]),
! msgs.SpamStream(sd2, [sd2]))
d.finishtest()
d.alldone()
! def main():
import getopt
try:
! opts, args = getopt.getopt(sys.argv[1:], 'hn:s:',
! ['ham-keep=', 'spam-keep='])
except getopt.error, msg:
usage(1, msg)
! nsets = seed = hamkeep = spamkeep = None
for opt, arg in opts:
if opt == '-h':
***************
*** 114,117 ****
--- 86,95 ----
elif opt == '-n':
nsets = int(arg)
+ elif opt == '-s':
+ seed = int(arg)
+ elif opt == '--ham-keep':
+ hamkeep = int(arg)
+ elif opt == '--spam-keep':
+ spamkeep = int(arg)
if args:
***************
*** 120,122 ****
--- 98,104 ----
usage(1, "-n is required")
+ msgs.setparms(hamkeep, spamkeep, seed)
drive(nsets)
+
+ if __name__ == "__main__":
+ main()
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** timcv.py 22 Sep 2002 06:58:36 -0000 1.7
--- timcv.py 23 Sep 2002 20:18:34 -0000 1.8
***************
*** 1,4 ****
#! /usr/bin/env python
- # At the moment, this requires Python 2.3 from CVS (heapq, Set, enumerate).
# A driver for N-fold cross validation.
--- 1,3 ----
***************
*** 34,48 ****
from __future__ import generators
- import os
import sys
- import random
from Options import options
- from tokenizer import tokenize
import TestDriver
!
! HAMKEEP = None
! SPAMKEEP = None
! SEED = random.randrange(2000000000)
program = sys.argv[0]
--- 33,41 ----
from __future__ import generators
import sys
from Options import options
import TestDriver
! import msgs
program = sys.argv[0]
***************
*** 56,122 ****
sys.exit(code)
- class Msg(object):
- __slots__ = 'tag', 'guts'
-
- def __init__(self, dir, name):
- path = dir + "/" + name
- self.tag = path
- f = open(path, 'rb')
- self.guts = f.read()
- f.close()
-
- def __iter__(self):
- return tokenize(self.guts)
-
- # Compare msgs by their paths; this is appropriate for sets of msgs.
- def __hash__(self):
- return hash(self.tag)
-
- def __eq__(self, other):
- return self.tag == other.tag
-
- def __str__(self):
- return self.guts
-
- class MsgStream(object):
- __slots__ = 'tag', 'directories', 'keep'
-
- def __init__(self, tag, directories, keep=None):
- self.tag = tag
- self.directories = directories
- self.keep = keep
-
- def __str__(self):
- return self.tag
-
- def produce(self):
- if self.keep is None:
- for directory in self.directories:
- for fname in os.listdir(directory):
- yield Msg(directory, fname)
- return
- # We only want part of the msgs. Shuffle each directory list, but
- # in such a way that we'll get the same result each time this is
- # called on the same directory list.
- for directory in self.directories:
- all = os.listdir(directory)
- random.seed(hash(max(all)) ^ SEED) # reproducible across calls
- random.shuffle(all)
- del all[self.keep:]
- all.sort() # seems to speed access on Win98!
- for fname in all:
- yield Msg(directory, fname)
-
- def __iter__(self):
- return self.produce()
-
- class HamStream(MsgStream):
- def __init__(self, tag, directories):
- MsgStream.__init__(self, tag, directories, HAMKEEP)
-
- class SpamStream(MsgStream):
- def __init__(self, tag, directories):
- MsgStream.__init__(self, tag, directories, SPAMKEEP)
-
def drive(nsets):
print options.display()
--- 49,52 ----
***************
*** 127,132 ****
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]),
! SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:]))
# Now run nsets times, predicting pair i against all except pair i.
--- 57,62 ----
d = TestDriver.Driver()
# Train it on all sets except the first.
! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]),
! msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:]))
# Now run nsets times, predicting pair i against all except pair i.
***************
*** 134,139 ****
h = hamdirs[i]
s = spamdirs[i]
! hamstream = HamStream(h, [h])
! spamstream = SpamStream(s, [s])
if i > 0:
--- 64,69 ----
h = hamdirs[i]
s = spamdirs[i]
! hamstream = msgs.HamStream(h, [h])
! spamstream = msgs.SpamStream(s, [s])
if i > 0:
***************
*** 152,156 ****
def main():
- global SEED, HAMKEEP, SPAMKEEP
import getopt
--- 82,85 ----
***************
*** 161,165 ****
usage(1, msg)
! nsets = seed = None
for opt, arg in opts:
if opt == '-h':
--- 90,94 ----
usage(1, msg)
! nsets = seed = hamkeep = spamkeep = None
for opt, arg in opts:
if opt == '-h':
***************
*** 170,176 ****
seed = int(arg)
elif opt == '--ham-keep':
! HAMKEEP = int(arg)
elif opt == '--spam-keep':
! SPAMKEEP = int(arg)
if args:
--- 99,105 ----
seed = int(arg)
elif opt == '--ham-keep':
! hamkeep = int(arg)
elif opt == '--spam-keep':
! spamkeep = int(arg)
if args:
***************
*** 178,184 ****
if nsets is None:
usage(1, "-n is required")
- if seed is not None:
- SEED = seed
drive(nsets)
--- 107,112 ----
if nsets is None:
usage(1, "-n is required")
+ msgs.setparms(hamkeep, spamkeep, seed)
drive(nsets)
From tim_one@users.sourceforge.net Mon Sep 23 22:19:10 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 23 Sep 2002 14:19:10 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.24,1.25
TestDriver.py,1.7,1.8 classifier.py,1.17,1.18
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9358
Modified Files:
Options.py TestDriver.py classifier.py
Log Message:
New option Classifier/use_central_limit. Read the comments in Options.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** Options.py 23 Sep 2002 03:13:30 -0000 1.24
--- Options.py 23 Sep 2002 21:19:08 -0000 1.25
***************
*** 185,188 ****
--- 185,200 ----
# want a higher spam_cutoff.
robinson_minimum_prob_strength: 0.0
+
+ ###########################################################################
+ # More speculative options for Gary Robinson's central-limit. These may go
+ # away, or a bunch of incompatible stuff above may go away.
+
+ # Use a central-limit approach for scoring.
+ # The number of extremes to use is given by max_discriminators (above).
+ # spam_cutoff should almost certainly be exactly 0.5 when using this approach.
+ # DO NOT run cross-validation tests when this is enabled! They'll deliver
+ # nonsense, or, if you're lucky, will blow up with division by 0 or negative
+ # square roots. An NxN test grid should work fine.
+ use_central_limit: False
"""
***************
*** 230,233 ****
--- 242,247 ----
'use_robinson_ranking': boolean_cracker,
'robinson_minimum_prob_strength': float_cracker,
+
+ 'use_central_limit': boolean_cracker,
},
}
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** TestDriver.py 23 Sep 2002 13:56:14 -0000 1.7
--- TestDriver.py 23 Sep 2002 21:19:08 -0000 1.8
***************
*** 124,127 ****
--- 124,129 ----
self.trained_spam_hist = Hist(options.nbuckets)
+ # CAUTION: this just doesn't work for incremental training when
+ # options.use_central_limit is in effect.
def train(self, ham, spam):
print "-> Training on", ham, "&", spam, "...",
***************
*** 130,134 ****
--- 132,140 ----
self.tester.train(ham, spam)
print c.nham - nham, "hams &", c.nspam- nspam, "spams"
+ c.compute_population_stats(ham, False)
+ c.compute_population_stats(spam, True)
+ # CAUTION: this just doesn't work for incremental training when
+ # options.use_central_limit is in effect.
def untrain(self, ham, spam):
print "-> Forgetting", ham, "&", spam, "...",
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** classifier.py 21 Sep 2002 20:25:49 -0000 1.17
--- classifier.py 23 Sep 2002 21:19:08 -0000 1.18
***************
*** 221,224 ****
--- 221,248 ----
'nspam', # number of spam messages learn() has seen
'nham', # number of non-spam messages learn() has seen
+
+ # The rest is unique to the central-limit code.
+ # n is the # of data points in the population.
+ # sum is the sum of the probabilities, and is a long scaled
+ # by 2**64.
+ # sumsq is the sum of the squares of the probabilities, and
+ # is a long scaled by 2**128.
+ # mean is the mean probability of the population, as an
+ # unscaled float.
+ # var is the variance of the population, as an unscaled float.
+ # There's one set of these for the spam population, and
+ # another for the ham population.
+ # XXX If this code survives, clean it up.
+ 'spamn',
+ 'spamsum',
+ 'spamsumsq',
+ 'spammean',
+ 'spamvar',
+
+ 'hamn',
+ 'hamsum',
+ 'hamsumsq',
+ 'hammean',
+ 'hamvar',
)
***************
*** 226,229 ****
--- 250,256 ----
self.wordinfo = {}
self.nspam = self.nham = 0
+ self.spamn = self.hamn = 0
+ self.spamsum = self.spamsumsq = 0
+ self.hamsum = self.hamsumsq = 0
def __getstate__(self):
***************
*** 451,454 ****
--- 478,511 ----
del self.wordinfo[word]
+ def compute_population_stats(self, msgstream, is_spam):
+ pass
+
+ # XXX More stuff should be reworked to use this as a helper function.
+ def _getclues(self, wordstream):
+ # A priority queue to remember the MAX_DISCRIMINATORS best
+ # probabilities, where "best" means largest distance from 0.5.
+ # The tuples are (distance, prob, word, record).
+ nbest = [(-1.0, None, None, None)] * options.max_discriminators
+ smallest_best = -1.0
+
+ wordinfoget = self.wordinfo.get
+ now = time.time()
+ for word in Set(wordstream):
+ record = wordinfoget(word)
+ if record is None:
+ prob = UNKNOWN_SPAMPROB
+ else:
+ record.atime = now
+ prob = record.spamprob
+
+ distance = abs(prob - 0.5)
+ if distance > smallest_best:
+ heapreplace(nbest, (distance, prob, word, record))
+ smallest_best = nbest[0][0]
+
+ clues = [(prob, word, record)
+ for distance, prob, word, record in nbest
+ if prob is not None]
+ return clues
#************************************************************************
***************
*** 599,603 ****
self.wordinfo[word] = record
-
if options.use_robinson_probability:
update_probabilities = robinson_update_probabilities
--- 656,744 ----
self.wordinfo[word] = record
if options.use_robinson_probability:
update_probabilities = robinson_update_probabilities
+
+ def central_limit_compute_population_stats(self, msgstream, is_spam):
+ from math import ldexp
+
+ sum = sumsq = 0
+ seen = {}
+ for msg in msgstream:
+ for prob, word, record in self._getclues(msg):
+ if word in seen:
+ continue
+ seen[word] = 1
+ prob = long(ldexp(prob, 64))
+ sum += prob
+ sumsq += prob * prob
+ n = len(seen)
+
+ if is_spam:
+ self.spamn, self.spamsum, self.spamsumsq = n, sum, sumsq
+ spamsum = self.spamsum
+ self.spammean = ldexp(spamsum, -64) / self.spamn
+ spamvar = self.spamsumsq * self.spamn - spamsum**2
+ self.spamvar = ldexp(spamvar, -128) / (self.spamn ** 2)
+ print 'spammean', self.spammean, 'spamvar', self.spamvar
+ else:
+ self.hamn, self.hamsum, self.hamsumsq = n, sum, sumsq
+ hamsum = self.hamsum
+ self.hammean = ldexp(hamsum, -64) / self.hamn
+ hamvar = self.hamsumsq * self.hamn - hamsum**2
+ self.hamvar = ldexp(hamvar, -128) / (self.hamn ** 2)
+ print 'hammean', self.hammean, 'hamvar', self.hamvar
+
+ if options.use_central_limit:
+ compute_population_stats = central_limit_compute_population_stats
+
+ def central_limit_spamprob(self, wordstream, evidence=False):
+ """Return best-guess probability that wordstream is spam.
+
+ wordstream is an iterable object producing words.
+ The return value is a float in [0.0, 1.0].
+
+ If optional arg evidence is True, the return value is a pair
+ probability, evidence
+ where evidence is a list of (word, probability) pairs.
+ """
+
+ from math import sqrt
+
+ clues = self._getclues(wordstream)
+ sum = 0.0
+ for prob, word, record in clues:
+ sum += prob
+ if record is not None:
+ record.killcount += 1
+ n = len(clues)
+ if n == 0:
+ return 0.5
+ mean = sum / n
+
+ # If this sample is drawn from the spam population, its mean is
+ # distributed around spammean with variance spamvar/n. Likewise
+ # for if it's drawn from the ham population. Compute a normalized
+ # z-score (how many stddevs is it away from the population mean?)
+ # against both populations, and then it's ham or spam depending
+ # on which population it matches better.
+ zham = (mean - self.hammean) / sqrt(self.hamvar / n)
+ zspam = (mean - self.spammean) / sqrt(self.spamvar / n)
+ stat = abs(zham) - abs(zspam) # > 0 for spam, < 0 for ham
+
+ # Normalize into [0, 1]. I'm arbitrarily clipping it to fit in
+ # [-20, 20] first. 20 is a massive z-score difference.
+ if stat < -20.0:
+ stat = -20.0
+ elif stat > 20.0:
+ stat = 20.0
+ stat = 0.5 + stat / 40.0
+
+ if evidence:
+ clues = [(word, prob) for prob, word, record in clues]
+ clues.sort(lambda a, b: cmp(a[1], b[1]))
+ return stat, clues
+ else:
+ return stat
+
+ if options.use_central_limit:
+ spamprob = central_limit_spamprob
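The decision rule inside `central_limit_spamprob` can be isolated as follows: z-score the sample mean against both population means (each population's variance shrinks by n for a sample of size n), take the difference of absolute z-scores, clip it to [-20, 20], and map linearly into [0, 1]. A sketch assuming the population statistics are already known (function name hypothetical):

```python
from math import sqrt

def central_limit_score(sample_mean, n, hammean, hamvar, spammean, spamvar):
    """Map a message's mean clue probability to a [0, 1] spam score
    by comparing z-scores against the ham and spam populations."""
    # Sample mean of n draws has variance popvar/n (central limit theorem).
    zham = (sample_mean - hammean) / sqrt(hamvar / n)
    zspam = (sample_mean - spammean) / sqrt(spamvar / n)
    stat = abs(zham) - abs(zspam)   # > 0 leans spam, < 0 leans ham
    # Clip to [-20, 20]; a 20-sigma difference is already decisive.
    stat = max(-20.0, min(20.0, stat))
    return 0.5 + stat / 40.0
```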
From tim_one@users.sourceforge.net Mon Sep 23 22:20:13 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 23 Sep 2002 14:20:13 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.25,1.26 cdb.py,1.3,1.4 cmp.py,1.10,1.11
hammie.py,1.19,1.20 hammiesrv.py,1.1,1.2 loosecksum.py,1.2,1.3
mboxtest.py,1.8,1.9 msgs.py,1.2,1.3 pop3proxy.py,1.3,1.4
setup.py,1.3,1.4 splitndirs.py,1.3,1.4 unheader.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9970
Modified Files:
Options.py cdb.py cmp.py hammie.py hammiesrv.py loosecksum.py
mboxtest.py msgs.py pop3proxy.py setup.py splitndirs.py
unheader.py
Log Message:
Whitespace normalization.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** Options.py 23 Sep 2002 21:19:08 -0000 1.25
--- Options.py 23 Sep 2002 21:20:10 -0000 1.26
***************
*** 301,303 ****
else:
options.mergefiles(['bayescustomize.ini'])
-
--- 301,302 ----
Index: cdb.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cdb.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** cdb.py 22 Sep 2002 06:58:36 -0000 1.3
--- cdb.py 23 Sep 2002 21:20:10 -0000 1.4
***************
*** 19,23 ****
def uint32_pack(n):
return struct.pack('<L', n)
if i > 0:
driver.untrain(hams, spams)
!
driver.test(hams, spams)
driver.finishtest()
--- 161,165 ----
if i > 0:
driver.untrain(hams, spams)
!
driver.test(hams, spams)
driver.finishtest()
***************
*** 167,171 ****
if i < NSETS - 1:
driver.train(hams, spams)
!
i += 1
driver.alldone()
--- 167,171 ----
if i < NSETS - 1:
driver.train(hams, spams)
!
i += 1
driver.alldone()
Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** msgs.py 23 Sep 2002 20:18:34 -0000 1.2
--- msgs.py 23 Sep 2002 21:20:10 -0000 1.3
***************
*** 79,81 ****
HAMKEEP, SPAMKEEP = hamkeep, spamkeep
if seed is not None:
! SEED = seed
\ No newline at end of file
--- 79,81 ----
HAMKEEP, SPAMKEEP = hamkeep, spamkeep
if seed is not None:
! SEED = seed
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** pop3proxy.py 23 Sep 2002 19:40:58 -0000 1.3
--- pop3proxy.py 23 Sep 2002 21:20:10 -0000 1.4
***************
*** 12,16 ****
defaults to 110.
! options (the same as hammie):
-p FILE : use the named data file
-d : the file is a DBM file rather than a pickle
--- 12,16 ----
defaults to 110.
! options (the same as hammie):
-p FILE : use the named data file
-d : the file is a DBM file rather than a pickle
***************
*** 46,50 ****
dispatchers created by a factory callable.
"""
!
def __init__(self, port, factory, factoryArgs=(),
socketMap=asyncore.socket_map):
--- 46,50 ----
dispatchers created by a factory callable.
"""
!
def __init__(self, port, factory, factoryArgs=(),
socketMap=asyncore.socket_map):
***************
*** 81,85 ****
server).
"""
!
def __init__(self, clientSocket, serverName, serverPort):
asynchat.async_chat.__init__(self, clientSocket)
--- 81,85 ----
server).
"""
!
def __init__(self, clientSocket, serverName, serverPort):
asynchat.async_chat.__init__(self, clientSocket)
***************
*** 90,98 ****
self.serverFile = serverSocket.makefile()
self.push(self.serverFile.readline())
!
def handle_connect(self):
"""Suppress the asyncore "unhandled connect event" warning."""
pass
!
def onTransaction(self, command, args, response):
"""Overide this. Takes the raw request and the response, and
--- 90,98 ----
self.serverFile = serverSocket.makefile()
self.push(self.serverFile.readline())
!
def handle_connect(self):
"""Suppress the asyncore "unhandled connect event" warning."""
pass
!
def onTransaction(self, command, args, response):
"""Overide this. Takes the raw request and the response, and
***************
*** 101,105 ****
"""
raise NotImplementedError
!
def isMultiline(self, command, args):
"""Returns True if the given request should get a multiline
--- 101,105 ----
"""
raise NotImplementedError
!
def isMultiline(self, command, args):
"""Returns True if the given request should get a multiline
***************
*** 116,120 ****
# Assume that unknown commands will get an error response.
return False
!
def readResponse(self, command, args):
"""Reads the POP3 server's response and returns a tuple of
--- 116,120 ----
# Assume that unknown commands will get an error response.
return False
!
def readResponse(self, command, args):
"""Reads the POP3 server's response and returns a tuple of
***************
*** 148,152 ****
# A normal line - append it to the response and carry on.
responseLines.append(line)
!
# Time out after 30 seconds - found_terminator() knows how
# to deal with this.
--- 148,152 ----
# A normal line - append it to the response and carry on.
responseLines.append(line)
!
# Time out after 30 seconds - found_terminator() knows how
# to deal with this.
***************
*** 154,166 ****
timedOut = True
break
!
isFirstLine = False
!
return ''.join(responseLines), isClosing, timedOut
!
def collect_incoming_data(self, data):
"""Asynchat override."""
self.request = self.request + data
!
def found_terminator(self):
"""Asynchat override."""
--- 154,166 ----
timedOut = True
break
!
isFirstLine = False
!
return ''.join(responseLines), isClosing, timedOut
!
def collect_incoming_data(self, data):
"""Asynchat override."""
self.request = self.request + data
!
def found_terminator(self):
"""Asynchat override."""
***************
*** 183,187 ****
args = splitCommand[1:]
rawResponse, isClosing, timedOut = self.readResponse(command, args)
!
# Pass the request and the raw response to the subclass and
# send back the cooked response.
--- 183,187 ----
args = splitCommand[1:]
rawResponse, isClosing, timedOut = self.readResponse(command, args)
!
# Pass the request and the raw response to the subclass and
# send back the cooked response.
***************
*** 189,193 ****
self.push(cookedResponse)
self.request = ''
!
# If readResponse() timed out, we still need to read and proxy
# the rest of the message.
--- 189,193 ----
self.push(cookedResponse)
self.request = ''
!
# If readResponse() timed out, we still need to read and proxy
# the rest of the message.
***************
*** 206,210 ****
# A normal line.
self.push(line)
!
# If readResponse() or the loop above decided that the server
# has closed its socket, close this one when the response has
--- 206,210 ----
# A normal line.
self.push(line)
!
# If readResponse() or the loop above decided that the server
# has closed its socket, close this one when the response has
***************
*** 212,216 ****
if isClosing:
self.close_when_done()
!
def handle_error(self):
"""Let SystemExit cause an exit."""
--- 212,216 ----
if isClosing:
self.close_when_done()
!
def handle_error(self):
"""Let SystemExit cause an exit."""
***************
*** 220,224 ****
else:
asynchat.async_chat.handle_error(self)
!
class BayesProxyListener(Listener):
--- 220,224 ----
else:
asynchat.async_chat.handle_error(self)
!
class BayesProxyListener(Listener):
***************
*** 226,230 ****
BayesProxy objects to serve them.
"""
!
def __init__(self, serverName, serverPort, proxyPort, bayes):
proxyArgs = (serverName, serverPort, bayes)
--- 226,230 ----
BayesProxy objects to serve them.
"""
!
def __init__(self, serverName, serverPort, proxyPort, bayes):
proxyArgs = (serverName, serverPort, bayes)
***************
*** 235,243 ****
"""Proxies between an email client and a POP3 server, inserting
judgement headers. It acts on the following POP3 commands:
!
o STAT:
o Adds the size of all the judgement headers to the maildrop
size.
!
o LIST:
o With no message number: adds the size of an judgement header
--- 235,243 ----
"""Proxies between an email client and a POP3 server, inserting
judgement headers. It acts on the following POP3 commands:
!
o STAT:
o Adds the size of all the judgement headers to the maildrop
size.
!
o LIST:
o With no message number: adds the size of an judgement header
***************
*** 245,253 ****
o With a message number: adds the size of an judgement header
to the message size.
!
o RETR:
o Adds the judgement header based on the raw headers and body
of the message.
!
o TOP:
o Adds the judgement header based on the raw headers and as
--- 245,253 ----
o With a message number: adds the size of an judgement header
to the message size.
!
o RETR:
o Adds the judgement header based on the raw headers and body
of the message.
!
o TOP:
o Adds the judgement header based on the raw headers and as
***************
*** 268,272 ****
self.handlers = {'STAT': self.onStat, 'LIST': self.onList,
'RETR': self.onRetr, 'TOP': self.onTop}
!
def send(self, data):
"""Logs the data to the log file."""
--- 268,272 ----
self.handlers = {'STAT': self.onStat, 'LIST': self.onList,
'RETR': self.onRetr, 'TOP': self.onTop}
!
def send(self, data):
"""Logs the data to the log file."""
***************
*** 274,278 ****
self.logFile.flush()
return POP3ProxyBase.send(self, data)
!
def recv(self, size):
"""Logs the data to the log file."""
--- 274,278 ----
self.logFile.flush()
return POP3ProxyBase.send(self, data)
!
def recv(self, size):
"""Logs the data to the log file."""
***************
*** 281,285 ****
self.logFile.flush()
return data
!
def onTransaction(self, command, args, response):
"""Takes the raw request and response, and returns the
--- 281,285 ----
self.logFile.flush()
return data
!
def onTransaction(self, command, args, response):
"""Takes the raw request and response, and returns the
***************
*** 299,303 ****
else:
return response
!
def onList(self, command, args, response):
"""Adds the size of an judgement header to the message
--- 299,303 ----
else:
return response
!
def onList(self, command, args, response):
"""Adds the size of an judgement header to the message
***************
*** 323,327 ****
else:
return response
!
def onRetr(self, command, args, response):
"""Adds the judgement header based on the raw headers and body
--- 323,327 ----
else:
return response
!
def onRetr(self, command, args, response):
"""Adds the judgement header based on the raw headers and body
***************
*** 332,336 ****
# Break off the first line, which will be '+OK'.
ok, message = response.split('\n', 1)
!
# Now find the spam disposition and add the header. The
# trailing space in "No " ensures consistent lengths - this
--- 332,336 ----
# Break off the first line, which will be '+OK'.
ok, message = response.split('\n', 1)
!
# Now find the spam disposition and add the header. The
# trailing space in "No " ensures consistent lengths - this
***************
*** 412,416 ****
"""Listener for TestPOP3Server. Works on port 8110, to co-exist
with real POP3 servers."""
!
def __init__(self, socketMap=asyncore.socket_map):
Listener.__init__(self, 8110, TestPOP3Server, socketMap=socketMap)
--- 412,416 ----
"""Listener for TestPOP3Server. Works on port 8110, to co-exist
with real POP3 servers."""
!
def __init__(self, socketMap=asyncore.socket_map):
Listener.__init__(self, 8110, TestPOP3Server, socketMap=socketMap)
***************
*** 423,427 ****
kill it. The mail content is the example messages above.
"""
!
def __init__(self, clientSocket, socketMap=asyncore.socket_map):
# Grumble: asynchat.__init__ doesn't take a 'map' argument,
--- 423,427 ----
kill it. The mail content is the example messages above.
"""
!
def __init__(self, clientSocket, socketMap=asyncore.socket_map):
# Grumble: asynchat.__init__ doesn't take a 'map' argument,
***************
*** 438,450 ****
self.push("+OK ready\r\n")
self.request = ''
!
def handle_connect(self):
"""Suppress the asyncore "unhandled connect event" warning."""
pass
!
def collect_incoming_data(self, data):
"""Asynchat override."""
self.request = self.request + data
!
def found_terminator(self):
"""Asynchat override."""
--- 438,450 ----
self.push("+OK ready\r\n")
self.request = ''
!
def handle_connect(self):
"""Suppress the asyncore "unhandled connect event" warning."""
pass
!
def collect_incoming_data(self, data):
"""Asynchat override."""
self.request = self.request + data
!
def found_terminator(self):
"""Asynchat override."""
***************
*** 464,468 ****
self.push(handler(command, args))
self.request = ''
!
def handle_error(self):
"""Let SystemExit cause an exit."""
--- 464,468 ----
self.push(handler(command, args))
self.request = ''
!
def handle_error(self):
"""Let SystemExit cause an exit."""
***************
*** 472,476 ****
else:
asynchat.async_chat.handle_error(self)
!
def onStat(self, command, args):
"""POP3 STAT command."""
--- 472,476 ----
else:
asynchat.async_chat.handle_error(self)
!
def onStat(self, command, args):
"""POP3 STAT command."""
***************
*** 478,482 ****
maildropSize += len(self.maildrop) * len(HEADER_EXAMPLE)
return "+OK %d %d\r\n" % (len(self.maildrop), maildropSize)
!
def onList(self, command, args):
"""POP3 LIST command, with optional message number argument."""
--- 478,482 ----
maildropSize += len(self.maildrop) * len(HEADER_EXAMPLE)
return "+OK %d %d\r\n" % (len(self.maildrop), maildropSize)
!
def onList(self, command, args):
"""POP3 LIST command, with optional message number argument."""
***************
*** 494,498 ****
returnLines.append(".")
return '\r\n'.join(returnLines) + '\r\n'
!
def onRetr(self, command, args):
"""POP3 RETR command."""
--- 494,498 ----
returnLines.append(".")
return '\r\n'.join(returnLines) + '\r\n'
!
def onRetr(self, command, args):
"""POP3 RETR command."""
***************
*** 522,526 ****
testServerReady.set()
asyncore.loop(map=testSocketMap)
!
def runProxy():
# Name the database in case it ever gets auto-flushed to disk.
--- 522,526 ----
testServerReady.set()
asyncore.loop(map=testSocketMap)
!
def runProxy():
# Name the database in case it ever gets auto-flushed to disk.
***************
*** 534,543 ****
testServerReady.wait()
threading.Thread(target=runProxy).start()
!
# Connect to the proxy.
proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
proxy.connect(('localhost', 8111))
assert proxy.recv(100) == "+OK ready\r\n"
!
# Stat the mailbox to get the number of messages.
proxy.send("stat\r\n")
--- 534,543 ----
testServerReady.wait()
threading.Thread(target=runProxy).start()
!
# Connect to the proxy.
proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
proxy.connect(('localhost', 8111))
assert proxy.recv(100) == "+OK ready\r\n"
!
# Stat the mailbox to get the number of messages.
proxy.send("stat\r\n")
***************
*** 546,550 ****
print "%d messages in test mailbox" % count
assert count == 2
!
# Loop through the messages ensuring that they have judgement
# headers.
--- 546,550 ----
print "%d messages in test mailbox" % count
assert count == 2
!
# Loop through the messages ensuring that they have judgement
# headers.
***************
*** 559,563 ****
header = response[headerOffset:headerEnd].strip()
print "Message %d: %s" % (i, header)
!
# Kill the proxy and the test server.
proxy.sendall("kill\r\n")
--- 559,563 ----
header = response[headerOffset:headerEnd].strip()
print "Message %d: %s" % (i, header)
!
# Kill the proxy and the test server.
proxy.sendall("kill\r\n")
***************
*** 592,596 ****
elif opt == '-p':
pickleName = arg
!
# Do whatever we've been asked to do...
if not opts and not args:
--- 592,596 ----
elif opt == '-p':
pickleName = arg
!
# Do whatever we've been asked to do...
if not opts and not args:
***************
*** 598,615 ****
test()
print "Self-test passed." # ...else it would have asserted.
!
elif runTestServer:
print "Running a test POP3 server on port 8110..."
TestListener()
asyncore.loop()
!
elif len(args) == 1:
# Named POP3 server, default port.
main(args[0], 110, 110, pickleName, useDB)
!
elif len(args) == 2:
# Named POP3 server, named port.
main(args[0], int(args[1]), 110, pickleName, useDB)
!
else:
print >>sys.stderr, __doc__
--- 598,615 ----
test()
print "Self-test passed." # ...else it would have asserted.
!
elif runTestServer:
print "Running a test POP3 server on port 8110..."
TestListener()
asyncore.loop()
!
elif len(args) == 1:
# Named POP3 server, default port.
main(args[0], 110, 110, pickleName, useDB)
!
elif len(args) == 2:
# Named POP3 server, named port.
main(args[0], int(args[1]), 110, pickleName, useDB)
!
else:
print >>sys.stderr, __doc__
Index: setup.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/setup.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** setup.py 7 Sep 2002 16:15:45 -0000 1.3
--- setup.py 23 Sep 2002 21:20:10 -0000 1.4
***************
*** 2,8 ****
setup(
! name='spambayes',
scripts=['unheader.py', 'hammie.py'],
py_modules=['classifier', 'tokenizer']
)
-
--- 2,7 ----
setup(
! name='spambayes',
scripts=['unheader.py', 'hammie.py'],
py_modules=['classifier', 'tokenizer']
)
Index: splitndirs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** splitndirs.py 20 Sep 2002 20:00:45 -0000 1.3
--- splitndirs.py 23 Sep 2002 21:20:10 -0000 1.4
***************
*** 115,117 ****
if __name__ == '__main__':
main()
-
--- 115,116 ----
Index: unheader.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/unheader.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** unheader.py 22 Sep 2002 06:58:36 -0000 1.2
--- unheader.py 23 Sep 2002 21:20:10 -0000 1.3
***************
*** 29,56 ****
def deSA(msg):
if msg['X-Spam-Status']:
! if msg['X-Spam-Status'].startswith('Yes'):
! pct = msg['X-Spam-Prev-Content-Type']
! if pct:
! msg['Content-Type'] = pct
! pcte = msg['X-Spam-Prev-Content-Transfer-Encoding']
! if pcte:
! msg['Content-Transfer-Encoding'] = pcte
! subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', msg['Subject'])
if subj != msg["Subject"]:
msg.replace_header("Subject", subj)
! body = msg.get_payload()
! newbody = []
! at_start = 1
! for line in body.splitlines():
! if at_start and line.startswith('SPAM: '):
! continue
! elif at_start:
! at_start = 0
! else:
! newbody.append(line)
! msg.set_payload("\n".join(newbody))
unheader(msg, "X-Spam-")
--- 29,56 ----
def deSA(msg):
if msg['X-Spam-Status']:
! if msg['X-Spam-Status'].startswith('Yes'):
! pct = msg['X-Spam-Prev-Content-Type']
! if pct:
! msg['Content-Type'] = pct
! pcte = msg['X-Spam-Prev-Content-Transfer-Encoding']
! if pcte:
! msg['Content-Transfer-Encoding'] = pcte
! subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', msg['Subject'])
if subj != msg["Subject"]:
msg.replace_header("Subject", subj)
! body = msg.get_payload()
! newbody = []
! at_start = 1
! for line in body.splitlines():
! if at_start and line.startswith('SPAM: '):
! continue
! elif at_start:
! at_start = 0
! else:
! newbody.append(line)
! msg.set_payload("\n".join(newbody))
unheader(msg, "X-Spam-")
From gvanrossum@users.sourceforge.net Mon Sep 23 22:46:37 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Mon, 23 Sep 2002 14:46:37 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18710
Modified Files:
cmp.py
Log Message:
Changed CRLF to LF. (Some, but not all line endings were CRLF since
bkc's checkin.)
There's also a bug here: I ran this with rates.py output from a
previous version and it said
UnboundLocalError: local variable 'hamdevall' referenced before
assignment
But I don't know what value to initialize it (and spamdevall) to.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** cmp.py 23 Sep 2002 21:20:10 -0000 1.11
--- cmp.py 23 Sep 2002 21:46:34 -0000 1.12
***************
*** 22,55 ****
def suck(f):
fns = []
! fps = []
! hamdev = []
! spamdev = []
get = f.readline
while 1:
line = get()
! if line.startswith('-> tested'):
print line,
! if line.find('sample sdev') != -1:
! vals = line.split(';')
! mean = float(vals[1].split(' ')[-1])
! sdev = float(vals[2].split(' ')[-1])
! val = (mean,sdev)
! typ = vals[0].split(' ')[2]
! if line.find('for all runs') != -1:
! if typ == 'Ham':
! hamdevall = val
! else:
! spamdevall = val
! elif line.find('all in this') != -1:
! if typ == 'Ham':
! hamdev.append(val)
! else:
! spamdev.append(val)
continue
if line.startswith('-> '):
continue
if line.startswith('total'):
! break
# A line with an f-p rate and an f-n rate.
p, n = map(float, line.split())
--- 22,55 ----
def suck(f):
fns = []
! fps = []
! hamdev = []
! spamdev = []
get = f.readline
while 1:
line = get()
! if line.startswith('-> tested'):
print line,
! if line.find('sample sdev') != -1:
! vals = line.split(';')
! mean = float(vals[1].split(' ')[-1])
! sdev = float(vals[2].split(' ')[-1])
! val = (mean,sdev)
! typ = vals[0].split(' ')[2]
! if line.find('for all runs') != -1:
! if typ == 'Ham':
! hamdevall = val
! else:
! spamdevall = val
! elif line.find('all in this') != -1:
! if typ == 'Ham':
! hamdev.append(val)
! else:
! spamdev.append(val)
continue
if line.startswith('-> '):
continue
if line.startswith('total'):
! break
# A line with an f-p rate and an f-n rate.
p, n = map(float, line.split())
***************
*** 78,92 ****
t += " +(was 0)"
return t
!
! def mtag(m1,m2):
! mean1,dev1 = m1
! mean2,dev2 = m2
! mp = (mean2 - mean1) * 100.0 / mean1
! dp = (dev2 - dev1) * 100.0 / dev1
!
! return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % (
! mean1,mean2,mp,
! dev1,dev2,dp
! )
def dump(p1s, p2s):
--- 78,92 ----
t += " +(was 0)"
return t
!
! def mtag(m1,m2):
! mean1,dev1 = m1
! mean2,dev2 = m2
! mp = (mean2 - mean1) * 100.0 / mean1
! dp = (dev2 - dev1) * 100.0 / dev1
!
! return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % (
! mean1,mean2,mp,
! dev1,dev2,dp
! )
def dump(p1s, p2s):
***************
*** 100,105 ****
print "%-4s %2d times" % (t, alltags.count(t))
print
!
! def dumpdev(meandev1,meandev2):
for m1,m2 in zip(meandev1,meandev2):
print mtag(m1, m2)
--- 100,105 ----
print "%-4s %2d times" % (t, alltags.count(t))
print
!
! def dumpdev(meandev1,meandev2):
for m1,m2 in zip(meandev1,meandev2):
print mtag(m1, m2)
***************
*** 132,151 ****
print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2)
print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2)
!
! print
! print "ham mean ham sdev"
! dumpdev(hamdev1,hamdev2)
! print
! print "ham mean and sdev for all runs"
! dumpdev([hamdevall1],[hamdevall2])
!
! print
! print "spam mean spam sdev"
! dumpdev(spamdev1,spamdev2)
! print
! print "spam mean and sdev for all runs"
! dumpdev([spamdevall1],[spamdevall2])
! print
! diff1 = spamdevall1[0] - hamdevall1[0]
! diff2 = spamdevall2[0] - hamdevall2[0]
! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,diff2,(diff2-diff1))
--- 132,151 ----
print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2)
print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2)
!
! print
! print "ham mean ham sdev"
! dumpdev(hamdev1,hamdev2)
! print
! print "ham mean and sdev for all runs"
! dumpdev([hamdevall1],[hamdevall2])
!
! print
! print "spam mean spam sdev"
! dumpdev(spamdev1,spamdev2)
! print
! print "spam mean and sdev for all runs"
! dumpdev([spamdevall1],[spamdevall2])
! print
! diff1 = spamdevall1[0] - hamdevall1[0]
! diff2 = spamdevall2[0] - hamdevall2[0]
! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,diff2,(diff2-diff1))
From bkc@users.sourceforge.net Mon Sep 23 23:41:06 2002
From: bkc@users.sourceforge.net (Brad Clements)
Date: Mon, 23 Sep 2002 15:41:06 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.8,1.9
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3177
Modified Files:
TestDriver.py
Log Message:
allow global ham and spam histogram to be saved to a binary pickle
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** TestDriver.py 23 Sep 2002 21:19:08 -0000 1.8
--- TestDriver.py 23 Sep 2002 22:41:04 -0000 1.9
***************
*** 165,168 ****
--- 165,176 ----
if options.show_histograms:
printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
+
+ if options.save_histogram_pickles:
+ for f, h in (('ham', self.global_ham_hist), ('spam', self.global_spam_hist)):
+ fname = "%s_%shist.pik" % (options.pickle_basename, f)
+ print " saving %s histogram pickle to %s" %(f, fname)
+ fp = file(fname, 'wb')
+ pickle.dump(h, fp, 1)
+ fp.close()
def test(self, ham, spam):
From bkc@users.sourceforge.net Mon Sep 23 23:41:55 2002
From: bkc@users.sourceforge.net (Brad Clements)
Date: Mon, 23 Sep 2002 15:41:55 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.26,1.27
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3385
Modified Files:
Options.py
Log Message:
Add option to save global spam and ham histograms to pickles
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** Options.py 23 Sep 2002 21:20:10 -0000 1.26
--- Options.py 23 Sep 2002 22:41:52 -0000 1.27
***************
*** 141,147 ****
# name already exists, it's overwritten. pickle_basename is ignored when
# save_trained_pickles is false.
save_trained_pickles: False
! pickle_basename: class
[Classifier]
--- 141,153 ----
# name already exists, it's overwritten. pickle_basename is ignored when
# save_trained_pickles is false.
+
+ # if save_histogram_pickles is true, Driver.train() saves a binary
+ # pickle of the spam and ham histogram for "all test runs". The file
+ # basename is given by pickle_basename, the suffix _spamhist.pik
+ # or _hamhist.pik is appended to the basename.
save_trained_pickles: False
! pickle_basename: class
! save_histogram_pickles: False
[Classifier]
***************
*** 226,229 ****
--- 232,236 ----
'show_best_discriminators': int_cracker,
'save_trained_pickles': boolean_cracker,
+ 'save_histogram_pickles': boolean_cracker,
'pickle_basename': string_cracker,
'show_charlimit': int_cracker,
From bkc@users.sourceforge.net Tue Sep 24 00:30:09 2002
From: bkc@users.sourceforge.net (Brad Clements)
Date: Mon, 23 Sep 2002 16:30:09 -0700
Subject: [Spambayes-checkins] spambayes HistToGNU.py,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16649
Added Files:
HistToGNU.py
Log Message:
Initial version, convert hist pickles to gnuplot input
--- NEW FILE: HistToGNU.py ---
#! /usr/bin/env python
"""HistToGNU.py
Convert saved binary pickle of histograms to gnuplot output
"""
"""Usage: %(program)s [options] [histogrampicklefile ...]
reads pickle filename from options if not specified
writes to stdout
"""
globalOptions = """
set grid
set xtics 5
set xrange [0.0:100.0]
"""
dataSetOptions="smooth unique"
from Options import options
from TestDriver import Hist
import sys
import cPickle as pickle
program = sys.argv[0]
def usage(code, msg=''):
"""Print usage message and sys.exit(code)."""
if msg:
print >> sys.stderr, msg
print >> sys.stderr
print >> sys.stderr, __doc__ % globals()
sys.exit(code)
def loadHist(path):
"""Load the histogram pickle object"""
return pickle.load(file(path))
def outputHist(hist,f=sys.stdout):
"""Output the Hist object to file f"""
for i in range(len(hist.buckets)):
n = hist.buckets[i]
if n:
f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n))
def plot(files):
"""given a list of files, create gnu-plot file"""
import cStringIO, os
cmd = cStringIO.StringIO()
cmd.write(globalOptions)
args = []
for file in files:
args.append("""'-' %s title "%s" """ % (dataSetOptions,file))
cmd.write('plot %s\n' % ",".join(args))
for file in files:
outputHist(loadHist(file),cmd)
cmd.write('e\n')
cmd.write('pause 100\n')
print cmd.getvalue()
def main():
import getopt
try:
opts, args = getopt.getopt(sys.argv[1:], '',
[])
except getopt.error, msg:
usage(1, msg)
if not args and options.save_histogram_pickles:
args = []
for f in ('ham', 'spam'):
fname = "%s_%shist.pik" % (options.pickle_basename, f)
args.append(fname)
if args:
plot(args)
else:
print "could not locate any files to plot"
if __name__ == "__main__":
main()
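For reference, the script `plot()` builds uses gnuplot's inline-data form: each `'-'` dataset named in the `plot` command is followed by its "percent count" pairs and terminated by a lone `e`. A minimal sketch for a single histogram (function name hypothetical; note that gnuplot expects one `e` terminator per inline dataset):

```python
import io

def hist_to_gnuplot(buckets, dataset_options="smooth unique"):
    """Emit a gnuplot script plotting one histogram as inline data:
    a plot command reading from '-', one "percent count" line per
    non-empty bucket, terminated by 'e'."""
    out = io.StringIO()
    out.write("plot '-' %s title \"hist\"\n" % dataset_options)
    nbuckets = len(buckets)
    for i, n in enumerate(buckets):
        if n:
            # Bucket index mapped to a percentage of the score range.
            out.write("%.3f %d\n" % (100.0 * i / nbuckets, n))
    out.write("e\n")
    return out.getvalue()
```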
From tim_one@users.sourceforge.net Tue Sep 24 01:37:34 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 23 Sep 2002 17:37:34 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.23,1.24
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1923
Modified Files:
README.txt
Log Message:
Updated the blurb about requiring 2.3.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** README.txt 23 Sep 2002 20:03:06 -0000 1.23
--- README.txt 24 Sep 2002 00:37:32 -0000 1.24
***************
*** 24,29 ****
too small to measure reliably across that much training data.
! The code here depends in various ways on the latest Python from CVS
! (a.k.a. Python 2.3a0 :-).
--- 24,28 ----
too small to measure reliably across that much training data.
! The code in this project requires Python 2.2.1 (or later).
From tim_one@users.sourceforge.net Tue Sep 24 01:38:39 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 23 Sep 2002 17:38:39 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.20,1.21
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2032
Modified Files:
hammie.py
Log Message:
Removed an obsolete 2.3 comment -- or maybe it isn't obsolete? If
hammie.py really requires 2.3, somebody put the comment back in.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** hammie.py 23 Sep 2002 21:20:10 -0000 1.20
--- hammie.py 24 Sep 2002 00:38:37 -0000 1.21
***************
*** 1,4 ****
#! /usr/bin/env python
- # At the moment, this requires Python 2.3 from CVS
# A driver for the classifier module and Tim's tokenizer that you can
--- 1,3 ----
From tim_one@users.sourceforge.net Tue Sep 24 01:39:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 23 Sep 2002 17:39:08 -0700
Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv2336
Modified Files:
HistToGNU.py
Log Message:
Whitespace normalization.
Index: HistToGNU.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** HistToGNU.py 23 Sep 2002 23:30:07 -0000 1.1
--- HistToGNU.py 24 Sep 2002 00:39:06 -0000 1.2
***************
*** 88,90 ****
if __name__ == "__main__":
main()
-
--- 88,89 ----
From guido@python.org Tue Sep 24 01:58:08 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 23 Sep 2002 20:58:08 -0400
Subject: [Spambayes-checkins] spambayes hammie.py,1.20,1.21
In-Reply-To: Your message of "Mon, 23 Sep 2002 17:38:39 PDT."
References:
Message-ID: <200209240058.g8O0w8o20276@pcp02138704pcs.reston01.va.comcast.net>
> Removed an obsolete 2.3 comment -- or maybe it isn't obsolete? If
> hammie.py really requires 2.3, somebody put the comment back in.
No, I tested it successfully with 2.2.1 last night.
--Guido van Rossum (home page: http://www.python.org/~guido/)
From tim_one@users.sourceforge.net Tue Sep 24 04:29:51 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Mon, 23 Sep 2002 20:29:51 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.27,1.28 classifier.py,1.18,1.19
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9255
Modified Files:
Options.py classifier.py
Log Message:
New option use_central_limit2 is Gary Robinson's logarithmic variation of
the central-limit code.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** Options.py 23 Sep 2002 22:41:52 -0000 1.27
--- Options.py 24 Sep 2002 03:29:48 -0000 1.28
***************
*** 203,206 ****
--- 203,210 ----
# square roots. An NxN test grid should work fine.
use_central_limit: False
+
+ # Same as use_central_limit, except takes logarithms of probabilities and
+ # probability complements (p and 1-p) instead.
+ use_central_limit2: False
"""
***************
*** 251,254 ****
--- 255,259 ----
'use_central_limit': boolean_cracker,
+ 'use_central_limit2': boolean_cracker,
},
}
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** classifier.py 23 Sep 2002 21:19:08 -0000 1.18
--- classifier.py 24 Sep 2002 03:29:48 -0000 1.19
***************
*** 743,744 ****
--- 743,838 ----
if options.use_central_limit:
spamprob = central_limit_spamprob
+
+
+
+
+ def central_limit_compute_population_stats2(self, msgstream, is_spam):
+ from math import ldexp, log
+
+ sum = sumsq = 0
+ seen = {}
+ for msg in msgstream:
+ for prob, word, record in self._getclues(msg):
+ if word in seen:
+ continue
+ seen[word] = 1
+ if is_spam:
+ prob = log(prob)
+ else:
+ prob = log(1.0 - prob)
+ prob = long(ldexp(prob, 64))
+ sum += prob
+ sumsq += prob * prob
+ n = len(seen)
+
+ if is_spam:
+ self.spamn, self.spamsum, self.spamsumsq = n, sum, sumsq
+ spamsum = self.spamsum
+ self.spammean = ldexp(spamsum, -64) / self.spamn
+ spamvar = self.spamsumsq * self.spamn - spamsum**2
+ self.spamvar = ldexp(spamvar, -128) / (self.spamn ** 2)
+ print 'spammean', self.spammean, 'spamvar', self.spamvar
+ else:
+ self.hamn, self.hamsum, self.hamsumsq = n, sum, sumsq
+ hamsum = self.hamsum
+ self.hammean = ldexp(hamsum, -64) / self.hamn
+ hamvar = self.hamsumsq * self.hamn - hamsum**2
+ self.hamvar = ldexp(hamvar, -128) / (self.hamn ** 2)
+ print 'hammean', self.hammean, 'hamvar', self.hamvar
+
+ if options.use_central_limit2:
+ compute_population_stats = central_limit_compute_population_stats2
+
+ def central_limit_spamprob2(self, wordstream, evidence=False):
+ """Return best-guess probability that wordstream is spam.
+
+ wordstream is an iterable object producing words.
+ The return value is a float in [0.0, 1.0].
+
+ If optional arg evidence is True, the return value is a pair
+ probability, evidence
+ where evidence is a list of (word, probability) pairs.
+ """
+
+ from math import sqrt, log
+
+ clues = self._getclues(wordstream)
+ hsum = ssum = 0.0
+ for prob, word, record in clues:
+ ssum += log(prob)
+ hsum += log(1.0 - prob)
+ if record is not None:
+ record.killcount += 1
+ n = len(clues)
+ if n == 0:
+ return 0.5
+ hmean = hsum / n
+ smean = ssum / n
+
+ # If this sample is drawn from the spam population, its mean is
+ # distributed around spammean with variance spamvar/n. Likewise
+ # for if it's drawn from the ham population. Compute a normalized
+ # z-score (how many stddevs is it away from the population mean?)
+ # against both populations, and then it's ham or spam depending
+ # on which population it matches better.
+ zham = (hmean - self.hammean) / sqrt(self.hamvar / n)
+ zspam = (smean - self.spammean) / sqrt(self.spamvar / n)
+ stat = abs(zham) - abs(zspam) # > 0 for spam, < 0 for ham
+
+ # Normalize into [0, 1]. I'm arbitrarily clipping it to fit in
+ # [-20, 20] first. 20 is a massive z-score difference.
+ if stat < -20.0:
+ stat = -20.0
+ elif stat > 20.0:
+ stat = 20.0
+ stat = 0.5 + stat / 40.0
+
+ if evidence:
+ clues = [(word, prob) for prob, word, record in clues]
+ clues.sort(lambda a, b: cmp(a[1], b[1]))
+ return stat, clues
+ else:
+ return stat
+
+ if options.use_central_limit2:
+ spamprob = central_limit_spamprob2
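The decision rule spelled out in the comments of central_limit_spamprob2 can be sketched in isolation. This is an illustrative rewrite, not code from classifier.py; both function names are made up:

```python
from math import sqrt

def zscore_decision(sample_mean, pop_mean, pop_var, n):
    # How many standard deviations the mean of n clues lies from the
    # population mean, whose per-clue variance is pop_var.
    return (sample_mean - pop_mean) / sqrt(pop_var / n)

def clipped_stat(zham, zspam, clip=20.0):
    # abs(zham) - abs(zspam) is > 0 for spam, < 0 for ham; clip it to
    # [-clip, clip] and rescale into [0, 1] so 0.5 means "no idea".
    stat = abs(zham) - abs(zspam)
    stat = max(-clip, min(clip, stat))
    return 0.5 + stat / (2.0 * clip)
```

With the default clip of 20, equal distances from both populations score 0.5, and a message 20+ stddevs closer to the spam population scores 1.0.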
From anthonybaxter@users.sourceforge.net Tue Sep 24 06:37:14 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 23 Sep 2002 22:37:14 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.28,1.29 timcv.py,1.8,1.9 timtest.py,1.28,1.29
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5802
Modified Files:
Options.py timcv.py timtest.py
Log Message:
Made the Data/Ham/SetN and Data/Spam/SetN things options that can be
overridden. Don't see why the rest of us should do things this way just
because Tim thinks it's the correct way to do things.
More importantly, it means you can do test runs with different corpuses
(corpuscles? corpi? corpen?) at the same time.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** Options.py 24 Sep 2002 03:29:48 -0000 1.28
--- Options.py 24 Sep 2002 05:37:11 -0000 1.29
***************
*** 151,154 ****
--- 151,160 ----
save_histogram_pickles: False
+ # default locations for timcv and timtest - these get the set number
+ # appended.
+ spam_directories: Data/Spam/Set%d
+ ham_directories: Data/Ham/Set%d
+
+
[Classifier]
# Fiddling these can have extreme effects. See classifier.py for comments.
***************
*** 240,243 ****
--- 246,251 ----
'show_charlimit': int_cracker,
'spam_cutoff': float_cracker,
+ 'spam_directories': string_cracker,
+ 'ham_directories': string_cracker,
},
'Classifier': {'hambias': float_cracker,
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** timcv.py 23 Sep 2002 20:18:34 -0000 1.8
--- timcv.py 24 Sep 2002 05:37:11 -0000 1.9
***************
*** 52,57 ****
print options.display()
! hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets+1)]
! spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)]
d = TestDriver.Driver()
--- 52,57 ----
print options.display()
! hamdirs = [options.ham_directories % i for i in range(1, nsets+1)]
! spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
d = TestDriver.Driver()
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** timtest.py 23 Sep 2002 20:18:34 -0000 1.28
--- timtest.py 24 Sep 2002 05:37:11 -0000 1.29
***************
*** 54,59 ****
print options.display()
! spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)]
! hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets+1)]
spamhamdirs = zip(spamdirs, hamdirs)
--- 54,59 ----
print options.display()
! spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
! hamdirs = [options.ham_directories % i for i in range(1, nsets+1)]
spamhamdirs = zip(spamdirs, hamdirs)
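The change above replaces hard-coded list comprehensions with option-driven %d templates. A tiny sketch of the interpolation (the helper name is illustrative, not part of the checkin):

```python
def set_dirs(template, nsets):
    # Expand a template like "Data/Spam/Set%d" into one directory
    # name per set, numbered 1..nsets.
    return [template % i for i in range(1, nsets + 1)]
```

So set_dirs("Data/Ham/Set%d", 3) yields Data/Ham/Set1 through Data/Ham/Set3, and overriding ham_directories in an .ini file pointing at a different corpus changes every driver at once.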
From anthonybaxter@users.sourceforge.net Tue Sep 24 06:37:56 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 23 Sep 2002 22:37:56 -0700
Subject: [Spambayes-checkins] spambayes/email .cvsignore,NONE,1.1
Message-ID:
Update of /cvsroot/spambayes/spambayes/email
In directory usw-pr-cvs1:/tmp/cvs-serv6305
Added Files:
.cvsignore
Log Message:
silence mr. cvs
--- NEW FILE: .cvsignore ---
*.pyc
*.pyo
From anthonybaxter@users.sourceforge.net Tue Sep 24 07:13:32 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Mon, 23 Sep 2002 23:13:32 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.29,1.30
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13054
Modified Files:
Options.py
Log Message:
corrected comment. fixed line endings (mixed dos and unix, ick)
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** Options.py 24 Sep 2002 05:37:11 -0000 1.29
--- Options.py 24 Sep 2002 06:13:29 -0000 1.30
***************
*** 141,159 ****
# name already exists, it's overwritten. pickle_basename is ignored when
# save_trained_pickles is false.
!
! # if save_histogram_pickles is true, Driver.train() saves a binary
! # pickle of the spam and ham histogram for "all test runs". The file
! # basename is given by pickle_basename, the suffix _spamhist.pik
! # or _hamhist.pik is appended to the basename.
save_trained_pickles: False
! pickle_basename: class
save_histogram_pickles: False
# default locations for timcv and timtest - these get the set number
! # appended.
spam_directories: Data/Spam/Set%d
ham_directories: Data/Ham/Set%d
-
[Classifier]
--- 141,158 ----
# name already exists, it's overwritten. pickle_basename is ignored when
# save_trained_pickles is false.
!
! # if save_histogram_pickles is true, Driver.train() saves a binary
! # pickle of the spam and ham histogram for "all test runs". The file
! # basename is given by pickle_basename, the suffix _spamhist.pik
! # or _hamhist.pik is appended to the basename.
save_trained_pickles: False
! pickle_basename: class
save_histogram_pickles: False
# default locations for timcv and timtest - these get the set number
! # interpolated.
spam_directories: Data/Spam/Set%d
ham_directories: Data/Ham/Set%d
[Classifier]
From anthonybaxter@users.sourceforge.net Tue Sep 24 09:16:26 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Tue, 24 Sep 2002 01:16:26 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.24,1.25
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15673
Modified Files:
README.txt
Log Message:
note about unheader.py
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** README.txt 24 Sep 2002 00:37:32 -0000 1.24
--- README.txt 24 Sep 2002 08:16:24 -0000 1.25
***************
*** 129,132 ****
--- 129,134 ----
A script to remove unwanted headers from an mbox file. This is mostly
useful to delete headers which incorrectly might bias the results.
+ In default mode, this is similar to 'spamassassin -d', but much, much
+ faster.
loosecksum.py
From sjoerd@users.sourceforge.net Tue Sep 24 12:43:09 2002
From: sjoerd@users.sourceforge.net (Sjoerd Mullender)
Date: Tue, 24 Sep 2002 04:43:09 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.12,1.13
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11535
Modified Files:
cmp.py
Log Message:
Protect against a mean of 0.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** cmp.py 23 Sep 2002 21:46:34 -0000 1.12
--- cmp.py 24 Sep 2002 11:43:06 -0000 1.13
***************
*** 82,92 ****
mean1,dev1 = m1
mean2,dev2 = m2
! mp = (mean2 - mean1) * 100.0 / mean1
! dp = (dev2 - dev1) * 100.0 / dev1
!
! return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % (
! mean1,mean2,mp,
! dev1,dev2,dp
! )
def dump(p1s, p2s):
--- 82,98 ----
mean1,dev1 = m1
mean2,dev2 = m2
! t = "%7.2f %7.2f " % (mean1, mean2)
! if mean1:
! mp = (mean2 - mean1) * 100.0 / mean1
! t += "%+7.2f%%" % mp
! else:
! t += "+(was 0)"
! t += " %7.2f %7.2f " % (dev1, dev2)
! if dev1:
! dp = (dev2 - dev1) * 100.0 / dev1
! t += "%+7.2f%%" % dp
! else:
! t += "+(was 0)"
! return t
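The guard added above protects the percent-change columns against a zero baseline. The same idea as a standalone helper (a hypothetical name, not part of cmp.py):

```python
def pct_change(old, new):
    # Percent change formatted like cmp.py's columns; a zero baseline
    # would divide by zero, so report it specially instead.
    if old:
        return "%+7.2f%%" % ((new - old) * 100.0 / old)
    return "+(was 0)"
```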
def dump(p1s, p2s):
***************
*** 134,138 ****
print
! print "ham mean ham sdev"
dumpdev(hamdev1,hamdev2)
print
--- 140,144 ----
print
! print "ham mean ham sdev"
dumpdev(hamdev1,hamdev2)
print
***************
*** 141,145 ****
print
! print "spam mean spam sdev"
dumpdev(spamdev1,spamdev2)
print
--- 147,151 ----
print
! print "spam mean spam sdev"
dumpdev(spamdev1,spamdev2)
print
From bkc@users.sourceforge.net Tue Sep 24 15:38:14 2002
From: bkc@users.sourceforge.net (Brad Clements)
Date: Tue, 24 Sep 2002 07:38:14 -0700
Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6641
Modified Files:
HistToGNU.py
Log Message:
Fix wrong __doc__ for usage
Index: HistToGNU.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** HistToGNU.py 24 Sep 2002 00:39:06 -0000 1.2
--- HistToGNU.py 24 Sep 2002 14:38:10 -0000 1.3
***************
*** 5,11 ****
Convert saved binary pickle of histograms to gnu plot output
! """
!
! """Usage: %(program)s [options] [histogrampicklefile ...]
reads pickle filename from options if not specified
--- 5,9 ----
Convert saved binary pickle of histograms to gnu plot output
! Usage: %(program)s [options] [histogrampicklefile ...]
reads pickle filename from options if not specified
***************
*** 57,64 ****
args = []
for file in files:
! args.append("""'-' %s title "%s" """ % (dataSetOptions,file))
cmd.write('plot %s\n' % ",".join(args))
for file in files:
! outputHist(loadHist(file),cmd)
cmd.write('e\n')
--- 55,62 ----
args = []
for file in files:
! args.append("""'-' %s title "%s" """ % (dataSetOptions, file))
cmd.write('plot %s\n' % ",".join(args))
for file in files:
! outputHist(loadHist(file), cmd)
cmd.write('e\n')
From tim.one@comcast.net Tue Sep 24 18:00:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 24 Sep 2002 13:00:01 -0400
Subject: [Spambayes-checkins] spambayes Options.py,1.28,1.29
timcv.py,1.8,1.9 timtest.py,1.28,1.29
In-Reply-To:
Message-ID:
[Anthony Baxter]
> ...
> Log Message:
> Made the Data/Ham/SetN and Data/Spam/SetN things options that can be
> over-ridden. Don't see why the rest of us should things this way
> just because Tim thinks it's the correct way to do things
>
> More importantly, means you can do test runs with different corpuses
> (corpuscles? corpi? corpen?) at the same time.
It's a good change -- thanks. Before this, I simply renamed my directories.
Don't think that I haven't noticed you're complaining elsewhere that you
can't run even one test at a time.
From montanaro@users.sourceforge.net Tue Sep 24 19:00:00 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 24 Sep 2002 11:00:00 -0700
Subject: [Spambayes-checkins] spambayes unheader.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25992
Modified Files:
unheader.py
Log Message:
guarantee at least an empty string for the subject
Index: unheader.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/unheader.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** unheader.py 23 Sep 2002 21:20:10 -0000 1.3
--- unheader.py 24 Sep 2002 17:59:58 -0000 1.4
***************
*** 38,42 ****
msg['Content-Transfer-Encoding'] = pcte
! subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', msg['Subject'])
if subj != msg["Subject"]:
msg.replace_header("Subject", subj)
--- 38,43 ----
msg['Content-Transfer-Encoding'] = pcte
! subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '',
! msg['Subject'] or "")
if subj != msg["Subject"]:
msg.replace_header("Subject", subj)
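The fix above matters because indexing a message for a missing header returns None, and re.sub raises TypeError on a None input. A minimal standalone illustration (the function name is made up):

```python
import re

def strip_spam_tag(subject):
    # A missing Subject header comes back as None; fall back to ""
    # so re.sub always gets a string to work on.
    return re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', subject or "")
```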
From montanaro@users.sourceforge.net Tue Sep 24 19:07:19 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 24 Sep 2002 11:07:19 -0700
Subject: [Spambayes-checkins] spambayes setup.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29473
Modified Files:
setup.py
Log Message:
add a bunch more modules and scripts
Index: setup.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/setup.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** setup.py 23 Sep 2002 21:20:10 -0000 1.4
--- setup.py 24 Sep 2002 18:07:17 -0000 1.5
***************
*** 2,7 ****
setup(
! name='spambayes',
! scripts=['unheader.py', 'hammie.py'],
! py_modules=['classifier', 'tokenizer']
)
--- 2,21 ----
setup(
! name='spambayes',
! scripts=['unheader.py',
! 'hammie.py',
! 'loosecksum.py',
! 'timtest.py',
! 'timcv.py',
! 'splitndirs.py',
! 'runtest.sh',
! 'rebal.py',
! 'cmp.py',
! 'rates.py'],
! py_modules=['classifier',
! 'tokenizer',
! 'Options',
! 'Tester',
! 'TestDriver',
! 'mboxutils']
)
From gvanrossum@users.sourceforge.net Tue Sep 24 19:26:13 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Tue, 24 Sep 2002 11:26:13 -0700
Subject: [Spambayes-checkins] spambayes splitndirs.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5034
Modified Files:
splitndirs.py
Log Message:
Add -g option to glob each input path. This is handy on Windows.
Patch contributed by Alexander Leidinger.
Index: splitndirs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** splitndirs.py 23 Sep 2002 21:20:10 -0000 1.4
--- splitndirs.py 24 Sep 2002 18:26:11 -0000 1.5
***************
*** 3,7 ****
"""Split an mbox into N random directories of files.
! Usage: %(program)s [-h] [-s seed] [-v] -n N sourcembox ... outdirbase
Options:
--- 3,7 ----
"""Split an mbox into N random directories of files.
! Usage: %(program)s [-h] [-g] [-s seed] [-v] -n N sourcembox ... outdirbase
Options:
***************
*** 9,12 ****
--- 9,17 ----
Print this help message and exit
+ -g
+ Do globbing on each sourcepath. This is helpful on Windows, where
+ the native shells don't glob, or when you have more mboxes than
+ your shell allows you to specify on the commandline.
+
-s seed
Seed the random number generator with seed (an integer).
***************
*** 22,26 ****
Arguments:
sourcembox
! The mbox to split.
outdirbase
--- 27,31 ----
Arguments:
sourcembox
! The mbox or path to an mbox to split.
outdirbase
***************
*** 46,49 ****
--- 51,55 ----
import email
import getopt
+ import glob
import mboxutils
***************
*** 65,72 ****
def main():
try:
! opts, args = getopt.getopt(sys.argv[1:], 'hn:s:v', ['help'])
except getopt.error, msg:
usage(1, msg)
n = None
verbose = False
--- 71,79 ----
def main():
try:
! opts, args = getopt.getopt(sys.argv[1:], 'hgn:s:v', ['help'])
except getopt.error, msg:
usage(1, msg)
+ doglob = False
n = None
verbose = False
***************
*** 74,77 ****
--- 81,86 ----
if opt in ('-h', '--help'):
usage(0)
+ elif opt == '-g':
+ doglob = True
elif opt == '-s':
random.seed(int(arg))
***************
*** 95,111 ****
counter = 0
for inputpath in inputpaths:
! mbox = mboxutils.getmbox(inputpath)
! for msg in mbox:
! i = random.randrange(n)
! astext = str(msg)
! #assert astext.endswith('\n')
! counter += 1
! msgfile = open('%s/%d' % (outdirs[i], counter), 'wb')
! msgfile.write(astext)
! msgfile.close()
! if verbose:
! if counter % 100 == 0:
! sys.stdout.write('.')
! sys.stdout.flush()
if verbose:
--- 104,126 ----
counter = 0
for inputpath in inputpaths:
! if doglob:
! inpaths = glob.glob(inputpath)
! else:
! inpaths = [inputpath]
!
! for inpath in inpaths:
! mbox = mboxutils.getmbox(inpath)
! for msg in mbox:
! i = random.randrange(n)
! astext = str(msg)
! #assert astext.endswith('\n')
! counter += 1
! msgfile = open('%s/%d' % (outdirs[i], counter), 'wb')
! msgfile.write(astext)
! msgfile.close()
! if verbose:
! if counter % 100 == 0:
! sys.stdout.write('.')
! sys.stdout.flush()
if verbose:
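The -g branch above reduces to a small expansion step. A sketch of just that step (the helper name is illustrative):

```python
import glob

def expand_paths(paths, doglob):
    # With -g, each source path is treated as a glob pattern; without
    # it, paths pass through untouched.
    out = []
    for p in paths:
        out.extend(glob.glob(p) if doglob else [p])
    return out
```

Note that glob.glob returns an empty list for a pattern matching no files, so with -g a mistyped pattern silently contributes nothing rather than raising an error.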
From anthony@interlink.com.au Tue Sep 24 23:00:27 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Wed, 25 Sep 2002 08:00:27 +1000
Subject: [Spambayes-checkins] spambayes Options.py,1.28,1.29
timcv.py,1.8,1.9 timtest.py,1.28,1.29
In-Reply-To:
Message-ID: <200209242200.g8OM0RV19871@localhost.localdomain>
>>> Tim Peters wrote
> It's a good change -- thanks. Before this, I simply renamed my directories.
> Don't think that I haven't noticed you're complaining elsewhere that you
> can't run even one test at a time.
Ha! Since when has consistency been an issue?
I'm actually doing tests with my smaller corpus of my personal spam+ham,
trying out the different sized spam:ham ratios.
Anthony
--
Anthony Baxter
It's never too late to have a happy childhood.
From tim_one@users.sourceforge.net Tue Sep 24 23:13:21 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 24 Sep 2002 15:13:21 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.9,1.10
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15036
Modified Files:
TestDriver.py
Log Message:
Changed the first histogram line so it fits in 79 columns.
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** TestDriver.py 23 Sep 2002 22:41:04 -0000 1.9
--- TestDriver.py 24 Sep 2002 22:13:19 -0000 1.10
***************
*** 90,98 ****
def printhist(tag, ham, spam):
print
! print "-> Ham distribution for", tag,
ham.display()
print
! print "-> Spam distribution for", tag,
spam.display()
--- 90,98 ----
def printhist(tag, ham, spam):
print
! print "-> Ham scores for", tag,
ham.display()
print
! print "-> Spam scores for", tag,
spam.display()
From tim_one@users.sourceforge.net Tue Sep 24 23:14:04 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 24 Sep 2002 15:14:04 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.19,1.20
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15524
Modified Files:
classifier.py
Log Message:
central_limit_compute_population_stats2(): Squashed code duplication.
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** classifier.py 24 Sep 2002 03:29:48 -0000 1.19
--- classifier.py 24 Sep 2002 22:14:01 -0000 1.20
***************
*** 483,486 ****
--- 483,488 ----
# XXX More stuff should be reworked to use this as a helper function.
def _getclues(self, wordstream):
+ mindist = options.robinson_minimum_prob_strength
+
# A priority queue to remember the MAX_DISCRIMINATORS best
# probabilities, where "best" means largest distance from 0.5.
***************
*** 500,504 ****
distance = abs(prob - 0.5)
! if distance > smallest_best:
heapreplace(nbest, (distance, prob, word, record))
smallest_best = nbest[0][0]
--- 502,506 ----
distance = abs(prob - 0.5)
! if distance >= mindist and distance > smallest_best:
heapreplace(nbest, (distance, prob, word, record))
smallest_best = nbest[0][0]
***************
*** 764,782 ****
sum += prob
sumsq += prob * prob
n = len(seen)
if is_spam:
self.spamn, self.spamsum, self.spamsumsq = n, sum, sumsq
! spamsum = self.spamsum
! self.spammean = ldexp(spamsum, -64) / self.spamn
! spamvar = self.spamsumsq * self.spamn - spamsum**2
! self.spamvar = ldexp(spamvar, -128) / (self.spamn ** 2)
print 'spammean', self.spammean, 'spamvar', self.spamvar
else:
self.hamn, self.hamsum, self.hamsumsq = n, sum, sumsq
! hamsum = self.hamsum
! self.hammean = ldexp(hamsum, -64) / self.hamn
! hamvar = self.hamsumsq * self.hamn - hamsum**2
! self.hamvar = ldexp(hamvar, -128) / (self.hamn ** 2)
print 'hammean', self.hammean, 'hamvar', self.hamvar
--- 766,782 ----
sum += prob
sumsq += prob * prob
+
n = len(seen)
+ mean = ldexp(sum, -64) / n
+ var = sumsq * n - sum**2
+ var = ldexp(var, -128) / n**2
if is_spam:
self.spamn, self.spamsum, self.spamsumsq = n, sum, sumsq
! self.spammean, self.spamvar = mean, var
print 'spammean', self.spammean, 'spamvar', self.spamvar
else:
self.hamn, self.hamsum, self.hamsumsq = n, sum, sumsq
! self.hammean, self.hamvar = mean, var
print 'hammean', self.hammean, 'hamvar', self.hamvar
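The arithmetic being de-duplicated here is a fixed-point trick: each log-probability is scaled by 2**64 and accumulated as an exact arbitrary-precision integer, so the sum of squares loses no precision. A self-contained sketch of the same computation (the function name is made up; the checkin used Python 2's long where a plain int now suffices):

```python
from math import ldexp, log

def scaled_stats(probs):
    # Accumulate log-probabilities as 64-bit fixed-point integers:
    # int(ldexp(x, 64)) is x * 2**64 truncated to an exact integer.
    total = totalsq = 0
    for p in probs:
        q = int(ldexp(log(p), 64))
        total += q
        totalsq += q * q
    n = len(probs)
    # Undo the scaling: sums carry a 2**64 factor, squared sums 2**128.
    mean = ldexp(total, -64) / n
    var = ldexp(totalsq * n - total * total, -128) / (n * n)
    return mean, var
```

The var expression is the population variance E[x**2] - E[x]**2, matching the spamvar/hamvar formulas in the diff.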
From gvanrossum@users.sourceforge.net Wed Sep 25 02:01:51 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Tue, 24 Sep 2002 18:01:51 -0700
Subject: [Spambayes-checkins] spambayes fpfn.py,NONE,1.1 README.txt,1.25,1.26
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30699
Modified Files:
README.txt
Added Files:
fpfn.py
Log Message:
Add a tiny utility to extract the filenames of false positives/negatives
from the full test run output. (Tested with timcv.py output only.)
--- NEW FILE: fpfn.py ---
#! /usr/bin/env python
"""Extract false positive and false negative filenames from timcv.py output."""
import sys
import re
def cmpf(a, b):
# Sort function that sorts by numerical value
ma = re.search(r'(\d+)/(\d+)$', a)
mb = re.search(r'(\d+)/(\d+)$', b)
if ma and mb:
xa, ya = map(int, ma.groups())
xb, yb = map(int, mb.groups())
return cmp((xa, ya), (xb, yb))
else:
return cmp(a, b)
def main():
for name in sys.argv[1:]:
try:
f = open(name + ".txt")
except IOError:
f = open(name)
print "===", name, "==="
fp = []
fn = []
for line in f:
if line.startswith(' new fp: '):
fp.extend(eval(line[12:]))
elif line.startswith(' new fn: '):
fn.extend(eval(line[12:]))
fp.sort(cmpf)
fn.sort(cmpf)
print "--- fp ---"
for x in fp:
print x
print "--- fn ---"
for x in fn:
print x
if __name__ == '__main__':
main()
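fpfn.py's cmpf() orders paths by their two trailing numbers. cmp-style comparators are a Python 2 idiom; a key-function sketch of the same idea (the name numeric_key is hypothetical, not in the file):

```python
import re

def numeric_key(path):
    # Sort paths ending in "<set>/<msgnum>" by their numeric values;
    # anything else sorts after them, lexicographically.
    m = re.search(r'(\d+)/(\d+)$', path)
    if m:
        return (0,) + tuple(map(int, m.groups()))
    return (1, path)
```

This makes Data/2/9 sort before Data/2/10, which a plain string sort gets backwards.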
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** README.txt 24 Sep 2002 08:16:24 -0000 1.25
--- README.txt 25 Sep 2002 01:01:49 -0000 1.26
***************
*** 119,122 ****
--- 119,126 ----
and the change in average f-p and f-n rates.
+ fpfn.py
+ Given one or more TestDriver output files, prints list of false
+ positive and false negative filenames, one per line.
+
Test Data Utilities
From tim.one@comcast.net Wed Sep 25 02:21:07 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 24 Sep 2002 21:21:07 -0400
Subject: [Spambayes-checkins] spambayes fpfn.py,NONE,1.1
README.txt,1.25,1.26
In-Reply-To:
Message-ID:
[Guido]
> Modified Files:
> README.txt
> Added Files:
> fpfn.py
> Log Message:
> Add a tiny utility to extract the filenames of false positives/negatives
> from the full test run output. (Tested with timcv.py output only.)
The good news is that timcv doesn't print anything, except to dump out all
the options in effect at the start. All the printing is done by the
TestDriver module, and all the test drivers (timcv, timtest, mboxtest) use
that. So you've solved this problem for all of them! There's much method
behind all the seeming madness here.
From gward@users.sourceforge.net Wed Sep 25 03:02:43 2002
From: gward@users.sourceforge.net (Greg Ward)
Date: Tue, 24 Sep 2002 19:02:43 -0700
Subject: [Spambayes-checkins] spambayes unheader.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19599
Modified Files:
unheader.py
Log Message:
Fix deSA() so it doesn't discard the first line of the body.
Change process_mailbox() to use email.Generator directly, in order
to disable header-wrapping and preserve headers as much as possible.
Index: unheader.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/unheader.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** unheader.py 24 Sep 2002 17:59:58 -0000 1.4
--- unheader.py 25 Sep 2002 02:02:41 -0000 1.5
***************
*** 6,9 ****
--- 6,10 ----
import email.Parser
import email.Message
+ import email.Generator
import getopt
***************
*** 51,60 ****
elif at_start:
at_start = 0
! else:
! newbody.append(line)
msg.set_payload("\n".join(newbody))
unheader(msg, "X-Spam-")
def process_mailbox(f, dosa=1, pats=None):
for msg in mailbox.PortableUnixMailbox(f, Parser().parse):
if pats is not None:
--- 52,61 ----
elif at_start:
at_start = 0
! newbody.append(line)
msg.set_payload("\n".join(newbody))
unheader(msg, "X-Spam-")
def process_mailbox(f, dosa=1, pats=None):
+ gen = email.Generator.Generator(sys.stdout, maxheaderlen=0)
for msg in mailbox.PortableUnixMailbox(f, Parser().parse):
if pats is not None:
***************
*** 62,66 ****
if dosa:
deSA(msg)
! print msg
def usage():
--- 63,67 ----
if dosa:
deSA(msg)
! gen(msg, unixfrom=1)
def usage():
From anthonybaxter@users.sourceforge.net Wed Sep 25 03:06:54 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Tue, 24 Sep 2002 19:06:54 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.26,1.27
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20718
Modified Files:
README.txt
Log Message:
document BAYESCUSTOMIZE
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** README.txt 25 Sep 2002 01:01:49 -0000 1.26
--- README.txt 25 Sep 2002 02:06:52 -0000 1.27
***************
*** 41,44 ****
--- 41,49 ----
near the start, and consult attributes of options.
+ As an alternative to bayescustomize.ini, you can set the environment
+ variable BAYESCUSTOMIZE to a list of one or more .ini files, these will
+ be read in, in order, and applied to the options. This allows you to
+ tweak individual runs by combining fragments of .ini files.
+
classifier.py
An implementation of a Graham-like classifier.
From gward@users.sourceforge.net Wed Sep 25 03:09:00 2002
From: gward@users.sourceforge.net (Greg Ward)
Date: Tue, 24 Sep 2002 19:09:00 -0700
Subject: [Spambayes-checkins] spambayes unheader.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv21343
Modified Files:
unheader.py
Log Message:
Make Parser a HeaderParser subclass, so get_payload() returns the raw
message body. Necessary because deSA() assumes get_payload() always
returns a string, which isn't so if the message has MIME structure.
Index: unheader.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/unheader.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** unheader.py 25 Sep 2002 02:02:41 -0000 1.5
--- unheader.py 25 Sep 2002 02:08:58 -0000 1.6
***************
*** 24,28 ****
self._headers[i] = (k, newval)
! class Parser(email.Parser.Parser):
def __init__(self):
email.Parser.Parser.__init__(self, Message)
--- 24,28 ----
self._headers[i] = (k, newval)
! class Parser(email.Parser.HeaderParser):
def __init__(self):
email.Parser.Parser.__init__(self, Message)
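The point of the subclass switch above can be seen in a minimal sketch (using the modern `email.parser` module names; the checkin itself targets the bundled Python 2-era `email.Parser`): `HeaderParser` leaves the body unparsed, so `get_payload()` stays a raw string even for a MIME message, while the full `Parser` returns a list of sub-messages.

```python
from email.parser import HeaderParser, Parser

RAW = ("Subject: test\n"
       "Content-Type: multipart/mixed; boundary=X\n"
       "\n"
       "--X\n\nhello\n--X--\n")

# HeaderParser never descends into the body, so get_payload() is the
# raw body string even though the message declares MIME structure;
# the full Parser hands back a list of sub-messages instead.
header_only = HeaderParser().parsestr(RAW)
full = Parser().parsestr(RAW)
```

This is why code like deSA(), which rewrites the body as one string, needs the header-only parse.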
From gvanrossum@users.sourceforge.net Wed Sep 25 03:09:54 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Tue, 24 Sep 2002 19:09:54 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.27,1.28
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv21573
Modified Files:
README.txt
Log Message:
Clarify how to make BAYESCUSTOMIZE into a list (the delimiter is
whitespace).
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** README.txt 25 Sep 2002 02:06:52 -0000 1.27
--- README.txt 25 Sep 2002 02:09:52 -0000 1.28
***************
*** 41,48 ****
near the start, and consult attributes of options.
! As an alternative to bayescustomize.ini, you can set the environment
! variable BAYESCUSTOMIZE to a list of one or more .ini files, these will
! be read in, in order, and applied to the options. This allows you to
! tweak individual runs by combining fragments of .ini files.
classifier.py
--- 41,49 ----
near the start, and consult attributes of options.
! As an alternative to bayescustomize.ini, you can set the
! environment variable BAYESCUSTOMIZE to a whitespace-separated list
! of one or more .ini files, these will be read in, in order, and
! applied to the options. This allows you to tweak individual runs
! by combining fragments of .ini files.
classifier.py
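The BAYESCUSTOMIZE behavior documented above can be sketched as follows (a hypothetical helper, not the actual Options.py code; the real module goes on to feed each file to its ConfigParser in order):

```python
import os

def customization_files():
    """Return the list of .ini fragments to apply, in order.

    Mirrors the README text: BAYESCUSTOMIZE is a whitespace-separated
    list of .ini file names; if it's unset, fall back to the single
    default bayescustomize.ini.  (Sketch only.)
    """
    env = os.environ.get('BAYESCUSTOMIZE')
    if env:
        return env.split()          # whitespace-delimited names
    return ['bayescustomize.ini']
```

Because later files override earlier ones, you can keep one base .ini and tweak individual runs with small fragments.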
From gvanrossum@users.sourceforge.net Wed Sep 25 03:22:18 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Tue, 24 Sep 2002 19:22:18 -0700
Subject: [Spambayes-checkins] spambayes rates.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25639
Modified Files:
rates.py
Log Message:
If basename ends in .txt, strip it off. I kept creating files named
foo.txts.txt because Unix filename completion adds the .txt part...
Index: rates.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rates.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** rates.py 22 Sep 2002 04:19:08 -0000 1.6
--- rates.py 25 Sep 2002 02:22:15 -0000 1.7
***************
*** 35,38 ****
--- 35,40 ----
def doit(basename):
+ if basename.endswith('.txt'):
+ basename = basename[:-4]
try:
ifile = file(basename + '.txt')
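The fix above amounts to normalizing the argument before the '.txt' and 's.txt' suffixes are appended, roughly (hypothetical helper name):

```python
def normalize_basename(basename):
    # Strip a trailing '.txt' so shell filename completion doesn't
    # lead to outputs like foo.txts.txt (sketch of the rates.py fix).
    if basename.endswith('.txt'):
        basename = basename[:-4]
    return basename
```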
From montanaro@users.sourceforge.net Wed Sep 25 03:45:33 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Tue, 24 Sep 2002 19:45:33 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.30,1.31
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32737
Modified Files:
Options.py
Log Message:
change one quoted string from "-quotes to '-quotes to keep emacs-mode happy.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** Options.py 24 Sep 2002 06:13:29 -0000 1.30
--- Options.py 25 Sep 2002 02:45:31 -0000 1.31
***************
*** 122,127 ****
show_false_negatives: False
! # Near the end of Driver.test(), you can get a listing of the "best
! # discriminators" in the words from the training sets. These are the
# words whose WordInfo.killcount values are highest, meaning they most
# often were among the most extreme clues spamprob() found. The number
--- 122,127 ----
show_false_negatives: False
! # Near the end of Driver.test(), you can get a listing of the 'best
! # discriminators' in the words from the training sets. These are the
# words whose WordInfo.killcount values are highest, meaning they most
# often were among the most extreme clues spamprob() found. The number
From tim_one@users.sourceforge.net Wed Sep 25 04:13:12 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 24 Sep 2002 20:13:12 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.10,1.11
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8305
Modified Files:
TestDriver.py
Log Message:
Compute population sdev instead of sample sdev for histogram displays;
it doesn't really matter for the purposes of histograms, and using
pop sdev makes it more consistent with the speculative central-limit
code.
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** TestDriver.py 24 Sep 2002 22:13:19 -0000 1.10
--- TestDriver.py 25 Sep 2002 03:13:09 -0000 1.11
***************
*** 64,76 ****
def display(self, WIDTH=60):
from math import sqrt
! if self.n > 1:
mean = self.sum / self.n
! # sum (x_i - mean)**2 = sum (x_i**2 - 2*x_i*mean + mean**2) =
! # sum x_i**2 - 2*mean*sum x_i + sum mean**2 =
! # sum x_i**2 - 2*mean*mean*n + n*mean**2 =
! # sum x_i**2 - n*mean**2
! samplevar = (self.sumsq - self.n * mean**2) / (self.n - 1)
! print "%d items; mean %.2f; sample sdev %.2f" % (self.n,
! mean, sqrt(samplevar))
biggest = max(self.buckets)
--- 64,71 ----
def display(self, WIDTH=60):
from math import sqrt
! if self.n > 0:
mean = self.sum / self.n
! var = self.sumsq / self.n - mean**2
! print "%d items; mean %.2f; sdev %.2f" % (self.n, mean, sqrt(var))
biggest = max(self.buckets)
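The histogram keeps only running sums, so the new population sdev falls out directly from var = E[x^2] - (E[x])^2; a sample sdev would instead divide the centered sum of squares by n-1. A standalone sketch of the computation (hypothetical function, mirroring the diff):

```python
from math import sqrt

def pop_sdev(n, total, sumsq):
    # Population sdev from the histogram's running sums:
    #   mean = (sum x_i) / n
    #   var  = (sum x_i**2) / n - mean**2
    mean = total / n
    var = sumsq / n - mean ** 2
    # Floating-point rounding can push var slightly negative
    # (a later checkin clamps this in TestDriver.py too).
    return sqrt(max(var, 0.0))
```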
From tim_one@users.sourceforge.net Wed Sep 25 04:16:52 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 24 Sep 2002 20:16:52 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.13,1.14
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9307
Modified Files:
cmp.py
Log Message:
Dang. Changing the histogram output broke pattern-matching code here.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** cmp.py 24 Sep 2002 11:43:06 -0000 1.13
--- cmp.py 25 Sep 2002 03:16:50 -0000 1.14
***************
*** 18,22 ****
# total f-n,
# average f-p rate,
! # average f-n rate)
# from summary file f.
def suck(f):
--- 18,22 ----
# total f-n,
# average f-p rate,
! # average f-n rate,
# from summary file f.
def suck(f):
***************
*** 31,35 ****
if line.startswith('-> tested'):
print line,
! if line.find('sample sdev') != -1:
vals = line.split(';')
mean = float(vals[1].split(' ')[-1])
--- 31,35 ----
if line.startswith('-> tested'):
print line,
! if line.find('; sdev ') != -1:
vals = line.split(';')
mean = float(vals[1].split(' ')[-1])
***************
*** 65,69 ****
fpmean = float(get().split()[-1])
fnmean = float(get().split()[-1])
! return fps, fns, fptot, fntot, fpmean, fnmean, hamdev, spamdev,hamdevall,spamdevall
def tag(p1, p2):
--- 65,70 ----
fpmean = float(get().split()[-1])
fnmean = float(get().split()[-1])
! return (fps, fns, fptot, fntot, fpmean, fnmean,
! hamdev, spamdev, hamdevall, spamdevall)
def tag(p1, p2):
From tim_one@users.sourceforge.net Wed Sep 25 04:26:43 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 24 Sep 2002 20:26:43 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11854
Modified Files:
cmp.py
Log Message:
Repaired more consequences of the pattern-matching stuff.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** cmp.py 25 Sep 2002 03:16:50 -0000 1.14
--- cmp.py 25 Sep 2002 03:26:40 -0000 1.15
***************
*** 19,22 ****
--- 19,27 ----
# average f-p rate,
# average f-n rate,
+ # list of all ham score deviations,
+ # list of all spam score deviations,
+ # ham score deviation for all runs,
+ # spam score deviations for all runs,
+ # )
# from summary file f.
def suck(f):
***************
*** 31,40 ****
if line.startswith('-> tested'):
print line,
! if line.find('; sdev ') != -1:
vals = line.split(';')
! mean = float(vals[1].split(' ')[-1])
! sdev = float(vals[2].split(' ')[-1])
! val = (mean,sdev)
! typ = vals[0].split(' ')[2]
if line.find('for all runs') != -1:
if typ == 'Ham':
--- 36,47 ----
if line.startswith('-> tested'):
print line,
! if line.find(' items; mean ') != -1:
! "-> Ham distribution for this pair: 1000 items; mean 0.05; sample sdev 0.68"
! # and later "sample " went away
vals = line.split(';')
! mean = float(vals[1].split()[-1])
! sdev = float(vals[2].split()[-1])
! val = (mean, sdev)
! typ = vals[0].split()[2]
if line.find('for all runs') != -1:
if typ == 'Ham':
From tim_one@users.sourceforge.net Wed Sep 25 04:29:03 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 24 Sep 2002 20:29:03 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.15,1.16
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12543
Modified Files:
cmp.py
Log Message:
Split long lines, added commas.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** cmp.py 25 Sep 2002 03:26:40 -0000 1.15
--- cmp.py 25 Sep 2002 03:29:01 -0000 1.16
***************
*** 37,41 ****
print line,
if line.find(' items; mean ') != -1:
! "-> Ham distribution for this pair: 1000 items; mean 0.05; sample sdev 0.68"
# and later "sample " went away
vals = line.split(';')
--- 37,41 ----
print line,
if line.find(' items; mean ') != -1:
! # -> Ham distribution for this pair: 1000 items; mean 0.05; sample sdev 0.68
# and later "sample " went away
vals = line.split(';')
***************
*** 132,137 ****
f2n = windowsfy(f2n)
! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1,hamdev1,spamdev1,hamdevall1,spamdevall1 = suck(file(f1n))
! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2,hamdev2,spamdev2,hamdevall2,spamdevall2 = suck(file(f2n))
print
--- 132,140 ----
f2n = windowsfy(f2n)
! (fp1, fn1, fptot1, fntot1, fpmean1, fnmean1,
! hamdev1, spamdev1, hamdevall1, spamdevall1) = suck(file(f1n))
!
! (fp2, fn2, fptot2, fntot2, fpmean2, fnmean2,
! hamdev2, spamdev2, hamdevall2, spamdevall2) = suck(file(f2n))
print
***************
*** 163,165 ****
diff1 = spamdevall1[0] - hamdevall1[0]
diff2 = spamdevall2[0] - hamdevall2[0]
! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,diff2,(diff2-diff1))
--- 166,170 ----
diff1 = spamdevall1[0] - hamdevall1[0]
diff2 = spamdevall2[0] - hamdevall2[0]
! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,
! diff2,
! diff2 - diff1)
From tim_one@users.sourceforge.net Wed Sep 25 06:22:49 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Tue, 24 Sep 2002 22:22:49 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.31,1.32 TestDriver.py,1.11,1.12
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12473
Modified Files:
Options.py TestDriver.py
Log Message:
New option compute_best_cutoffs_from_histograms, enabled by default.
This automates analyzing histograms to find "the best" spam_cutoff.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.31
retrieving revision 1.32
diff -C2 -d -r1.31 -r1.32
*** Options.py 25 Sep 2002 02:45:31 -0000 1.31
--- Options.py 25 Sep 2002 05:22:47 -0000 1.32
***************
*** 111,114 ****
--- 111,120 ----
show_histograms: True
+ # When compute_best_cutoffs_from_histograms is enabled, after the display
+ # of a ham+spam histogram pair, a listing is given of all the cutoff scores
+ # (coinciding with a histogram boundary) that minimize the total number of
+ # misclassified messages (false positives + false negatives).
+ compute_best_cutoffs_from_histograms: True
+
# Display spam when
# show_spam_lo <= spamprob <= show_spam_hi
***************
*** 151,155 ****
save_histogram_pickles: False
! # default locations for timcv and timtest - these get the set number
# interpolated.
spam_directories: Data/Spam/Set%d
--- 157,161 ----
save_histogram_pickles: False
! # default locations for timcv and timtest - these get the set number
# interpolated.
spam_directories: Data/Spam/Set%d
***************
*** 247,250 ****
--- 253,257 ----
'spam_directories': string_cracker,
'ham_directories': string_cracker,
+ 'compute_best_cutoffs_from_histograms': boolean_cracker,
},
'Classifier': {'hambias': float_cracker,
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** TestDriver.py 25 Sep 2002 03:13:09 -0000 1.11
--- TestDriver.py 25 Sep 2002 05:22:47 -0000 1.12
***************
*** 92,95 ****
--- 92,130 ----
spam.display()
+ if not options.compute_best_cutoffs_from_histograms:
+ return
+
+ # Figure out "the best" spam cutoff point, meaning the one that minimizes
+ # the total number of misclassified msgs (other definitions are
+ # certainly possible!).
+
+ # At cutoff 0, everything is called spam, so there are no false negatives,
+ # and every ham is a false positive.
+ assert ham.nbuckets == spam.nbuckets
+ fp = ham.n
+ fn = 0
+ best_total = fp
+ bests = [(0, fp, fn)]
+ for i in range(ham.nbuckets):
+ # When moving the cutoff beyond bucket i, the ham in bucket i
+ # are redeemed, and the spam in bucket i become false negatives.
+ fp -= ham.buckets[i]
+ fn += spam.buckets[i]
+ if fp + fn <= best_total:
+ if fp + fn < best_total:
+ best_total = fp + fn
+ bests = []
+ bests.append((i+1, fp, fn))
+ assert fp == 0
+ assert fn == spam.n
+
+ i, fp, fn = bests.pop(0)
+ print '-> best cutoff for', tag, float(i) / ham.nbuckets
+ print '-> with', fp, 'fp', '+', fn, 'fn =', best_total, 'mistakes'
+ for i, fp, fn in bests:
+ print '-> matched at %g (%d fp + %d fn)' % (
+ float(i) / ham.nbuckets, fp, fn)
+
+
def printmsg(msg, prob, clues):
print msg.tag
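The cutoff scan added above is a single pass over paired histograms: start with the cutoff at 0 (every ham a false positive, no false negatives), then move it past one bucket at a time, redeeming that bucket's ham and condemning its spam. A self-contained sketch of the same logic (hypothetical function operating on bare bucket lists):

```python
def best_cutoffs(ham_buckets, spam_buckets):
    """Return (best_total, bests): the minimum fp + fn over all
    bucket-boundary cutoffs, and every (boundary, fp, fn) achieving it.
    Sketch of the TestDriver.py scan; assumes equal-length lists."""
    assert len(ham_buckets) == len(spam_buckets)
    fp = sum(ham_buckets)   # cutoff 0: everything is called spam
    fn = 0
    best_total = fp
    bests = [(0, fp, fn)]
    for i, (h, s) in enumerate(zip(ham_buckets, spam_buckets)):
        fp -= h             # ham in bucket i are redeemed
        fn += s             # spam in bucket i become false negatives
        if fp + fn <= best_total:
            if fp + fn < best_total:
                best_total = fp + fn
                bests = []
            bests.append((i + 1, fp, fn))
    return best_total, bests
```

Dividing a winning boundary index by the bucket count converts it back to a spam_cutoff score, as the driver's printout does.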
From gvanrossum@users.sourceforge.net Wed Sep 25 17:24:29 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Wed, 25 Sep 2002 09:24:29 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.33,1.34
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26958
Modified Files:
tokenizer.py
Log Message:
get_charsets() can return a charset that is a triple of the form
(encoding, language, data). Extract the data, assuming the encoding
is an ASCII superset and the data (a charset name) is in fact just
ascii characters. (The only occurrence in real life of this I've seen
uses an encoding name "ansi-x3-4-1968", which is an obscure name for
ASCII that Python's codecs collection doesn't seem to support.)
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** tokenizer.py 23 Sep 2002 14:38:41 -0000 1.33
--- tokenizer.py 25 Sep 2002 16:24:26 -0000 1.34
***************
*** 724,727 ****
--- 724,730 ----
for x in msg.get_charsets(None):
if x is not None:
+ if isinstance(x, tuple):
+ assert len(x) == 3
+ x = x[2]
yield 'charset:' + x.lower()
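Isolated from the tokenizer, the tuple handling above looks like this (hypothetical generator over the values `get_charsets()` can yield):

```python
def charset_tokens(charsets):
    # get_charsets() may yield a plain charset string or an
    # (encoding, language, data) triple; in the latter case the
    # charset name sits in the data slot (sketch of the fix).
    for x in charsets:
        if x is None:
            continue
        if isinstance(x, tuple):
            assert len(x) == 3
            x = x[2]
        yield 'charset:' + x.lower()
```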
From gward@users.sourceforge.net Wed Sep 25 18:56:12 2002
From: gward@users.sourceforge.net (Greg Ward)
Date: Wed, 25 Sep 2002 10:56:12 -0700
Subject: [Spambayes-checkins] spambayes unheader.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv24962a
Modified Files:
unheader.py
Log Message:
Add Maildir support:
* add -d option
* rearrange main() accordingly (NB. I removed the ability to read
an mbox folder from stdin, since it didn't actually work and
made main() more complicated)
* add process_maildir()
* factor process_message() out of process_mailbox()
Index: unheader.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/unheader.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** unheader.py 25 Sep 2002 02:08:58 -0000 1.6
--- unheader.py 25 Sep 2002 17:56:09 -0000 1.7
***************
*** 3,6 ****
--- 3,8 ----
import re
import sys
+ import os
+ import glob
import mailbox
import email.Parser
***************
*** 56,79 ****
unheader(msg, "X-Spam-")
def process_mailbox(f, dosa=1, pats=None):
gen = email.Generator.Generator(sys.stdout, maxheaderlen=0)
for msg in mailbox.PortableUnixMailbox(f, Parser().parse):
! if pats is not None:
! unheader(msg, pats)
! if dosa:
! deSA(msg)
gen(msg, unixfrom=1)
def usage():
! print >> sys.stderr, "usage: unheader.py [ -p pat ... ] [ -s ]"
print >> sys.stderr, "-p pat gives a regex pattern used to eliminate unwanted headers"
print >> sys.stderr, "'-p pat' may be given multiple times"
print >> sys.stderr, "-s tells not to remove SpamAssassin headers"
def main(args):
headerpats = []
dosa = 1
try:
! opts, args = getopt.getopt(args, "p:sh")
except getopt.GetoptError:
usage()
--- 58,101 ----
unheader(msg, "X-Spam-")
+ def process_message(msg, dosa, pats):
+ if pats is not None:
+ unheader(msg, pats)
+ if dosa:
+ deSA(msg)
+
def process_mailbox(f, dosa=1, pats=None):
gen = email.Generator.Generator(sys.stdout, maxheaderlen=0)
for msg in mailbox.PortableUnixMailbox(f, Parser().parse):
! process_message(msg, dosa, pats)
gen(msg, unixfrom=1)
+ def process_maildir(d, dosa=1, pats=None):
+ parser = Parser()
+ for fn in glob.glob(os.path.join(d, "cur", "*")):
+ print ("reading from %s..." % fn),
+ file = open(fn)
+ msg = parser.parse(file)
+ process_message(msg, dosa, pats)
+
+ tmpfn = os.path.join(d, "tmp", os.path.basename(fn))
+ tmpfile = open(tmpfn, "w")
+ print "writing to %s" % tmpfn
+ email.Generator.Generator(tmpfile, maxheaderlen=0)(msg, unixfrom=0)
+
+ os.rename(tmpfn, fn)
+
def usage():
! print >> sys.stderr, "usage: unheader.py [ -p pat ... ] [ -s ] folder"
print >> sys.stderr, "-p pat gives a regex pattern used to eliminate unwanted headers"
print >> sys.stderr, "'-p pat' may be given multiple times"
print >> sys.stderr, "-s tells not to remove SpamAssassin headers"
+ print >> sys.stderr, "-d means treat folder as a Maildir"
def main(args):
headerpats = []
dosa = 1
+ ismbox = 1
try:
! opts, args = getopt.getopt(args, "p:shd")
except getopt.GetoptError:
usage()
***************
*** 88,100 ****
elif opt == "-s":
dosa = 0
pats = headerpats and "|".join(headerpats) or None
! if not args:
! f = sys.stdin
! elif len(args) == 1:
! f = file(args[0])
! else:
usage()
sys.exit(1)
! process_mailbox(f, dosa, pats)
if __name__ == "__main__":
--- 110,126 ----
elif opt == "-s":
dosa = 0
+ elif opt == "-d":
+ ismbox = 0
pats = headerpats and "|".join(headerpats) or None
!
! if len(args) != 1:
usage()
sys.exit(1)
!
! if ismbox:
! f = file(args[0])
! process_mailbox(f, dosa, pats)
! else:
! process_maildir(args[0], dosa, pats)
if __name__ == "__main__":
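The Maildir path added above follows the usual Maildir discipline: rewrite each message to tmp/ first, then rename over the original, so a crash never leaves a half-written file in cur/. A sketch under modern Python 3 (hypothetical helper; `transform` mutates a parsed message in place, standing in for the unheader/deSA step):

```python
import os
import glob
import email

def rewrite_maildir_messages(d, transform):
    """Rewrite every message in Maildir `d` in place, safely.

    Mirrors unheader.py's process_maildir: parse each file in cur/,
    apply `transform`, write to tmp/, then rename over the original.
    """
    for fn in glob.glob(os.path.join(d, "cur", "*")):
        with open(fn) as f:
            msg = email.message_from_file(f)
        transform(msg)
        # tmp/ plus os.rename gives an atomic replacement on POSIX.
        tmpfn = os.path.join(d, "tmp", os.path.basename(fn))
        with open(tmpfn, "w") as out:
            out.write(msg.as_string())
        os.rename(tmpfn, fn)
```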
From tim_one@users.sourceforge.net Wed Sep 25 19:39:22 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 25 Sep 2002 11:39:22 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.32,1.33 TestDriver.py,1.12,1.13
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15075
Modified Files:
Options.py TestDriver.py
Log Message:
New option best_cutoff_fp_weight. The histogram analysis code now
finds the buckets that minimize
best_cutoff_fp_weight * (# false positives) + (# false negatives)
By default it's 1 (minimize total # of misclassified msgs). If, e.g.,
you're happy to endure 100 false negatives to save 1 false positive,
set to 100. Don't be surprised if your f-n rate zooms, though!
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** Options.py 25 Sep 2002 05:22:47 -0000 1.32
--- Options.py 25 Sep 2002 18:39:17 -0000 1.33
***************
*** 102,108 ****
# well as 0.90 on Tim's large c.l.py data).
# For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far. Note that you can easily deduce the effect
! # of setting spam_cutoff to any particular value by studying the score
! # histograms -- there's no need to run a test again to see what would happen.
spam_cutoff: 0.90
--- 102,106 ----
# well as 0.90 on Tim's large c.l.py data).
# For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far.
spam_cutoff: 0.90
***************
*** 111,119 ****
show_histograms: True
! # When compute_best_cutoffs_from_histograms is enabled, after the display
! # of a ham+spam histogram pair, a listing is given of all the cutoff scores
! # (coinciding with a histogram boundary) that minimize the total number of
! # misclassified messages (false positives + false negatives).
compute_best_cutoffs_from_histograms: True
# Display spam when
--- 109,127 ----
show_histograms: True
! # After the display of a ham+spam histogram pair, you can get a listing of
! # all the cutoff values (coinciding histogram bucket boundaries) that
! # minimize
! #
! # best_cutoff_fp_weight * (# false positives) + (# false negatives)
! #
! # By default, best_cutoff_fp_weight is 1, and so the cutoffs that miminize
! # the total number of misclassified messages (fp+fn) are shown. If you hate
! # fp more than fn, set the weight to something larger than 1. For example,
! # if you're willing to endure 100 false negatives to save 1 false positive,
! # set it to 100.
! # Note: You may wish to increase nbuckets, to give this scheme more cutoff
! # values to analyze.
compute_best_cutoffs_from_histograms: True
+ best_cutoff_fp_weight: 1
# Display spam when
***************
*** 254,257 ****
--- 262,266 ----
'ham_directories': string_cracker,
'compute_best_cutoffs_from_histograms': boolean_cracker,
+ 'best_cutoff_fp_weight': float_cracker,
},
'Classifier': {'hambias': float_cracker,
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** TestDriver.py 25 Sep 2002 05:22:47 -0000 1.12
--- TestDriver.py 25 Sep 2002 18:39:17 -0000 1.13
***************
*** 102,108 ****
# and every ham is a false positive.
assert ham.nbuckets == spam.nbuckets
fp = ham.n
fn = 0
! best_total = fp
bests = [(0, fp, fn)]
for i in range(ham.nbuckets):
--- 102,109 ----
# and every ham is a false positive.
assert ham.nbuckets == spam.nbuckets
+ fpw = options.best_cutoff_fp_weight
fp = ham.n
fn = 0
! best_total = fpw * fp + fn
bests = [(0, fp, fn)]
for i in range(ham.nbuckets):
***************
*** 111,117 ****
fp -= ham.buckets[i]
fn += spam.buckets[i]
! if fp + fn <= best_total:
! if fp + fn < best_total:
! best_total = fp + fn
bests = []
bests.append((i+1, fp, fn))
--- 112,119 ----
fp -= ham.buckets[i]
fn += spam.buckets[i]
! total = fpw * fp + fn
! if total <= best_total:
! if total < best_total:
! best_total = total
bests = []
bests.append((i+1, fp, fn))
***************
*** 121,128 ****
i, fp, fn = bests.pop(0)
print '-> best cutoff for', tag, float(i) / ham.nbuckets
! print '-> with', fp, 'fp', '+', fn, 'fn =', best_total, 'mistakes'
for i, fp, fn in bests:
! print '-> matched at %g (%d fp + %d fn)' % (
! float(i) / ham.nbuckets, fp, fn)
--- 123,135 ----
i, fp, fn = bests.pop(0)
print '-> best cutoff for', tag, float(i) / ham.nbuckets
! print '-> with weighted total %g*%d fp + %d fn = %g' % (
! fpw, fp, fn, best_total)
! print '-> fp rate %.3g%% fn rate %.3g%%' % (
! fp * 1e2 / ham.n, fn * 1e2 / spam.n)
for i, fp, fn in bests:
! print ('-> matched at %g with %d fp & %d fn; '
! 'fp rate %.3g%%; fn rate %.3g%%' % (
! float(i) / ham.nbuckets, fp, fn,
! fp * 1e2 / ham.n, fn * 1e2 / spam.n))
From gward@users.sourceforge.net Wed Sep 25 21:07:10 2002
From: gward@users.sourceforge.net (Greg Ward)
Date: Wed, 25 Sep 2002 13:07:10 -0700
Subject: [Spambayes-checkins] spambayes msgs.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7639
Modified Files:
msgs.py
Log Message:
Python 2.2 compat.
Index: msgs.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/msgs.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** msgs.py 23 Sep 2002 21:20:10 -0000 1.3
--- msgs.py 25 Sep 2002 20:07:06 -0000 1.4
***************
*** 1,2 ****
--- 1,4 ----
+ from __future__ import generators
+
import os
import random
From tim_one@users.sourceforge.net Thu Sep 26 02:10:32 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 25 Sep 2002 18:10:32 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.13,1.14
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10786
Modified Files:
TestDriver.py
Log Message:
The numerically naive way of computing the sdev for the histogram
display finally went negative on me. This isn't worth fixing right --
just call it 0 when this happens here.
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** TestDriver.py 25 Sep 2002 18:39:17 -0000 1.13
--- TestDriver.py 26 Sep 2002 01:10:29 -0000 1.14
***************
*** 67,70 ****
--- 67,75 ----
mean = self.sum / self.n
var = self.sumsq / self.n - mean**2
+ # The vagaries of f.p. rounding can make var come out negative.
+ # There are ways to fix that, but they're too painful for this
+ # part of the code to endure.
+ if var < 0.0:
+ var = 0.0
print "%d items; mean %.2f; sdev %.2f" % (self.n, mean, sqrt(var))
From tim_one@users.sourceforge.net Thu Sep 26 04:20:53 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Wed, 25 Sep 2002 20:20:53 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.16,1.17
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7354
Modified Files:
cmp.py
Log Message:
Restored ability to analyze older result files (from before the time
ham & spam mean & sdevs were displayed).
Added more commas.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** cmp.py 25 Sep 2002 03:29:01 -0000 1.16
--- cmp.py 26 Sep 2002 03:20:51 -0000 1.17
***************
*** 30,33 ****
--- 30,34 ----
hamdev = []
spamdev = []
+ hamdevall = spamdevall = (0.0, 0.0)
get = f.readline
***************
*** 87,93 ****
return t
! def mtag(m1,m2):
! mean1,dev1 = m1
! mean2,dev2 = m2
t = "%7.2f %7.2f " % (mean1, mean2)
if mean1:
--- 88,94 ----
return t
! def mtag(m1, m2):
! mean1, dev1 = m1
! mean2, dev2 = m2
t = "%7.2f %7.2f " % (mean1, mean2)
if mean1:
***************
*** 115,120 ****
print
! def dumpdev(meandev1,meandev2):
! for m1,m2 in zip(meandev1,meandev2):
print mtag(m1, m2)
--- 116,121 ----
print
! def dumpdev(meandev1, meandev2):
! for m1, m2 in zip(meandev1, meandev2):
print mtag(m1, m2)
***************
*** 151,170 ****
print
! print "ham mean ham sdev"
! dumpdev(hamdev1,hamdev2)
! print
! print "ham mean and sdev for all runs"
! dumpdev([hamdevall1],[hamdevall2])
! print
! print "spam mean spam sdev"
! dumpdev(spamdev1,spamdev2)
! print
! print "spam mean and sdev for all runs"
! dumpdev([spamdevall1],[spamdevall2])
! print
! diff1 = spamdevall1[0] - hamdevall1[0]
! diff2 = spamdevall2[0] - hamdevall2[0]
! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,
! diff2,
! diff2 - diff1)
--- 152,176 ----
print
! if len(hamdev1) == len(hamdev2) and len(spamdev1) == len(spamdev2):
! print "ham mean ham sdev"
! dumpdev(hamdev1, hamdev2)
! print
! print "ham mean and sdev for all runs"
! dumpdev([hamdevall1], [hamdevall2])
!
! print
! print "spam mean spam sdev"
! dumpdev(spamdev1, spamdev2)
! print
! print "spam mean and sdev for all runs"
! dumpdev([spamdevall1], [spamdevall2])
!
! print
! diff1 = spamdevall1[0] - hamdevall1[0]
! diff2 = spamdevall2[0] - hamdevall2[0]
! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,
! diff2,
! diff2 - diff1)
! else:
! print "[info about ham & spam means & sdevs not available in both files]"
From barry@users.sourceforge.net Thu Sep 26 04:22:58 2002
From: barry@users.sourceforge.net (Barry Warsaw)
Date: Wed, 25 Sep 2002 20:22:58 -0700
Subject: [Spambayes-checkins] spambayes/email __init__.py,1.1.1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes/email
In directory usw-pr-cvs1:/tmp/cvs-serv8023
Modified Files:
__init__.py
Log Message:
On Guido's request, backporting mimelib change:
Move the imports of Parser and Message inside the
message_from_string() and message_from_file() functions. This way
just "import email" won't suck in most of the submodules of the
package.
Note: this will break code that relied on "import email" giving you a
bunch of the submodules, but that was never documented and should not
have been relied on.
However, I'm setting __version__ to 2.4a0 since 2.4 has not yet been
released (waiting for closure on a few outstanding issues).
Index: __init__.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/email/__init__.py,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** __init__.py 23 Sep 2002 13:18:55 -0000 1.1.1.1
--- __init__.py 26 Sep 2002 03:22:56 -0000 1.2
***************
*** 5,9 ****
"""
! __version__ = '2.3.1'
__all__ = ['Charset',
--- 5,9 ----
"""
! __version__ = '2.4a0'
__all__ = ['Charset',
***************
*** 29,39 ****
! # Some convenience routines
! from email.Parser import Parser as _Parser
! from email.Message import Message as _Message
!
! def message_from_string(s, _class=_Message, strict=0):
! return _Parser(_class, strict=strict).parsestr(s)
! def message_from_file(fp, _class=_Message, strict=0):
! return _Parser(_class, strict=strict).parse(fp)
--- 29,46 ----
! # Some convenience routines. Don't import Parser and Message as side-effects
! # of importing email since those cascadingly import most of the rest of the
! # email package.
! def message_from_string(s, _class=None, strict=0):
! from email.Parser import Parser
! if _class is None:
! from email.Message import Message
! _class = Message
! return Parser(_class, strict=strict).parsestr(s)
! def message_from_file(fp, _class=None, strict=0):
! from email.Parser import Parser
! if _class is None:
! from email.Message import Message
! _class = Message
! return Parser(_class, strict=strict).parse(fp)
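The shape of the backported change is: defer the imports into the convenience functions so that a bare "import email" no longer drags in most of the package. A sketch against the modern lowercase module names (the checkin itself edits the bundled `email.Parser`/`email.Message`, and the real functions also take a `strict` flag):

```python
def message_from_string(s, _class=None):
    # Import Parser/Message at call time, not at package-import time,
    # so "import email" stays cheap (sketch of the mimelib change).
    from email.parser import Parser
    if _class is None:
        from email.message import Message
        _class = Message
    return Parser(_class).parsestr(s)
```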
From anthonybaxter@users.sourceforge.net Thu Sep 26 09:24:33 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 26 Sep 2002 01:24:33 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.34,1.35
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11559
Modified Files:
tokenizer.py
Log Message:
a number of my remaining false positives were caused by HTML email
with inline stylesheets. These were punished because things like
COLOR: #ffffff are now only seen in badly formatted spams with HTML
in non-HTML MIME parts. Strip out the stylesheets when we strip out
HTML.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** tokenizer.py 25 Sep 2002 16:24:26 -0000 1.34
--- tokenizer.py 26 Sep 2002 08:24:30 -0000 1.35
***************
*** 575,578 ****
--- 575,582 ----
""", re.VERBOSE)
+ # An equally cheap-ass gimmick to strip style sheets
+ stylesheet_re = re.compile(r"",
+ re.IGNORECASE|re.DOTALL)
+
received_host_re = re.compile(r'from (\S+)\s')
received_ip_re = re.compile(r'\s[[(]((\d{1,3}\.?){4})[\])]')
***************
*** 1040,1043 ****
--- 1044,1048 ----
not options.retain_pure_html_tags):
text = html_re.sub(' ', text)
+ text = stylesheet_re.sub(' ', text)
# Tokenize everything in the body.
From anthonybaxter@users.sourceforge.net Thu Sep 26 09:35:08 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Thu, 26 Sep 2002 01:35:08 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.35,1.36
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv14947
Modified Files:
tokenizer.py
Log Message:
*sigh* do them in the right order.
This is why we run the full test before we do Mr. Checkin, isn't
it, Anthony?
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** tokenizer.py 26 Sep 2002 08:24:30 -0000 1.35
--- tokenizer.py 26 Sep 2002 08:35:06 -0000 1.36
***************
*** 1043,1048 ****
if (part.get_content_type() == "text/plain" or
not options.retain_pure_html_tags):
- text = html_re.sub(' ', text)
text = stylesheet_re.sub(' ', text)
# Tokenize everything in the body.
--- 1043,1048 ----
if (part.get_content_type() == "text/plain" or
not options.retain_pure_html_tags):
text = stylesheet_re.sub(' ', text)
+ text = html_re.sub(' ', text)
# Tokenize everything in the body.
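The ordering bug fixed above is easy to reproduce: if tag stripping runs
first, it removes the <style> and </style> tags themselves and leaves the
CSS body behind to be tokenized. A minimal sketch with illustrative
stand-in patterns (the repository's actual html_re/stylesheet_re are more
careful than these):

```python
import re

# Stand-ins for the tokenizer's html_re / stylesheet_re.
html_tag = re.compile(r"<[^\s<>][^>]*>")
stylesheet = re.compile(r"<style.*?</style>", re.DOTALL)

text = '<html><style type="text/css">P { COLOR: #ffffff }</style><p>Hi</p></html>'

# Wrong order: stripping tags first leaves the CSS body behind, so
# tokens like "COLOR:" leak into the token stream and punish the ham.
wrong = stylesheet.sub(' ', html_tag.sub(' ', text))

# Right order: remove whole <style>...</style> blocks first.
right = html_tag.sub(' ', stylesheet.sub(' ', text))
```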
From sjoerd@users.sourceforge.net Thu Sep 26 09:40:22 2002
From: sjoerd@users.sourceforge.net (Sjoerd Mullender)
Date: Thu, 26 Sep 2002 01:40:22 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.36,1.37
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16737
Modified Files:
tokenizer.py
Log Message:
Import email.Message and email.Errors explicitly.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** tokenizer.py 26 Sep 2002 08:35:06 -0000 1.36
--- tokenizer.py 26 Sep 2002 08:40:20 -0000 1.37
***************
*** 5,8 ****
--- 5,10 ----
import email
+ import email.Message
+ import email.Errors
import re
from sets import Set
From sjoerd@users.sourceforge.net Thu Sep 26 09:46:13 2002
From: sjoerd@users.sourceforge.net (Sjoerd Mullender)
Date: Thu, 26 Sep 2002 01:46:13 -0700
Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18312
Modified Files:
HistToGNU.py
Log Message:
Converted \r\n line endings to \n.
Index: HistToGNU.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** HistToGNU.py 24 Sep 2002 14:38:10 -0000 1.3
--- HistToGNU.py 26 Sep 2002 08:46:11 -0000 1.4
***************
*** 1,30 ****
! #! /usr/bin/env python
!
! """HistToGNU.py
!
! Convert saved binary pickle of histograms to gnu plot output
!
! Usage: %(program)s [options] [histogrampicklefile ...]
!
! reads pickle filename from options if not specified
!
! writes to stdout
- """
-
- globalOptions = """
- set grid
- set xtics 5
- set xrange [0.0:100.0]
- """
-
- dataSetOptions="smooth unique"
-
from Options import options
! from TestDriver import Hist
!
! import sys
import cPickle as pickle
!
program = sys.argv[0]
--- 1,30 ----
! #! /usr/bin/env python
!
! """HistToGNU.py
!
! Convert saved binary pickle of histograms to gnu plot output
!
! Usage: %(program)s [options] [histogrampicklefile ...]
!
! reads pickle filename from options if not specified
!
! writes to stdout
!
! """
!
! globalOptions = """
! set grid
! set xtics 5
! set xrange [0.0:100.0]
! """
!
! dataSetOptions="smooth unique"
from Options import options
! from TestDriver import Hist
!
! import sys
import cPickle as pickle
!
program = sys.argv[0]
***************
*** 36,67 ****
print >> sys.stderr, __doc__ % globals()
sys.exit(code)
!
! def loadHist(path):
! """Load the histogram pickle object"""
! return pickle.load(file(path))
!
! def outputHist(hist,f=sys.stdout):
! """Output the Hist object to file f"""
! for i in range(len(hist.buckets)):
! n = hist.buckets[i]
! if n:
! f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n))
!
! def plot(files):
! """given a list of files, create gnu-plot file"""
! import cStringIO, os
! cmd = cStringIO.StringIO()
! cmd.write(globalOptions)
! args = []
! for file in files:
! args.append("""'-' %s title "%s" """ % (dataSetOptions, file))
! cmd.write('plot %s\n' % ",".join(args))
! for file in files:
! outputHist(loadHist(file), cmd)
! cmd.write('e\n')
!
! cmd.write('pause 100\n')
! print cmd.getvalue()
!
def main():
import getopt
--- 36,67 ----
print >> sys.stderr, __doc__ % globals()
sys.exit(code)
!
! def loadHist(path):
! """Load the histogram pickle object"""
! return pickle.load(file(path))
!
! def outputHist(hist,f=sys.stdout):
! """Output the Hist object to file f"""
! for i in range(len(hist.buckets)):
! n = hist.buckets[i]
! if n:
! f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n))
!
! def plot(files):
! """given a list of files, create gnu-plot file"""
! import cStringIO, os
! cmd = cStringIO.StringIO()
! cmd.write(globalOptions)
! args = []
! for file in files:
! args.append("""'-' %s title "%s" """ % (dataSetOptions, file))
! cmd.write('plot %s\n' % ",".join(args))
! for file in files:
! outputHist(loadHist(file), cmd)
! cmd.write('e\n')
!
! cmd.write('pause 100\n')
! print cmd.getvalue()
!
def main():
import getopt
***************
*** 72,87 ****
except getopt.error, msg:
usage(1, msg)
!
! if not args and options.save_histogram_pickles:
! args = []
! for f in ('ham', 'spam'):
! fname = "%s_%shist.pik" % (options.pickle_basename, f)
! args.append(fname)
!
! if args:
! plot(args)
! else:
! print "could not locate any files to plot"
!
! if __name__ == "__main__":
! main()
--- 72,87 ----
except getopt.error, msg:
usage(1, msg)
!
! if not args and options.save_histogram_pickles:
! args = []
! for f in ('ham', 'spam'):
! fname = "%s_%shist.pik" % (options.pickle_basename, f)
! args.append(fname)
!
! if args:
! plot(args)
! else:
! print "could not locate any files to plot"
!
! if __name__ == "__main__":
! main()
From sjoerd@users.sourceforge.net Thu Sep 26 09:47:32 2002
From: sjoerd@users.sourceforge.net (Sjoerd Mullender)
Date: Thu, 26 Sep 2002 01:47:32 -0700
Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18679
Modified Files:
HistToGNU.py
Log Message:
Output all values since if you have a large value and then many 0 values,
the line would just be a gentle slope instead of dropping down sharply.
Index: HistToGNU.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** HistToGNU.py 26 Sep 2002 08:46:11 -0000 1.4
--- HistToGNU.py 26 Sep 2002 08:47:29 -0000 1.5
***************
*** 45,50 ****
for i in range(len(hist.buckets)):
n = hist.buckets[i]
! if n:
! f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n))
def plot(files):
--- 45,49 ----
for i in range(len(hist.buckets)):
n = hist.buckets[i]
! f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n))
def plot(files):
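The effect of this fix is that outputHist now emits a row for every
bucket, zeros included, so the plotted line drops to the axis instead of
interpolating a gentle slope across the missing points. A small sketch of
the fixed behaviour (a simplified stand-in for the Hist object):

```python
def output_hist(buckets, nbuckets):
    # Emit one "percent count" row per bucket, including zero counts,
    # mirroring the fixed outputHist above.
    lines = []
    for i, n in enumerate(buckets):
        lines.append("%.3f %d" % ((100.0 * i) / nbuckets, n))
    return lines
```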
From barry@users.sourceforge.net Thu Sep 26 21:22:17 2002
From: barry@users.sourceforge.net (Barry Warsaw)
Date: Thu, 26 Sep 2002 13:22:17 -0700
Subject: [Spambayes-checkins] spambayes/email Message.py,1.1.1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/spambayes/email
In directory usw-pr-cvs1:/tmp/cvs-serv6467
Modified Files:
Message.py
Log Message:
Side-porting from the email package:
Fixing some RFC 2231 related issues as reported in the Spambayes
project, and with assistance from Oleg Broytmann. Specifically,
get_param(), get_params(): Document that these methods may return
parameter values that are either strings, or 3-tuples in the case of
RFC 2231 encoded parameters. The application should be prepared to
deal with such return values.
get_boundary(): Be prepared to deal with RFC 2231 encoded boundary
parameters. It makes little sense to have boundaries that are
anything but ascii, so if we get back a 3-tuple from get_param() we
will decode it into ascii and let any failures percolate up.
get_content_charset(): New method which treats the charset parameter
just like the boundary parameter in get_boundary(). Note that
"get_charset()" was already taken to return the default Charset
object.
get_charsets(): Rewrite to use get_content_charset().
Index: Message.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/email/Message.py,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** Message.py 23 Sep 2002 13:18:55 -0000 1.1.1.1
--- Message.py 26 Sep 2002 20:22:15 -0000 1.2
***************
*** 54,58 ****
def _unquotevalue(value):
if isinstance(value, TupleType):
! return (value[0], value[1], Utils.unquote(value[2]))
else:
return Utils.unquote(value)
--- 54,58 ----
def _unquotevalue(value):
if isinstance(value, TupleType):
! return value[0], value[1], Utils.unquote(value[2])
else:
return Utils.unquote(value)
***************
*** 510,515 ****
split on the `=' sign. The left hand side of the `=' is the key,
while the right hand side is the value. If there is no `=' sign in
! the parameter the value is the empty string. The value is always
! unquoted, unless unquote is set to a false value.
Optional failobj is the object to return if there is no Content-Type:
--- 510,515 ----
split on the `=' sign. The left hand side of the `=' is the key,
while the right hand side is the value. If there is no `=' sign in
! the parameter the value is the empty string. The value is as
! described in the get_param() method.
Optional failobj is the object to return if there is no Content-Type:
***************
*** 530,538 ****
Optional failobj is the object to return if there is no Content-Type:
! header. Optional header is the header to search instead of
! Content-Type:
! Parameter keys are always compared case insensitively. Values are
! always unquoted, unless unquote is set to a false value.
"""
if not self.has_key(header):
--- 530,550 ----
Optional failobj is the object to return if there is no Content-Type:
! header, or the Content-Type header has no such parameter. Optional
! header is the header to search instead of Content-Type:
! Parameter keys are always compared case insensitively. The return
! value can either be a string, or a 3-tuple if the parameter was RFC
! 2231 encoded. When it's a 3-tuple, the elements of the value are of
! the form (CHARSET, LANGUAGE, VALUE), where LANGUAGE may be the empty
! string. Your application should be prepared to deal with these, and
! can convert the parameter to a Unicode string like so:
!
! param = msg.get_param('foo')
! if isinstance(param, tuple):
! param = unicode(param[2], param[0])
!
! In any case, the parameter value (either the returned string, or the
! VALUE item in the 3-tuple) is always unquoted, unless unquote is set
! to a false value.
"""
if not self.has_key(header):
***************
*** 675,678 ****
--- 687,693 ----
if boundary is missing:
return failobj
+ if isinstance(boundary, TupleType):
+ # RFC 2231 encoded, so decode. It better end up as ascii
+ return unicode(boundary[2], boundary[0]).encode('us-ascii')
return _unquotevalue(boundary.strip())
***************
*** 728,731 ****
--- 743,761 ----
from email._compat21 import walk
+ def get_content_charset(self, failobj=None):
+ """Return the charset parameter of the Content-Type header.
+
+ If there is no Content-Type header, or if that header has no charset
+ parameter, failobj is returned.
+ """
+ missing = []
+ charset = self.get_param('charset', missing)
+ if charset is missing:
+ return failobj
+ if isinstance(charset, TupleType):
+ # RFC 2231 encoded, so decode it, and it better end up as ascii.
+ return unicode(charset[2], charset[0]).encode('us-ascii')
+ return charset
+
def get_charsets(self, failobj=None):
"""Return a list containing the charset(s) used in this message.
***************
*** 744,746 ****
message will still return a list of length 1.
"""
! return [part.get_param('charset', failobj) for part in self.walk()]
--- 774,776 ----
message will still return a list of length 1.
"""
! return [part.get_content_charset(failobj) for part in self.walk()]
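The docstring change above warns that get_param() can hand back either a
plain string or a (CHARSET, LANGUAGE, VALUE) 3-tuple for RFC 2231-encoded
parameters. A small sketch of the caller-side handling, written against
the modern email package, where email.utils.collapse_rfc2231_value does
the decoding the docstring spells out by hand:

```python
import email
import email.utils

# A Content-Type header carrying an RFC 2231-encoded parameter
# (charset'language'value syntax).
raw = ("Content-Type: text/plain; title*=us-ascii'en'hello\n"
       "\n"
       "body\n")
msg = email.message_from_string(raw)

param = msg.get_param('title')
if isinstance(param, tuple):
    # (CHARSET, LANGUAGE, VALUE), as described in get_param()'s docstring.
    param = email.utils.collapse_rfc2231_value(param)
```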
From gvanrossum@users.sourceforge.net Thu Sep 26 21:26:04 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Thu, 26 Sep 2002 13:26:04 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.37,1.38
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7821
Modified Files:
tokenizer.py
Log Message:
Now that the email package has been updated, we don't need to deal
with triples returned by get_charsets(). But we need to fix the
aliases dictionary to include 'ansi_x3_4_1968'.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** tokenizer.py 26 Sep 2002 08:40:20 -0000 1.37
--- tokenizer.py 26 Sep 2002 20:26:02 -0000 1.38
***************
*** 12,15 ****
--- 12,21 ----
from Options import options
+ # Patch encodings.aliases to recognize 'ansi_x3_4_1968'
+ from encodings.aliases import aliases # The aliases dictionary
+ if not aliases.has_key('ansi_x3_4_1968'):
+ aliases['ansi_x3_4_1968'] = 'ascii'
+ del aliases # Not needed any more
+
##############################################################################
# To fold case or not to fold case? I didn't want to fold case, because
***************
*** 730,736 ****
for x in msg.get_charsets(None):
if x is not None:
- if isinstance(x, tuple):
- assert len(x) == 3
- x = x[2]
yield 'charset:' + x.lower()
--- 736,739 ----
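The patch works because codecs.lookup() consults the
encodings.aliases.aliases dictionary when resolving a charset name, so
adding a key makes the odd name resolve. A sketch of the same trick
(current Pythons ship this alias already, so the guard keeps it harmless
either way):

```python
import codecs
from encodings.aliases import aliases

# Patch encodings.aliases the way the tokenizer checkin does, so the
# charset name some mailers emit resolves to the ascii codec.
if 'ansi_x3_4_1968' not in aliases:
    aliases['ansi_x3_4_1968'] = 'ascii'

codec = codecs.lookup('ansi_x3_4_1968')
```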
From tim_one@users.sourceforge.net Fri Sep 27 01:08:15 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 26 Sep 2002 17:08:15 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.38,1.39
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13081
Modified Files:
tokenizer.py
Log Message:
stylesheet_re: removed the IGNORECASE. The text is already lower()ed,
and IGNORECASE makes the engine do extra work.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** tokenizer.py 26 Sep 2002 20:26:02 -0000 1.38
--- tokenizer.py 27 Sep 2002 00:08:13 -0000 1.39
***************
*** 584,589 ****
# An equally cheap-ass gimmick to strip style sheets
! stylesheet_re = re.compile(r"",
! re.IGNORECASE|re.DOTALL)
received_host_re = re.compile(r'from (\S+)\s')
--- 584,588 ----
# An equally cheap-ass gimmick to strip style sheets
! stylesheet_re = re.compile(r"", re.DOTALL)
received_host_re = re.compile(r'from (\S+)\s')
From tim_one@users.sourceforge.net Fri Sep 27 02:28:46 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 26 Sep 2002 18:28:46 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.39,1.40
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30138
Modified Files:
tokenizer.py
Log Message:
Beefed up HTML stripping: html_re now also swallows <style> sections
(style sheets can be very long), making the separate stylesheet_re
pass unnecessary.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** tokenizer.py 27 Sep 2002 00:08:13 -0000 1.39
--- tokenizer.py 27 Sep 2002 01:28:43 -0000 1.40
***************
*** 578,584 ****
# An equally cheap-ass gimmick to strip style sheets
! stylesheet_re = re.compile(r"", re.DOTALL)
received_host_re = re.compile(r'from (\S+)\s')
--- 578,596 ----
html_re = re.compile(r"""
<
! (?![\s<>]) # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
! (?:
! # style sheets can be very long
! style\b # maybe it's ]{0,256} # search for the end '>', but don't run wild
! )
>
! """, re.VERBOSE | re.DOTALL)
received_host_re = re.compile(r'from (\S+)\s')
***************
*** 1047,1051 ****
if (part.get_content_type() == "text/plain" or
not options.retain_pure_html_tags):
- text = stylesheet_re.sub(' ', text)
text = html_re.sub(' ', text)
--- 1055,1058 ----
From nascheme@users.sourceforge.net Fri Sep 27 05:03:02 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Thu, 26 Sep 2002 21:03:02 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.33,1.34
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4115
Modified Files:
Options.py
Log Message:
Add mine_message_ids option.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** Options.py 25 Sep 2002 18:39:17 -0000 1.33
--- Options.py 27 Sep 2002 04:02:59 -0000 1.34
***************
*** 93,96 ****
--- 93,99 ----
mine_received_headers: False
+ # If set, the Message-Id is broken down into, hopefully, useful evidence.
+ mine_message_ids: False
+
[TestDriver]
# These control various displays in class TestDriver.Driver, and Tester.Test.
***************
*** 239,242 ****
--- 242,246 ----
'count_all_header_lines': boolean_cracker,
'mine_received_headers': boolean_cracker,
+ 'mine_message_ids': boolean_cracker,
'check_octets': boolean_cracker,
'octet_prefix_size': int_cracker,
From nascheme@users.sourceforge.net Fri Sep 27 05:06:15 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Thu, 26 Sep 2002 21:06:15 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.40,1.41
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4807
Modified Files:
tokenizer.py
Log Message:
Add basic message-id tokenization. Right now it just checks that it
exists and conforms to the usual syntax. If it does, the host part is
also returned. I tried doing more but the extra stuff was never
considered a good discriminator. Stupid wins again. :-)
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** tokenizer.py 27 Sep 2002 01:28:43 -0000 1.40
--- tokenizer.py 27 Sep 2002 04:06:12 -0000 1.41
***************
*** 597,600 ****
--- 597,602 ----
received_ip_re = re.compile(r'\s[[(]((\d{1,3}\.?){4})[\])]')
+ message_id_re = re.compile(r'\s*<[^@]+@([^>]+)>\s*')
+
# I'm usually just splitting on whitespace, but for subject lines I want to
# break things like "Python/Perl comparison?" up. OTOH, I don't want to
***************
*** 981,984 ****
--- 983,996 ----
for tok in breakdown(m.group(1).lower()):
yield 'received:' + tok
+
+ if options.mine_message_ids:
+ msgid = msg.get("message-id", "")
+ m = message_id_re.match(msgid)
+ if not m:
+ # might be weird instead of invalid but who cares?
+ yield 'message-id:invalid'
+ else:
+ # looks okay, return the hostname only
+ yield 'message-id:@%s' % m.group(1)
# As suggested by Anthony Baxter, merely counting the number of
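The mining logic above boils down to one token per message. A hypothetical
helper (not in the checkin) wrapping the checkin's message_id_re shows the
two outcomes:

```python
import re

# The checkin's message_id_re: accept "<local@host>" and capture the host.
message_id_re = re.compile(r'\s*<[^@]+@([^>]+)>\s*')

def message_id_tokens(msgid):
    # Mirrors the tokenizer logic above: exactly one token either way.
    m = message_id_re.match(msgid)
    if not m:
        # might be weird instead of invalid but who cares?
        return 'message-id:invalid'
    # looks okay, return the hostname only
    return 'message-id:@%s' % m.group(1)
```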
From anthonybaxter@users.sourceforge.net Fri Sep 27 09:36:06 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Fri, 27 Sep 2002 01:36:06 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.14,1.15
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11761
Modified Files:
TestDriver.py
Log Message:
more mixed line endings.
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** TestDriver.py 26 Sep 2002 01:10:29 -0000 1.14
--- TestDriver.py 27 Sep 2002 08:36:03 -0000 1.15
***************
*** 207,218 ****
if options.show_histograms:
printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
!
! if options.save_histogram_pickles:
! for f, h in (('ham', self.global_ham_hist), ('spam', self.global_spam_hist)):
! fname = "%s_%shist.pik" % (options.pickle_basename, f)
! print " saving %s histogram pickle to %s" %(f, fname)
! fp = file(fname, 'wb')
! pickle.dump(h, fp, 1)
! fp.close()
def test(self, ham, spam):
--- 207,219 ----
if options.show_histograms:
printhist("all runs:", self.global_ham_hist, self.global_spam_hist)
!
! if options.save_histogram_pickles:
! for f, h in (('ham', self.global_ham_hist),
! ('spam', self.global_spam_hist)):
! fname = "%s_%shist.pik" % (options.pickle_basename, f)
! print " saving %s histogram pickle to %s" %(f, fname)
! fp = file(fname, 'wb')
! pickle.dump(h, fp, 1)
! fp.close()
def test(self, ham, spam):
From gvanrossum@users.sourceforge.net Fri Sep 27 19:48:07 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 27 Sep 2002 11:48:07 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.21,1.22
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22991
Modified Files:
hammie.py
Log Message:
Patch inspired by Alexander Leiding to support multiple -g, -s, -u
arguments.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** hammie.py 24 Sep 2002 00:38:37 -0000 1.21
--- hammie.py 27 Sep 2002 18:48:05 -0000 1.22
***************
*** 11,18 ****
--- 11,21 ----
-g PATH
mbox or directory of known good messages (non-spam) to train on.
+ Can be specified more than once.
-s PATH
mbox or directory of known spam messages to train on.
+ Can be specified more than once.
-u PATH
mbox of unknown messages. A ham/spam decision is reported for each.
+ Can be specified more than once.
-p FILE
use file as the persistent store. loads data from this file if it
***************
*** 264,268 ****
pck = DEFAULTDB
! good = spam = unknown = None
do_filter = usedb = False
for opt, arg in opts:
--- 267,273 ----
pck = DEFAULTDB
! good = []
! spam = []
! unknown = []
do_filter = usedb = False
for opt, arg in opts:
***************
*** 270,276 ****
usage(0)
elif opt == '-g':
! good = arg
elif opt == '-s':
! spam = arg
elif opt == '-p':
pck = arg
--- 275,281 ----
usage(0)
elif opt == '-g':
! good.append(arg)
elif opt == '-s':
! spam.append(arg)
elif opt == '-p':
pck = arg
***************
*** 280,284 ****
do_filter = True
elif opt == '-u':
! unknown = arg
if args:
usage(2, "Positional arguments not allowed")
--- 285,289 ----
do_filter = True
elif opt == '-u':
! unknown.append(arg)
if args:
usage(2, "Positional arguments not allowed")
***************
*** 289,298 ****
if good:
! print "Training ham:"
! train(bayes, good, False)
save = True
if spam:
! print "Training spam:"
! train(bayes, spam, True)
save = True
--- 294,305 ----
if good:
! for g in good:
! print "Training ham (%s):" % g
! train(bayes, g, False)
save = True
if spam:
! for s in spam:
! print "Training spam (%s):" % s
! train(bayes, s, True)
save = True
***************
*** 308,312 ****
if unknown:
! score(bayes, unknown)
if __name__ == "__main__":
--- 315,322 ----
if unknown:
! for u in unknown:
! if len(unknown) > 1:
! print "Scoring", u
! score(bayes, u)
if __name__ == "__main__":
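The pattern in this patch is the standard getopt idiom for repeatable
options: append each occurrence to a list instead of overwriting a single
variable. A minimal sketch of the option handling after the change:

```python
import getopt

def parse(argv):
    # Repeated -g/-s/-u flags accumulate into lists, as in the patch.
    opts, args = getopt.getopt(argv, 'g:s:u:')
    good, spam, unknown = [], [], []
    for opt, arg in opts:
        if opt == '-g':
            good.append(arg)
        elif opt == '-s':
            spam.append(arg)
        elif opt == '-u':
            unknown.append(arg)
    return good, spam, unknown
```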
From npickett@users.sourceforge.net Fri Sep 27 20:40:27 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Fri, 27 Sep 2002 12:40:27 -0700
Subject: [Spambayes-checkins]
spambayes hammie.py,1.22,1.23 hammiesrv.py,1.2,1.3 runtest.sh,1.3,1.4
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7026
Modified Files:
hammie.py hammiesrv.py runtest.sh
Log Message:
* hammie.py now has a Hammie class, which hammiesrv now uses.
hammie.py could still stand some more clean-up. Don't worry, I'm
on it :)
* runtest now has a run1 target to generate the first data
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** hammie.py 27 Sep 2002 18:48:05 -0000 1.22
--- hammie.py 27 Sep 2002 19:40:21 -0000 1.23
***************
*** 61,65 ****
class DBDict:
! """Database Dictionary
This wraps an anydbm to make it look even more like a dictionary.
--- 61,66 ----
class DBDict:
!
! """Database Dictionary.
This wraps an anydbm to make it look even more like a dictionary.
***************
*** 136,140 ****
class PersistentGrahamBayes(classifier.GrahamBayes):
! """A persistent GrahamBayes classifier
This is just like classifier.GrahamBayes, except that the dictionary
--- 137,142 ----
class PersistentGrahamBayes(classifier.GrahamBayes):
!
! """A persistent GrahamBayes classifier.
This is just like classifier.GrahamBayes, except that the dictionary
***************
*** 177,181 ****
! def train(bayes, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
mbox = mboxutils.getmbox(msgs)
--- 179,303 ----
! class Hammie:
!
! """A spambayes mail filter"""
!
! def __init__(self, bayes):
! self.bayes = bayes
!
! def _scoremsg(self, msg, evidence=False):
! """Score a Message.
!
! msg can be a string, a file object, or a Message object.
!
! Returns the probability the message is spam. If evidence is
! true, returns a tuple: (probability, clues), where clues is a
! list of the words which contributed to the score.
!
! """
!
! return self.bayes.spamprob(tokenize(msg), evidence)
!
! def formatclues(self, clues, sep="; "):
! """Format the clues into something readable."""
!
! return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])
!
! def score(self, msg, evidence=False):
! """Score (judge) a message.
!
! msg can be a string, a file object, or a Message object.
!
! Returns the probability the message is spam. If evidence is
! true, returns a tuple: (probability, clues), where clues is a
! list of the words which contributed to the score.
!
! """
!
! try:
! return self._scoremsg(msg, evidence)
! except:
! print msg
! import traceback
! traceback.print_exc()
!
! def filter(self, msg, header=DISPHEADER, cutoff=SPAM_THRESHOLD):
! """Score (judge) a message and add a disposition header.
!
! msg can be a string, a file object, or a Message object.
!
! Optionally, set header to the name of the header to add, and/or
! cutoff to the probability value which must be met or exceeded
! for a message to get a 'Yes' disposition.
!
! Returns the same message with a new disposition header.
!
! """
!
! if hasattr(msg, "readlines"):
! msg = email.message_from_file(msg)
! elif not hasattr(msg, "add_header"):
! msg = email.message_from_string(msg)
! prob, clues = self._scoremsg(msg, True)
! if prob < cutoff:
! disp = "No"
! else:
! disp = "Yes"
! disp += "; %.2f" % prob
! disp += "; " + self.formatclues(clues)
! msg.add_header(header, disp)
! return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
!
! def train(self, msg, is_spam):
! """Train bayes with a message.
!
! msg can be a string, a file object, or a Message object.
!
! is_spam should be 1 if the message is spam, 0 if not.
!
! Probabilities are not updated after this call is made; to do
! that, call update_probabilities().
!
! """
!
! self.bayes.learn(tokenize(msg), is_spam, False)
!
! def train_ham(self, msg):
! """Train bayes with ham.
!
! msg can be a string, a file object, or a Message object.
!
! Probabilities are not updated after this call is made; to do
! that, call update_probabilities().
!
! """
!
! self.train(msg, False)
!
! def train_spam(self, msg):
! """Train bayes with spam.
!
! msg can be a string, a file object, or a Message object.
!
! Probabilities are not updated after this call is made; to do
! that, call update_probabilities().
!
! """
!
! self.train(msg, True)
!
! def update_probabilities(self):
! """Update probability values.
!
! You would want to call this after a training session. It's
! pretty slow, so if you have a lot of messages to train, wait
! until you're all done before calling this.
!
! """
!
! self.bayes.update_probabilities()
!
!
! def train(hammie, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
mbox = mboxutils.getmbox(msgs)
***************
*** 187,211 ****
sys.stdout.write("\r%6d" % i)
sys.stdout.flush()
! bayes.learn(tokenize(msg), is_spam, False)
print
! def formatclues(clues, sep="; "):
! """Format the clues into something readable."""
! return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues])
!
! def filter(bayes, input, output):
! """Filter (judge) a message"""
! msg = email.message_from_file(input)
! prob, clues = bayes.spamprob(tokenize(msg), True)
! if prob < SPAM_THRESHOLD:
! disp = "No"
! else:
! disp = "Yes"
! disp += "; %.2f" % prob
! disp += "; " + formatclues(clues)
! msg.add_header(DISPHEADER, disp)
! output.write(msg.as_string(unixfrom=(msg.get_unixfrom() is not None)))
!
! def score(bayes, msgs):
"""Score (judge) all messages from a mailbox."""
# XXX The reporting needs work!
--- 309,316 ----
sys.stdout.write("\r%6d" % i)
sys.stdout.flush()
! hammie.train(msg, is_spam)
print
! def score(hammie, msgs):
"""Score (judge) all messages from a mailbox."""
# XXX The reporting needs work!
***************
*** 215,219 ****
for msg in mbox:
i += 1
! prob, clues = bayes.spamprob(tokenize(msg), True)
isspam = prob >= SPAM_THRESHOLD
if hasattr(msg, '_mh_msgno'):
--- 320,324 ----
for msg in mbox:
i += 1
! prob, clues = hammie.score(msg, True)
isspam = prob >= SPAM_THRESHOLD
if hasattr(msg, '_mh_msgno'):
***************
*** 224,228 ****
spams += 1
print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
! print formatclues(clues)
else:
hams += 1
--- 329,333 ----
spams += 1
print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
! print hammie.formatclues(clues)
else:
hams += 1
***************
*** 292,309 ****
bayes = createbayes(pck, usedb)
! if good:
! for g in good:
! print "Training ham (%s):" % g
! train(bayes, g, False)
save = True
! if spam:
! for s in spam:
! print "Training spam (%s):" % s
! train(bayes, s, True)
save = True
if save:
! bayes.update_probabilities()
if not usedb and pck:
fp = open(pck, 'wb')
--- 397,414 ----
bayes = createbayes(pck, usedb)
+ h = Hammie(bayes)
! for g in good:
! print "Training ham (%s):" % g
! train(h, g, False)
save = True
!
! for s in spam:
! print "Training spam (%s):" % s
! train(h, s, True)
save = True
if save:
! h.update_probabilities()
if not usedb and pck:
fp = open(pck, 'wb')
***************
*** 312,316 ****
if do_filter:
! filter(bayes, sys.stdin, sys.stdout)
if unknown:
--- 417,423 ----
if do_filter:
! msg = sys.stdin.read()
! filtered = h.filter(msg)
! sys.stdout.write(filtered)
if unknown:
***************
*** 318,322 ****
if len(unknown) > 1:
print "Scoring", u
! score(bayes, u)
if __name__ == "__main__":
--- 425,429 ----
if len(unknown) > 1:
print "Scoring", u
! score(h, u)
if __name__ == "__main__":
Index: hammiesrv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** hammiesrv.py 23 Sep 2002 21:20:10 -0000 1.2
--- hammiesrv.py 27 Sep 2002 19:40:22 -0000 1.3
***************
*** 3,139 ****
# A server version of hammie.py
- # Server code
! import SimpleXMLRPCServer
! import email
! import hammie
! from tokenizer import tokenize
!
! # Default header to add
! DFL_HEADER = "X-Hammie-Disposition"
!
! # Default spam cutoff
! DFL_CUTOFF = 0.9
!
! class Hammie:
! def __init__(self, bayes):
! self.bayes = bayes
! def _scoremsg(self, msg, evidence=False):
! """Score an email.Message.
! Returns the probability the message is spam. If evidence is
! true, returns a tuple: (probability, clues), where clues is a
! list of the words which contributed to the score.
! """
! return self.bayes.spamprob(tokenize(msg), evidence)
! def score(self, msg, evidence=False):
! """Score (judge) a message.
! Pass in a message as a string.
! Returns the probability the message is spam. If evidence is
! true, returns a tuple: (probability, clues), where clues is a
! list of the words which contributed to the score.
"""
! return self._scoremsg(email.message_from_string(msg), evidence)
!
! def filter(self, msg, header=DFL_HEADER, cutoff=DFL_CUTOFF):
! """Score (judge) a message and add a disposition header.
!
! Pass in a message as a string. Optionally, set header to the
! name of the header to add, and/or cutoff to the probability
! value which must be met or exceeded for a message to get a 'Yes'
! disposition.
!
! Returns the same message with a new disposition header.
!
! """
! msg = email.message_from_string(msg)
! prob, clues = self._scoremsg(msg, True)
! if prob < cutoff:
! disp = "No"
else:
! disp = "Yes"
! disp += "; %.2f" % prob
! disp += "; " + hammie.formatclues(clues)
! msg.add_header(header, disp)
! return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
!
! def train(self, msg, is_spam):
! """Train bayes with a message.
!
! msg should be the message as a string, and is_spam should be 1
! if the message is spam, 0 if not.
!
! Probabilities are not updated after this call is made; to do
! that, call update_probabilities().
!
! """
!
! self.bayes.learn(tokenize(msg), is_spam, False)
!
! def train_ham(self, msg):
! """Train bayes with ham.
!
! msg should be the message as a string.
!
! Probabilities are not updated after this call is made; to do
! that, call update_probabilities().
!
! """
!
! self.train(msg, False)
!
! def train_spam(self, msg):
! """Train bayes with spam.
!
! msg should be the message as a string.
!
! Probabilities are not updated after this call is made; to do
! that, call update_probabilities().
!
! """
! self.train(msg, True)
! def update_probabilities(self):
! """Update probability values.
- You would want to call this after a training session. It's
- pretty slow, so if you have a lot of messages to train, wait
- until you're all done before calling this.
! """
! self.bayes.update_probabilities()
! def main():
! usedb = True
! pck = "/home/neale/lib/hammie.db"
! if usedb:
! bayes = hammie.PersistentGrahamBayes(pck)
! else:
! bayes = None
! try:
! fp = open(pck, 'rb')
! except IOError, e:
! if e.errno <> errno.ENOENT: raise
! else:
! bayes = pickle.load(fp)
! fp.close()
! if bayes is None:
! import classifier
! bayes = classifier.GrahamBayes()
! server = SimpleXMLRPCServer.SimpleXMLRPCServer(("localhost", 7732))
! server.register_instance(Hammie(bayes))
server.serve_forever()
--- 3,121 ----
# A server version of hammie.py
! """Usage: %(program)s [options] IP:PORT
! Where:
! -h
! show usage and exit
! -p FILE
! use file as the persistent store. loads data from this file if it
! exists, and saves data to this file at the end. Default: %(DEFAULTDB)s
! -d
! use the DBM store instead of cPickle. The file is larger and
! creating it is slower, but checking against it is much faster,
! especially for large word databases.
! IP
! IP address to bind (use 0.0.0.0 to listen on all IPs of this machine)
! PORT
! Port number to listen to.
! """
! import SimpleXMLRPCServer
! import getopt
! import sys
! import traceback
! import xmlrpclib
! import hammie
! program = sys.argv[0] # For usage(); referenced by docstring above
! # Default DB path
! DEFAULTDB = hammie.DEFAULTDB
! class HammieHandler(SimpleXMLRPCServer.SimpleXMLRPCRequestHandler):
! def do_POST(self):
! """Handles the HTTP POST request.
! Attempts to interpret all HTTP POST requests as XML-RPC calls,
! which are forwarded to the _dispatch method for handling.
+ This one also prints out tracebacks, to help me debug :)
"""
! try:
! # get arguments
! data = self.rfile.read(int(self.headers["content-length"]))
! params, method = xmlrpclib.loads(data)
! # generate response
! try:
! response = self._dispatch(method, params)
! # wrap response in a singleton tuple
! response = (response,)
! except:
! # report exception back to server
! response = xmlrpclib.dumps(
! xmlrpclib.Fault(1, "%s:%s" % (sys.exc_type, sys.exc_value))
! )
! else:
! response = xmlrpclib.dumps(response, methodresponse=1)
! except:
! # internal error, report as HTTP server error
! traceback.print_exc()
! print `data`
! self.send_response(500)
! self.end_headers()
else:
! # got a valid XML RPC response
! self.send_response(200)
! self.send_header("Content-type", "text/xml")
! self.send_header("Content-length", str(len(response)))
! self.end_headers()
! self.wfile.write(response)
! # shut down the connection
! self.wfile.flush()
! self.connection.shutdown(1)
!
! def usage(code, msg=''):
! """Print usage message and sys.exit(code)."""
! if msg:
! print >> sys.stderr, msg
! print >> sys.stderr
! print >> sys.stderr, __doc__ % globals()
! sys.exit(code)
! def main():
! """Main program; parse options and go."""
! try:
! opts, args = getopt.getopt(sys.argv[1:], 'hdp:')
! except getopt.error, msg:
! usage(2, msg)
! pck = DEFAULTDB
! usedb = False
! for opt, arg in opts:
! if opt == '-h':
! usage(0)
! elif opt == '-p':
! pck = arg
! elif opt == "-d":
! usedb = True
! if len(args) != 1:
! usage(2, "IP:PORT not specified")
! ip, port = args[0].split(":")
! port = int(port)
!
! bayes = hammie.createbayes(pck, usedb)
! h = hammie.Hammie(bayes)
! server = SimpleXMLRPCServer.SimpleXMLRPCServer((ip, port), HammieHandler)
! server.register_instance(h)
server.serve_forever()
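The do_POST logic above (decode the POST body, dispatch the call, wrap the result in a one-tuple or report a Fault) can be exercised without a running server. This is a hypothetical sketch using the modern xmlrpc.client module rather than Python 2's xmlrpclib; the dispatch function here is a stand-in for the registered Hammie instance and is not part of the checkin.

```python
import xmlrpc.client as xmlrpclib

def handle_post(data, dispatch):
    # Core of HammieHandler.do_POST: decode the request, call the
    # method, and wrap either the result (in a one-tuple) or a Fault.
    params, method = xmlrpclib.loads(data)
    try:
        response = (dispatch(method, params),)
    except Exception as e:
        return xmlrpclib.dumps(xmlrpclib.Fault(1, "%s:%s" % (type(e).__name__, e)))
    return xmlrpclib.dumps(response, methodresponse=True)

def dispatch(method, params):
    # Hypothetical stand-in for the registered Hammie instance.
    if method == "score":
        return 0.87
    raise ValueError("unknown method %r" % (method,))

request = xmlrpclib.dumps((), "score")
reply = handle_post(request, dispatch)
print(xmlrpclib.loads(reply)[0][0])   # -> 0.87
```

Decoding a Fault reply with xmlrpclib.loads() raises xmlrpclib.Fault on the client side, which is how the XML-RPC error channel surfaces to callers.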
Index: runtest.sh
===================================================================
RCS file: /cvsroot/spambayes/spambayes/runtest.sh,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** runtest.sh 19 Sep 2002 00:17:41 -0000 1.3
--- runtest.sh 27 Sep 2002 19:40:22 -0000 1.4
***************
*** 40,43 ****
--- 40,46 ----
case "$TEST" in
+ run1)
+ python timcv.py -n $SETS > run1.txt
+ ;;
run2|useold)
python timcv.py -n $SETS > run2.txt
From tim_one@users.sourceforge.net Fri Sep 27 22:04:08 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 27 Sep 2002 14:04:08 -0700
Subject: [Spambayes-checkins]
spambayes HistToGNU.py,1.5,1.6 TestDriver.py,1.15,1.16
hammie.py,1.23,1.24 hammiesrv.py,1.3,1.4 setup.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3485
Modified Files:
HistToGNU.py TestDriver.py hammie.py hammiesrv.py setup.py
Log Message:
Whitespace normalization, prior to tagging.
Index: HistToGNU.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** HistToGNU.py 26 Sep 2002 08:47:29 -0000 1.5
--- HistToGNU.py 27 Sep 2002 21:04:05 -0000 1.6
***************
*** 62,66 ****
cmd.write('pause 100\n')
print cmd.getvalue()
!
def main():
import getopt
--- 62,66 ----
cmd.write('pause 100\n')
print cmd.getvalue()
!
def main():
import getopt
***************
*** 77,81 ****
fname = "%s_%shist.pik" % (options.pickle_basename, f)
args.append(fname)
!
if args:
plot(args)
--- 77,81 ----
fname = "%s_%shist.pik" % (options.pickle_basename, f)
args.append(fname)
!
if args:
plot(args)
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** TestDriver.py 27 Sep 2002 08:36:03 -0000 1.15
--- TestDriver.py 27 Sep 2002 21:04:06 -0000 1.16
***************
*** 209,213 ****
if options.save_histogram_pickles:
! for f, h in (('ham', self.global_ham_hist),
('spam', self.global_spam_hist)):
fname = "%s_%shist.pik" % (options.pickle_basename, f)
--- 209,213 ----
if options.save_histogram_pickles:
! for f, h in (('ham', self.global_ham_hist),
('spam', self.global_spam_hist)):
fname = "%s_%shist.pik" % (options.pickle_basename, f)
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** hammie.py 27 Sep 2002 19:40:21 -0000 1.23
--- hammie.py 27 Sep 2002 21:04:06 -0000 1.24
***************
*** 182,186 ****
"""A spambayes mail filter"""
!
def __init__(self, bayes):
self.bayes = bayes
--- 182,186 ----
"""A spambayes mail filter"""
!
def __init__(self, bayes):
self.bayes = bayes
***************
*** 198,202 ****
return self.bayes.spamprob(tokenize(msg), evidence)
!
def formatclues(self, clues, sep="; "):
"""Format the clues into something readable."""
--- 198,202 ----
return self.bayes.spamprob(tokenize(msg), evidence)
!
def formatclues(self, clues, sep="; "):
"""Format the clues into something readable."""
***************
*** 230,234 ****
cutoff to the probability value which must be met or exceeded
for a message to get a 'Yes' disposition.
!
Returns the same message with a new disposition header.
--- 230,234 ----
cutoff to the probability value which must be met or exceeded
for a message to get a 'Yes' disposition.
!
Returns the same message with a new disposition header.
***************
*** 258,264 ****
Probabilities are not updated after this call is made; to do
that, call update_probabilities().
!
"""
!
self.bayes.learn(tokenize(msg), is_spam, False)
--- 258,264 ----
Probabilities are not updated after this call is made; to do
that, call update_probabilities().
!
"""
!
self.bayes.learn(tokenize(msg), is_spam, False)
***************
*** 295,301 ****
"""
!
self.bayes.update_probabilities()
!
def train(hammie, msgs, is_spam):
--- 295,301 ----
"""
!
self.bayes.update_probabilities()
!
def train(hammie, msgs, is_spam):
Index: hammiesrv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** hammiesrv.py 27 Sep 2002 19:40:22 -0000 1.3
--- hammiesrv.py 27 Sep 2002 21:04:06 -0000 1.4
***************
*** 79,83 ****
self.wfile.flush()
self.connection.shutdown(1)
!
def usage(code, msg=''):
--- 79,83 ----
self.wfile.flush()
self.connection.shutdown(1)
!
def usage(code, msg=''):
***************
*** 112,116 ****
ip, port = args[0].split(":")
port = int(port)
!
bayes = hammie.createbayes(pck, usedb)
h = hammie.Hammie(bayes)
--- 112,116 ----
ip, port = args[0].split(":")
port = int(port)
!
bayes = hammie.createbayes(pck, usedb)
h = hammie.Hammie(bayes)
Index: setup.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/setup.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** setup.py 24 Sep 2002 18:07:17 -0000 1.5
--- setup.py 27 Sep 2002 21:04:06 -0000 1.6
***************
*** 2,6 ****
setup(
! name='spambayes',
scripts=['unheader.py',
'hammie.py',
--- 2,6 ----
setup(
! name='spambayes',
scripts=['unheader.py',
'hammie.py',
From tim_one@users.sourceforge.net Fri Sep 27 22:18:20 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 27 Sep 2002 14:18:20 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.16,1.17
Tester.py,1.4,1.5
classifier.py,1.20,1.21 hammie.py,1.24,1.25 neiltrain.py,1.2,1.3
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8335
Modified Files:
TestDriver.py Tester.py classifier.py hammie.py neiltrain.py
Log Message:
Renamed class GrahamBayes to Bayes. hammie.py may wish to rename its
derived class similarly.
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** TestDriver.py 27 Sep 2002 21:04:06 -0000 1.16
--- TestDriver.py 27 Sep 2002 21:18:18 -0000 1.17
***************
*** 161,165 ****
def new_classifier(self):
! c = self.classifier = classifier.GrahamBayes()
self.tester = Tester.Test(c)
self.trained_ham_hist = Hist(options.nbuckets)
--- 161,165 ----
def new_classifier(self):
! c = self.classifier = classifier.Bayes()
self.tester = Tester.Test(c)
self.trained_ham_hist = Hist(options.nbuckets)
Index: Tester.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Tester.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** Tester.py 19 Sep 2002 06:30:15 -0000 1.4
--- Tester.py 27 Sep 2002 21:18:18 -0000 1.5
***************
*** 2,6 ****
class Test:
! # Pass a classifier instance (an instance of GrahamBayes).
# Loop:
# # Train the classifier with new ham and spam.
--- 2,6 ----
class Test:
! # Pass a classifier instance (an instance of Bayes).
# Loop:
# # Train the classifier with new ham and spam.
***************
*** 128,132 ****
_easy_test = """
! >>> from classifier import GrahamBayes
>>> good1 = _Example('', ['a', 'b', 'c'] * 10)
--- 128,132 ----
_easy_test = """
! >>> from classifier import Bayes
>>> good1 = _Example('', ['a', 'b', 'c'] * 10)
***************
*** 134,138 ****
>>> bad1 = _Example('', ['d'] * 10)
! >>> t = Test(GrahamBayes())
>>> t.train([good1, good2], [bad1])
>>> t.predict([_Example('goodham', ['a', 'b']),
--- 134,138 ----
>>> bad1 = _Example('', ['d'] * 10)
! >>> t = Test(Bayes())
>>> t.train([good1, good2], [bad1])
>>> t.predict([_Example('goodham', ['a', 'b']),
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** classifier.py 24 Sep 2002 22:14:01 -0000 1.20
--- classifier.py 27 Sep 2002 21:18:18 -0000 1.21
***************
*** 217,221 ****
self.spamprob) = t
! class GrahamBayes(object):
__slots__ = ('wordinfo', # map word to WordInfo record
'nspam', # number of spam messages learn() has seen
--- 217,221 ----
self.spamprob) = t
! class Bayes(object):
__slots__ = ('wordinfo', # map word to WordInfo record
'nspam', # number of spam messages learn() has seen
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** hammie.py 27 Sep 2002 21:04:06 -0000 1.24
--- hammie.py 27 Sep 2002 21:18:18 -0000 1.25
***************
*** 136,144 ****
! class PersistentGrahamBayes(classifier.GrahamBayes):
! """A persistent GrahamBayes classifier.
! This is just like classifier.GrahamBayes, except that the dictionary
is a database. You take less disk this way, I think, and you can
pretend it's persistent. It's much slower training, but much faster
--- 136,144 ----
! class PersistentGrahamBayes(classifier.Bayes):
! """A persistent Bayes classifier.
! This is just like classifier.Bayes, except that the dictionary
is a database. You take less disk this way, I think, and you can
pretend it's persistent. It's much slower training, but much faster
***************
*** 161,165 ****
def __init__(self, dbname):
! classifier.GrahamBayes.__init__(self)
self.statekey = "saved state"
self.wordinfo = DBDict(dbname, (self.statekey,))
--- 161,165 ----
def __init__(self, dbname):
! classifier.Bayes.__init__(self)
self.statekey = "saved state"
self.wordinfo = DBDict(dbname, (self.statekey,))
***************
*** 335,339 ****
def createbayes(pck=DEFAULTDB, usedb=False):
! """Create a GrahamBayes instance for the given pickle (which
doesn't have to exist). Create a PersistentGrahamBayes if
usedb is True."""
--- 335,339 ----
def createbayes(pck=DEFAULTDB, usedb=False):
! """Create a Bayes instance for the given pickle (which
doesn't have to exist). Create a PersistentGrahamBayes if
usedb is True."""
***************
*** 350,354 ****
fp.close()
if bayes is None:
! bayes = classifier.GrahamBayes()
return bayes
--- 350,354 ----
fp.close()
if bayes is None:
! bayes = classifier.Bayes()
return bayes
Index: neiltrain.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/neiltrain.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** neiltrain.py 20 Sep 2002 19:32:26 -0000 1.2
--- neiltrain.py 27 Sep 2002 21:18:18 -0000 1.3
***************
*** 39,43 ****
ham_name = sys.argv[2]
db_name = sys.argv[3]
! bayes = classifier.GrahamBayes()
print 'Training with spam...'
train(bayes, spam_name, True)
--- 39,43 ----
ham_name = sys.argv[2]
db_name = sys.argv[3]
! bayes = classifier.Bayes()
print 'Training with spam...'
train(bayes, spam_name, True)
From tim_one@users.sourceforge.net Fri Sep 27 23:29:58 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 27 Sep 2002 15:29:58 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.34,1.35 classifier.py,1.21,1.22
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29156
Modified Files:
Options.py classifier.py
Log Message:
Gary's "f(w)" scheme is now the default, and code unique to the
Graham scheme has gone away (but was tagged with Last-Graham).
These options have vanished:
hambias
spambias
min_spamprob
max_spamprob
unknown_word_spamprob
use_robinson_combining
use_robinson_probability
use_robinson_ranking
These options have changed default value:
robinson_probability_a: 0.225 (was 1.0)
robinson_minimum_prob_strength: 0.1 (was 0.0)
max_discriminators: 150 (was 16)
spam_cutoff: 0.570 (was 0.90) # THIS IS CORPUS-DEPENDENT!
In addition, I did a little long-overdue refactoring of the classifier
internals. The visible interface hasn't changed.
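As a rough illustration of the f(w) scheme these new defaults feed into, here is a hypothetical sketch of Robinson's word-probability adjustment. The parameter names follow the option names above (a, x); this is not the actual classifier.py code, and the checkin's internals may differ in detail.

```python
def robinson_prob(hamcount, spamcount, nham, nspam, a=0.225, x=0.5):
    # p(w): naive spam probability from counts, with each count
    # normalized by its corpus size so unbalanced corpora don't skew it.
    n = hamcount + spamcount
    if n == 0:
        return x  # never-seen word: pure prior
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    p = spamratio / (hamratio + spamratio)
    # f(w): shrink p toward the prior x; "a" weights the prior against
    # the n actual sightings (a=0 trusts counts fully, a->inf gives x).
    return (a * x + n * p) / (a + n)

# One sighting, spam only: no longer certainty (1.0), but still spammy.
print(robinson_prob(0, 1, 4000, 2750))   # ~0.91
```

Note how this removes the need for the vanished min_spamprob/max_spamprob clamps: a word seen only in spam no longer gets probability 1.0, because the prior pulls it back toward x in proportion to how little evidence there is.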
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** Options.py 27 Sep 2002 04:02:59 -0000 1.34
--- Options.py 27 Sep 2002 22:29:56 -0000 1.35
***************
*** 100,110 ****
# A message is considered spam iff it scores greater than spam_cutoff.
! # If using Graham's combining scheme, 0.90 seems to work best for "small"
! # training sets. As the size of the training sets increase, there's not
! # yet any bound in sight for how low this can go (0.075 would work as
! # well as 0.90 on Tim's large c.l.py data).
! # For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far.
! spam_cutoff: 0.90
# Number of buckets in histograms.
--- 100,106 ----
# A message is considered spam iff it scores greater than spam_cutoff.
! # This is corpus-dependent, and values into the .600's have been known
! # to work best on some data.
! spam_cutoff: 0.570
# Number of buckets in histograms.
***************
*** 174,219 ****
[Classifier]
! # Fiddling these can have extreme effects. See classifier.py for comments.
! hambias: 2.0
! spambias: 1.0
!
! min_spamprob: 0.01
! max_spamprob: 0.99
! unknown_spamprob: 0.5
!
! max_discriminators: 16
!
! ###########################################################################
! # Speculative options for Gary Robinson's ideas. These may go away, or
! # a bunch of incompatible stuff above may go away.
!
! # Use Gary's scheme for combining probabilities.
! use_robinson_combining: False
! # Use Gary's scheme for computing probabilities, along with its "a" and
! # "x" parameters.
! use_robinson_probability: False
! robinson_probability_a: 1.0
robinson_probability_x: 0.5
- # Use Gary's scheme for ranking probabilities.
- use_robinson_ranking: False
-
# When scoring a message, ignore all words with
# abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
! # By default (0.0), nothing is ignored.
! # Tim got a pretty clear improvement in f-n rate on his hasn't-improved-in-
! # a-long-time large c.l.py test by using 0.1. No other values have been
! # tried yet.
! # Neil Schemenauer also reported good results from 0.1, making the all-
! # Robinson scheme match the all-default Graham-like scheme on a smaller
! # and different corpus.
! # NOTE: Changing this may change the best spam_cutoff value for your
! # corpus. Since one effect is to separate the means more, you'll probably
! # want a higher spam_cutoff.
! robinson_minimum_prob_strength: 0.0
###########################################################################
! # More speculative options for Gary Robinson's central-limit. These may go
# away, or a bunch of incompatible stuff above may go away.
--- 170,204 ----
[Classifier]
! # The maximum number of extreme words to look at in a msg, where "extreme"
! # means with spamprob farthest away from 0.5. 150 appears to work well
! # across all corpora tested.
! max_discriminators: 150
! # These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that's never been
! # seen before. Nobody has reported an improvement via moving it away
! # from 1/2.
! # "a" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting. At a=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that's appeared in only ham or only spam. This is a disaster.
! # As "a" tends toward infinity, all probabilities tend toward "x". All
! # reports were that a value near 0.2 worked best, so this doesn't seem to
! # be corpus-dependent.
! # XXX Gary Robinson has since renamed "a" to "s", and redone his formulas
! # XXX to make it a measure of belief strength rather than "a number" from
! # XXX 0 to infinity. We haven't caught up to that yet.
! robinson_probability_a: 0.225
robinson_probability_x: 0.5
# When scoring a message, ignore all words with
# abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
! # This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinson's base scheme. 0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
###########################################################################
! # Speculative options for Gary Robinson's central-limit ideas. These may go
# away, or a bunch of incompatible stuff above may go away.
***************
*** 268,282 ****
'best_cutoff_fp_weight': float_cracker,
},
! 'Classifier': {'hambias': float_cracker,
! 'spambias': float_cracker,
! 'min_spamprob': float_cracker,
! 'max_spamprob': float_cracker,
! 'unknown_spamprob': float_cracker,
! 'max_discriminators': int_cracker,
! 'use_robinson_combining': boolean_cracker,
! 'use_robinson_probability': boolean_cracker,
'robinson_probability_a': float_cracker,
'robinson_probability_x': float_cracker,
- 'use_robinson_ranking': boolean_cracker,
'robinson_minimum_prob_strength': float_cracker,
--- 253,259 ----
'best_cutoff_fp_weight': float_cracker,
},
! 'Classifier': {'max_discriminators': int_cracker,
'robinson_probability_a': float_cracker,
'robinson_probability_x': float_cracker,
'robinson_minimum_prob_strength': float_cracker,
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** classifier.py 27 Sep 2002 21:18:18 -0000 1.21
--- classifier.py 27 Sep 2002 22:29:56 -0000 1.22
***************
*** 1,178 ****
! # This is an implementation of the Bayes-like spam classifier sketched
! # by Paul Graham at http://www.paulgraham.com/spam.html. We say
! # "Bayes-like" because there are many ad hoc deviations from a
! # "normal" Bayesian classifier.
! #
! # This implementation is due to Tim Peters et alia.
!
! import time
! from heapq import heapreplace
! from sets import Set
!
! from Options import options
!
! # The count of each word in ham is artificially boosted by a factor of
! # HAMBIAS, and similarly for SPAMBIAS. Graham uses 2.0 and 1.0. Final
! # results are very sensitive to the HAMBIAS value. On my 5x5 c.l.py
! # test grid with 20,000 hams and 13,750 spams split into 5 pairs, then
! # across all 20 test runs (for each pair, training on that pair then scoring
! # against the other 4 pairs), and counting up all the unique msgs ever
! # identified as false negative or positive, then compared to HAMBIAS 2.0,
! #
! # At HAMBIAS 1.0
! # total unique false positives goes up by a factor of 7.6 ( 23 -> 174)
! # total unique false negatives goes down by a factor of 2 (337 -> 166)
! #
! # At HAMBIAS 3.0
! # total unique false positives goes down by a factor of 4.6 ( 23 -> 5)
! # total unique false negatives goes up by a factor of 2.1 (337 -> 702)
!
! HAMBIAS = options.hambias # 2.0
! SPAMBIAS = options.spambias # 1.0
!
! # "And then there is the question of what probability to assign to words
! # that occur in one corpus but not the other. Again by trial and error I
! # chose .01 and .99.". However, the code snippet clamps *all* probabilities
! # into this range. That's good in principle (IMO), because no finite amount
! # of training data is good enough to justify probabilities of 0 or 1. It
! # may justify probabilities outside this range, though.
! MIN_SPAMPROB = options.min_spamprob # 0.01
! MAX_SPAMPROB = options.max_spamprob # 0.99
!
! # The spam probability assigned to words never seen before. Graham used
! # 0.2 here. Neil Schemenauer reported that 0.5 seemed to work better. In
! # Tim's content-only tests (no headers), boosting to 0.5 cut the false
! # negative rate by over 1/3. The f-p rate increased, but there were so few
! # f-ps that the increase wasn't statistically significant. It also caught
! # 13 more spams erroneously classified as ham. By eyeball (and common
! # sense ), this has most effect on very short messages, where there
! # simply aren't many high-value words. A word with prob 0.5 is (in effect)
! # completely ignored by spamprob(), in favor of *any* word with *any* prob
! # differing from 0.5. At 0.2, an unknown word favors ham at the expense
! # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
! # on the face of it.
! UNKNOWN_SPAMPROB = options.unknown_spamprob # 0.5
!
! # "I only consider words that occur more than five times in total".
! # But the code snippet considers words that appear at least five times.
! # This implementation follows the code rather than the explanation.
! # (In addition, the count compared is after multiplying it with the
! # appropriate bias factor.)
! #
! # Twist: Graham used MINCOUNT=5.0 here. I got rid of it: in effect,
! # given HAMBIAS=2.0, it meant we ignored a possibly perfectly good piece
! # of spam evidence unless it appeared at least 5 times, and ditto for
! # ham evidence unless it appeared at least 3 times. That certainly does
! # bias in favor of ham, but multiple distortions in favor of ham are
! # multiple ways to get confused and trip up. Here are the test results
! # before and after, MINCOUNT=5.0 on the left, no MINCOUNT on the right;
! # ham sets had 4000 msgs (so 0.025% is one msg), and spam sets 2750:
! #
! # false positive percentages
! # 0.000 0.000 tied
! # 0.000 0.000 tied
! # 0.100 0.050 won -50.00%
! # 0.000 0.025 lost +(was 0)
! # 0.025 0.075 lost +200.00%
! # 0.025 0.000 won -100.00%
! # 0.100 0.100 tied
! # 0.025 0.050 lost +100.00%
! # 0.025 0.025 tied
! # 0.050 0.025 won -50.00%
! # 0.100 0.050 won -50.00%
! # 0.025 0.050 lost +100.00%
! # 0.025 0.050 lost +100.00%
! # 0.025 0.000 won -100.00%
! # 0.025 0.000 won -100.00%
! # 0.025 0.075 lost +200.00%
! # 0.025 0.025 tied
! # 0.000 0.000 tied
! # 0.025 0.025 tied
! # 0.100 0.050 won -50.00%
#
! # won 7 times
! # tied 7 times
! # lost 6 times
#
! # total unique fp went from 9 to 13
#
! # false negative percentages
! # 0.364 0.327 won -10.16%
! # 0.400 0.400 tied
! # 0.400 0.327 won -18.25%
! # 0.909 0.691 won -23.98%
! # 0.836 0.545 won -34.81%
! # 0.618 0.291 won -52.91%
! # 0.291 0.218 won -25.09%
! # 1.018 0.654 won -35.76%
! # 0.982 0.364 won -62.93%
! # 0.727 0.291 won -59.97%
! # 0.800 0.327 won -59.13%
! # 1.163 0.691 won -40.58%
! # 0.764 0.582 won -23.82%
! # 0.473 0.291 won -38.48%
! # 0.473 0.364 won -23.04%
! # 0.727 0.436 won -40.03%
! # 0.655 0.436 won -33.44%
! # 0.509 0.218 won -57.17%
! # 0.545 0.291 won -46.61%
! # 0.509 0.254 won -50.10%
#
! # won 19 times
! # tied 1 times
! # lost 0 times
#
! # total unique fn went from 168 to 106
#
! # So dropping MINCOUNT was a huge win for the f-n rate, and a mixed bag
! # for the f-p rate (but the f-p rate was so low compared to 4000 msgs that
! # even the losses were barely significant). In addition, dropping MINCOUNT
! # had a larger good effect when using random training subsets of size 500;
! # this makes intuitive sense, as with less training data it was harder to
! # exceed the MINCOUNT threshold.
#
! # Still, MINCOUNT seemed to be a gross approximation to *something* valuable:
! # a strong clue appearing in 1,000 training msgs is certainly more trustworthy
! # than an equally strong clue appearing in only 1 msg. I'm almost certain it
! # would pay to develop a way to take that into account when scoring. In
! # particular, there was a very specific new class of false positives
! # introduced by dropping MINCOUNT: some c.l.py msgs consisting mostly of
! # Spanish or French. The "high probability" spam clues were innocuous
! # words like "puedo" and "como", that appeared in very rare Spanish and
! # French spam too. There has to be a more principled way to address this
! # than the MINCOUNT hammer, and the test results clearly showed that MINCOUNT
! # did more harm than good overall.
! # The maximum number of words spamprob() pays attention to. Graham had 15
! # here. If there are 8 indicators with spam probabilities near 1, and 7
! # near 0, the math is such that the combined result is near 1. Making this
! # even gets away from that oddity (8 of each allows for graceful ties,
! # which favor ham).
! #
! # XXX That should be revisited. Stripping HTML tags from plain text msgs
! # XXX later addressed some of the same problem cases. The best value for
! # XXX MAX_DISCRIMINATORS remains unknown, but increasing it a lot is known
! # XXX to hurt.
! # XXX Later: tests after cutting this back to 15 showed no effect on the
! # XXX f-p rate, and a tiny shift in the f-n rate (won 3 times, tied 8 times,
! # XXX lost 9 times). There isn't a significant difference, so leaving it
! # XXX at 16.
! #
! # A twist: When staring at failures, it wasn't unusual to see the top
! # discriminators *all* have values of MIN_SPAMPROB and MAX_SPAMPROB. The
! # math is such that one MIN_SPAMPROB exactly cancels out one MAX_SPAMPROB,
! # yielding no info at all. Then whichever flavor of clue happened to reach
! # MAX_DISCRIMINATORS//2 + 1 occurrences first determined the final outcome,
! # based on almost no real evidence.
! #
! # So spamprob() was changed to save lists of *all* MIN_SPAMPROB and
! # MAX_SPAMPROB clues. If the number of those are equal, they're all ignored.
! # Else the flavor with the smaller number of instances "cancels out" the
! # same number of instances of the other flavor, and the remaining instances
! # of the other flavor are fed into the probability computation. This change
! # was a pure win, lowering the false negative rate consistently, and it even
! # managed to tickle a couple rare false positives into "not spam" territory.
! MAX_DISCRIMINATORS = options.max_discriminators # 16
PICKLE_VERSION = 1
--- 1,36 ----
! # An implementation of a Bayes-like spam classifier.
#
! # Paul Graham's original description:
#
! # http://www.paulgraham.com/spam.html
#
! # A highly fiddled version of that can be retrieved from our CVS repository,
! # via tag Last-Graham. This made many demonstrated improvements in error
! # rates over Paul's original description.
#
! # This code implements Gary Robinson's suggestions, which are well explained
! # on his webpage:
#
! # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
#
! # This is theoretically cleaner, and in testing has performed at least as
! # well as our highly tuned Graham scheme did, often slightly better, and
! # sometimes much better. It also has "a middle ground", which people like:
! # the scores under Paul's scheme were almost always very near 0 or very near
! # 1, whether or not the classification was correct. The false positives
! # and false negatives under Gary's scheme generally score in a narrow range
! # around the corpus's best spam_cutoff value.
#
! # This implementation is due to Tim Peters et alia.
+ import time
+ from heapq import heapreplace
+ from sets import Set
! from Options import options
!
! # The maximum number of extreme words to look at in a msg, where "extreme"
! # means with spamprob farthest away from 0.5.
! MAX_DISCRIMINATORS = options.max_discriminators # 150
PICKLE_VERSION = 1
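The cancellation twist described in the removed comment block (one MIN_SPAMPROB clue exactly cancels one MAX_SPAMPROB clue) can be sketched in isolation. `cancel_extremes` is a hypothetical helper, not a function in classifier.py, and it omits the MAX_DISCRIMINATORS cap the real code applies to the survivors:

```python
def cancel_extremes(mins, maxs):
    """Pair off MIN_SPAMPROB clues against MAX_SPAMPROB clues, since
    one of each cancels exactly.  Only the surplus of the more numerous
    flavor survives as evidence; everything else is ignored.
    Returns (survivors, ignored)."""
    shorter, longer = sorted([mins, maxs], key=len)
    surplus = len(longer) - len(shorter)   # clues left after pairing off
    return longer[:surplus], shorter + longer[surplus:]
```

For example, with 2 MIN clues and 5 MAX clues, two pairs cancel and the 3 leading MAX clues carry the evidence.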
***************
*** 273,359 ****
"""
! # A priority queue to remember the MAX_DISCRIMINATORS best
! # probabilities, where "best" means largest distance from 0.5.
! # The tuples are (distance, prob, word, wordinfo[word]).
! nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS
! smallest_best = -1.0
!
! wordinfoget = self.wordinfo.get
! now = time.time()
! mins = [] # all words w/ prob MIN_SPAMPROB
! maxs = [] # all words w/ prob MAX_SPAMPROB
! # Counting a unique word multiple times hurts, although counting one
! # at most two times had some benefit when UNKNOWN_SPAMPROB was 0.2.
! # When that got boosted to 0.5, counting more than once became
! # counterproductive.
! for word in Set(wordstream):
! record = wordinfoget(word)
! if record is None:
! prob = UNKNOWN_SPAMPROB
! else:
! record.atime = now
! prob = record.spamprob
!
! distance = abs(prob - 0.5)
! if prob == MIN_SPAMPROB:
! mins.append((distance, prob, word, record))
! elif prob == MAX_SPAMPROB:
! maxs.append((distance, prob, word, record))
! elif distance > smallest_best:
! # Subtle: we didn't use ">" instead of ">=" just to save
! # calls to heapreplace(). The real intent is that if
! # there are many equally strong indicators throughout the
! # message, we want to favor the ones that appear earliest:
! # it's expected that spam headers will often have smoking
! # guns, and, even when not, spam has to grab your attention
! # early (& note that when spammers generate large blocks of
! # random gibberish to throw off exact-match filters, it's
! # always at the end of the msg -- if they put it at the
! # start, *nobody* would read the msg).
! heapreplace(nbest, (distance, prob, word, record))
! smallest_best = nbest[0][0]
!
! # Compute the probability. Note: This is what Graham's code did,
! # but it's dubious for reasons explained in great detail on Python-
! # Dev: it's missing P(spam) and P(not-spam) adjustments that
! # straightforward Bayesian analysis says should be here. It's
! # unclear how much it matters, though, as the omissions here seem
! # to tend in part to cancel out distortions introduced earlier by
! # HAMBIAS. Experiments will decide the issue.
! clues = []
! # First cancel out competing extreme clues (see comment block at
! # MAX_DISCRIMINATORS declaration -- this is a twist on Graham).
! if mins or maxs:
! if len(mins) < len(maxs):
! shorter, longer = mins, maxs
! else:
! shorter, longer = maxs, mins
! tokeep = min(len(longer) - len(shorter), MAX_DISCRIMINATORS)
! # They're all good clues, but we're only going to feed the tokeep
! # initial clues from the longer list into the probability
! # computation.
! for dist, prob, word, record in shorter + longer[tokeep:]:
! record.killcount += 1
! if evidence:
! clues.append((word, prob))
! for x in longer[:tokeep]:
! heapreplace(nbest, x)
! prob_product = inverse_prob_product = 1.0
! for distance, prob, word, record in nbest:
! if prob is None: # it's one of the dummies nbest started with
! continue
if record is not None: # else wordinfo doesn't know about it
record.killcount += 1
! if evidence:
! clues.append((word, prob))
! prob_product *= prob
! inverse_prob_product *= 1.0 - prob
! prob = prob_product / (prob_product + inverse_prob_product)
if evidence:
! clues.sort(lambda a, b: cmp(a[1], b[1]))
return prob, clues
else:
--- 131,184 ----
"""
! from math import frexp
! # This combination method is due to Gary Robinson; see
! # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
! # The real P = this P times 2**Pexp. Likewise for Q. We're
! # simulating unbounded dynamic float range by hand. If this pans
! # out, *maybe* we should store logarithms in the database instead
! # and just add them here. But I like keeping raw counts in the
! # database (they're easy to understand, manipulate and combine),
! # and there's no evidence that this simulation is a significant
! # expense.
! P = Q = 1.0
! Pexp = Qexp = 0
! clues = self._getclues(wordstream)
! for prob, word, record in clues:
if record is not None: # else wordinfo doesn't know about it
record.killcount += 1
! P *= 1.0 - prob
! Q *= prob
! if P < 1e-200: # move back into range
! P, e = frexp(P)
! Pexp += e
! if Q < 1e-200: # move back into range
! Q, e = frexp(Q)
! Qexp += e
! P, e = frexp(P)
! Pexp += e
! Q, e = frexp(Q)
! Qexp += e
!
! num_clues = len(clues)
! if num_clues:
! #P = 1.0 - P**(1./num_clues)
! #Q = 1.0 - Q**(1./num_clues)
! #
! # (x*2**e)**n = x**n * 2**(e*n)
! n = 1.0 / num_clues
! P = 1.0 - P**n * 2.0**(Pexp * n)
! Q = 1.0 - Q**n * 2.0**(Qexp * n)
!
! prob = (P-Q)/(P+Q) # in -1 .. 1
! prob = 0.5 + prob/2 # shift to 0 .. 1
! else:
! prob = 0.5
if evidence:
! clues.sort()
! clues = [(w, p) for p, w, r in clues]
return prob, clues
else:
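The scaled-products trick in the hunk above can be read on its own. Here is a minimal sketch of the combining step, assuming the per-word probabilities have already been selected; the function name and signature are illustrative, not part of classifier.py:

```python
from math import frexp

def robinson_combine(probs):
    """Combine per-word spam probabilities with Gary Robinson's
    geometric-mean scheme.  P tracks the product of (1-p), Q the
    product of p; frexp() peels the binary exponent off whenever a
    product nears underflow, simulating unbounded float range."""
    if not probs:
        return 0.5
    P = Q = 1.0
    Pexp = Qexp = 0
    for p in probs:
        P *= 1.0 - p
        Q *= p
        if P < 1e-200:              # move back into range
            P, e = frexp(P)
            Pexp += e
        if Q < 1e-200:
            Q, e = frexp(Q)
            Qexp += e
    P, e = frexp(P); Pexp += e
    Q, e = frexp(Q); Qexp += e
    # (x * 2**e)**(1/n) = x**(1/n) * 2**(e/n)
    n = 1.0 / len(probs)
    P = 1.0 - P**n * 2.0**(Pexp * n)
    Q = 1.0 - Q**n * 2.0**(Qexp * n)
    return 0.5 + (P - Q) / (P + Q) / 2.0   # map -1..1 onto 0..1
```

A stream of uniformly neutral clues scores exactly 0.5, and a long stream of extreme clues no longer underflows the running product.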
***************
*** 403,418 ****
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
! for word,record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
! hamcount = min(HAMBIAS * record.hamcount, nham)
! spamcount = min(SPAMBIAS * record.spamcount, nspam)
hamratio = hamcount / nham
spamratio = spamcount / nspam
prob = spamratio / (hamratio + spamratio)
! if prob < MIN_SPAMPROB:
! prob = MIN_SPAMPROB
! elif prob > MAX_SPAMPROB:
! prob = MAX_SPAMPROB
if record.spamprob != prob:
--- 228,257 ----
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
! A = options.robinson_probability_a
! X = options.robinson_probability_x
! AoverX = A/X
! for word, record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
! # This is the Graham calculation, but stripped of biases, and
! # stripped of clamping into 0.01 thru 0.99. The Bayesian
! # adjustment following keeps them in a sane range, and one
! # that naturally grows the more evidence there is to back up
! # a probability.
! hamcount = min(record.hamcount, nham)
hamratio = hamcount / nham
+
+ spamcount = min(record.spamcount, nspam)
spamratio = spamcount / nspam
prob = spamratio / (hamratio + spamratio)
!
! # Now do Robinson's Bayesian adjustment.
! #
! # a + (n * p(w))
! # f(w) = ---------------
! # (a / x) + n
!
! n = hamcount + spamcount
! prob = (A + n * prob) / (AoverX + n)
if record.spamprob != prob:
***************
*** 481,487 ****
pass
- # XXX More stuff should be reworked to use this as a helper function.
def _getclues(self, wordstream):
mindist = options.robinson_minimum_prob_strength
# A priority queue to remember the MAX_DISCRIMINATORS best
--- 320,326 ----
pass
def _getclues(self, wordstream):
mindist = options.robinson_minimum_prob_strength
+ unknown = options.robinson_probability_x
# A priority queue to remember the MAX_DISCRIMINATORS best
***************
*** 496,504 ****
record = wordinfoget(word)
if record is None:
! prob = UNKNOWN_SPAMPROB
else:
record.atime = now
prob = record.spamprob
-
distance = abs(prob - 0.5)
if distance >= mindist and distance > smallest_best:
--- 335,342 ----
record = wordinfoget(word)
if record is None:
! prob = unknown
else:
record.atime = now
prob = record.spamprob
distance = abs(prob - 0.5)
if distance >= mindist and distance > smallest_best:
***************
*** 506,513 ****
smallest_best = nbest[0][0]
! clues = [(prob, word, record)
! for distance, prob, word, record in nbest
! if prob is not None]
! return clues
#************************************************************************
--- 344,349 ----
smallest_best = nbest[0][0]
! # Return (prob, word, record) for the non-dummies.
! return [t[1:] for t in nbest if t[1] is not None]
#************************************************************************
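The fixed-size heap pattern in _getclues can also be sketched standalone. Here `word_probs` stands in for the wordinfo lookups, and the parameter defaults mirror (but are not read from) the options:

```python
from heapq import heapreplace

def best_clues(word_probs, max_discriminators=150, mindist=0.1):
    """Keep the max_discriminators probabilities farthest from 0.5.
    The heap is seeded with dummy entries so heapreplace() can evict
    the weakest clue in O(log n) without ever growing the list."""
    nbest = [(-1.0, None, None)] * max_discriminators
    smallest_best = -1.0
    for word, prob in word_probs.items():
        distance = abs(prob - 0.5)
        if distance >= mindist and distance > smallest_best:
            heapreplace(nbest, (distance, prob, word))
            smallest_best = nbest[0][0]
    # Return (prob, word) for the non-dummies.
    return [(prob, word) for distance, prob, word in nbest
            if word is not None]
```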
***************
*** 518,664 ****
# to only one of the alternatives surviving.
- def robinson_spamprob(self, wordstream, evidence=False):
- """Return best-guess probability that wordstream is spam.
-
- wordstream is an iterable object producing words.
- The return value is a float in [0.0, 1.0].
-
- If optional arg evidence is True, the return value is a pair
- probability, evidence
- where evidence is a list of (word, probability) pairs.
- """
-
- from math import frexp
- mindist = options.robinson_minimum_prob_strength
-
- # A priority queue to remember the MAX_DISCRIMINATORS best
- # probabilities, where "best" means largest distance from 0.5.
- # The tuples are (distance, prob, word, wordinfo[word]).
- nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS
- smallest_best = -1.0
-
- wordinfoget = self.wordinfo.get
- now = time.time()
- for word in Set(wordstream):
- record = wordinfoget(word)
- if record is None:
- prob = UNKNOWN_SPAMPROB
- else:
- record.atime = now
- prob = record.spamprob
-
- distance = abs(prob - 0.5)
- if distance >= mindist and distance > smallest_best:
- heapreplace(nbest, (distance, prob, word, record))
- smallest_best = nbest[0][0]
-
- # Compute the probability.
- clues = []
-
- # This combination method is due to Gary Robinson.
- # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
- # In preliminary tests, it did just as well as Graham's scheme,
- # but creates a definite "middle ground" around 0.5 where false
- # negatives and false positives can actually be found in non-trivial
- # numbers.
-
- # The real P = this P times 2**Pexp. Likewise for Q. We're
- # simulating unbounded dynamic float range by hand. If this pans
- # out, *maybe* we should store logarithms in the database instead
- # and just add them here.
- P = Q = 1.0
- Pexp = Qexp = 0
- num_clues = 0
- for distance, prob, word, record in nbest:
- if prob is None: # it's one of the dummies nbest started with
- continue
- if record is not None: # else wordinfo doesn't know about it
- record.killcount += 1
- if evidence:
- clues.append((word, prob))
- num_clues += 1
- P *= 1.0 - prob
- Q *= prob
- if P < 1e-200: # move back into range
- P, e = frexp(P)
- Pexp += e
- if Q < 1e-200: # move back into range
- Q, e = frexp(Q)
- Qexp += e
-
- P, e = frexp(P)
- Pexp += e
- Q, e = frexp(Q)
- Qexp += e
-
- if num_clues:
- #P = 1.0 - P**(1./num_clues)
- #Q = 1.0 - Q**(1./num_clues)
- #
- # (x*2**e)**n = x**n * 2**(e*n)
- n = 1.0 / num_clues
- P = 1.0 - P**n * 2.0**(Pexp * n)
- Q = 1.0 - Q**n * 2.0**(Qexp * n)
-
- prob = (P-Q)/(P+Q) # in -1 .. 1
- prob = 0.5 + prob/2 # shift to 0 .. 1
- else:
- prob = 0.5
-
- if evidence:
- clues.sort(lambda a, b: cmp(a[1], b[1]))
- return prob, clues
- else:
- return prob
-
- if options.use_robinson_combining:
- spamprob = robinson_spamprob
-
- def robinson_update_probabilities(self):
- """Update the word probabilities in the spam database.
-
- This computes a new probability for every word in the database,
- so can be expensive. learn() and unlearn() update the probabilities
- each time by default. They have an optional argument that allows you
- to skip this step when feeding in many messages, and in that case
- you should call update_probabilities() after feeding the last
- message and before calling spamprob().
- """
-
- nham = float(self.nham or 1)
- nspam = float(self.nspam or 1)
- A = options.robinson_probability_a
- X = options.robinson_probability_x
- AoverX = A/X
- for word, record in self.wordinfo.iteritems():
- # Compute prob(msg is spam | msg contains word).
- # This is the Graham calculation, but stripped of biases, and
- # of clamping into 0.01 thru 0.99.
- hamcount = min(record.hamcount, nham)
- hamratio = hamcount / nham
-
- spamcount = min(record.spamcount, nspam)
- spamratio = spamcount / nspam
-
- prob = spamratio / (hamratio + spamratio)
-
- # Now do Robinson's Bayesian adjustment.
- #
- # a + (n * p(w))
- # f(w) = ---------------
- # (a / x) + n
-
- n = hamcount + spamcount
- prob = (A + n * prob) / (AoverX + n)
-
- if record.spamprob != prob:
- record.spamprob = prob
- # The next seemingly pointless line appears to be a hack
- # to allow a persistent db to realize the record has changed.
- self.wordinfo[word] = record
-
- if options.use_robinson_probability:
- update_probabilities = robinson_update_probabilities
-
def central_limit_compute_population_stats(self, msgstream, is_spam):
from math import ldexp
--- 354,357 ----
***************
*** 745,751 ****
if options.use_central_limit:
spamprob = central_limit_spamprob
-
-
-
def central_limit_compute_population_stats2(self, msgstream, is_spam):
--- 438,441 ----
From montanaro@users.sourceforge.net Fri Sep 27 23:30:25 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Fri, 27 Sep 2002 15:30:25 -0700
Subject: [Spambayes-checkins] spambayes setup.py,1.6,1.7
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29785
Modified Files:
setup.py
Log Message:
add several new scripts and a couple new modules
Index: setup.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/setup.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** setup.py 27 Sep 2002 21:04:06 -0000 1.6
--- setup.py 27 Sep 2002 22:30:23 -0000 1.7
***************
*** 5,8 ****
--- 5,9 ----
scripts=['unheader.py',
'hammie.py',
+ 'hammiesrv.py',
'loosecksum.py',
'timtest.py',
***************
*** 11,18 ****
--- 12,25 ----
'runtest.sh',
'rebal.py',
+ 'HistToGNU.py',
+ 'mboxcount.py',
+ 'mboxtest.py',
+ 'neiltrain.py',
'cmp.py',
'rates.py'],
py_modules=['classifier',
'tokenizer',
+ 'hammie',
+ 'msgs',
'Options',
'Tester',
From npickett@users.sourceforge.net Fri Sep 27 23:38:56 2002
From: npickett@users.sourceforge.net (Neale Pickett)
Date: Fri, 27 Sep 2002 15:38:56 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.25,1.26
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32050
Modified Files:
hammie.py
Log Message:
* PersistentGrahamBayes -> PersistentBayes, reflecting change in
classifier naming.
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** hammie.py 27 Sep 2002 21:18:18 -0000 1.25
--- hammie.py 27 Sep 2002 22:38:53 -0000 1.26
***************
*** 136,140 ****
! class PersistentGrahamBayes(classifier.Bayes):
"""A persistent Bayes classifier.
--- 136,140 ----
! class PersistentBayes(classifier.Bayes):
"""A persistent Bayes classifier.
***************
*** 336,343 ****
def createbayes(pck=DEFAULTDB, usedb=False):
"""Create a Bayes instance for the given pickle (which
! doesn't have to exist). Create a PersistentGrahamBayes if
usedb is True."""
if usedb:
! bayes = PersistentGrahamBayes(pck)
else:
bayes = None
--- 336,343 ----
def createbayes(pck=DEFAULTDB, usedb=False):
"""Create a Bayes instance for the given pickle (which
! doesn't have to exist). Create a PersistentBayes if
usedb is True."""
if usedb:
! bayes = PersistentBayes(pck)
else:
bayes = None
From tim_one@users.sourceforge.net Sat Sep 28 04:41:12 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 27 Sep 2002 20:41:12 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.35,1.36 classifier.py,1.22,1.23
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6007
Modified Files:
Options.py classifier.py
Log Message:
Gary Robinson changed the formula he uses to adjust the Graham
probabilities since we first implemented it. The new formula is
identical to the old in what it computes, but it looks a little different
and is easier to understand. As a result,
robinson_probability_a
no longer exists, and
robinson_probability_s
takes its place (the "s" is for "strength"). If you used non-default
values of a and/or x before, x doesn't change, but you should set
robinson_probability_s
to robinson_probability_a / robinson_probability_x.
For example, before this checkin, the defaults were a=0.225 and x= 0.5.
Now 'a' is gone, and s defaults to 0.225/0.5 = 0.45. Computed results
are identical.
Sorry for the hassle, but Gary's webpage does a very nice job of
explaining this formula, and I really don't want to reword it all for
this project -- keeping an obvious connection between our implementation
and Gary's explanation is worth the disruption.
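The equivalence claimed above is easy to check mechanically: with s = a/x, the new numerator s*x + n*p(w) is exactly a + n*p(w), and the new denominator s + n is exactly a/x + n. A throwaway sketch (not project code) of the two forms:

```python
def old_f(p, n, a=0.225, x=0.5):
    # Pre-checkin form: f(w) = (a + n*p(w)) / (a/x + n)
    return (a + n * p) / (a / x + n)

def new_f(p, n, s=0.45, x=0.5):
    # Post-checkin form: f(w) = (s*x + n*p(w)) / (s + n), with s = a/x
    return (s * x + n * p) / (s + n)
```

With the old defaults a=0.225 and x=0.5, s comes out to 0.45 and the two forms agree to float precision for every p and n; a word never seen before (n=0) scores exactly x.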
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** Options.py 27 Sep 2002 22:29:56 -0000 1.35
--- Options.py 28 Sep 2002 03:41:10 -0000 1.36
***************
*** 179,194 ****
# seen before. Nobody has reported an improvement via moving it away
# from 1/2.
! # "a" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting. At a=0, the counting estimates
# are believed 100%, even to the extent of assigning certainty (0 or 1)
# to a word that's appeared in only ham or only spam. This is a disaster.
! # As "a" tends toward infintity, all probabilities tend toward "x". All
! # reports were that a value near 0.2 worked best, so this doesn't seem to
# be corpus-dependent.
! # XXX Gary Robinson has since renamed "a" to "s", and redone his formulas
! # XXX to make it a measure of belief strength rather than "a number" from
! # XXX 0 to infinity. We haven't caught up to that yet.
! robinson_probability_a: 0.225
robinson_probability_x: 0.5
# When scoring a message, ignore all words with
--- 179,194 ----
# seen before. Nobody has reported an improvement via moving it away
# from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting. At s=0, the counting estimates
# are believed 100%, even to the extent of assigning certainty (0 or 1)
# to a word that's appeared in only ham or only spam. This is a disaster.
! # As s tends toward infinity, all probabilities tend toward x. All
! # reports were that a value near 0.4 worked best, so this doesn't seem to
# be corpus-dependent.
! # NOTE: Gary Robinson previously used a different formula involving 'a'
! # and 'x'. The 'x' here is the same as before. The 's' here is the old
! # 'a' divided by 'x'.
robinson_probability_x: 0.5
+ robinson_probability_s: 0.45
# When scoring a message, ignore all words with
***************
*** 254,259 ****
},
'Classifier': {'max_discriminators': int_cracker,
- 'robinson_probability_a': float_cracker,
'robinson_probability_x': float_cracker,
'robinson_minimum_prob_strength': float_cracker,
--- 254,259 ----
},
'Classifier': {'max_discriminators': int_cracker,
'robinson_probability_x': float_cracker,
+ 'robinson_probability_s': float_cracker,
'robinson_minimum_prob_strength': float_cracker,
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** classifier.py 27 Sep 2002 22:29:56 -0000 1.22
--- classifier.py 28 Sep 2002 03:41:10 -0000 1.23
***************
*** 228,234 ****
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
! A = options.robinson_probability_a
! X = options.robinson_probability_x
! AoverX = A/X
for word, record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
--- 228,233 ----
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
! S = options.robinson_probability_s
! StimesX = S * options.robinson_probability_x
for word, record in self.wordinfo.iteritems():
# Compute prob(msg is spam | msg contains word).
***************
*** 248,257 ****
# Now do Robinson's Bayesian adjustment.
#
! # a + (n * p(w))
! # f(w) = ---------------
! # (a / x) + n
n = hamcount + spamcount
! prob = (A + n * prob) / (AoverX + n)
if record.spamprob != prob:
--- 247,256 ----
# Now do Robinson's Bayesian adjustment.
#
! # s*x + n*p(w)
! # f(w) = --------------
! # s + n
n = hamcount + spamcount
! prob = (StimesX + n * prob) / (S + n)
if record.spamprob != prob:
From tim_one@users.sourceforge.net Sat Sep 28 04:44:17 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 27 Sep 2002 20:44:17 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.17,1.18
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7256
Modified Files:
TestDriver.py
Log Message:
Hist.display(): reduced the # of columns devoted to showing the bucket
boundaries by 1, and added a column to the histogram proper. There are
enough boundary columns remaining to distinguish 1000 buckets, and even
I never use that many.
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** TestDriver.py 27 Sep 2002 21:18:18 -0000 1.17
--- TestDriver.py 28 Sep 2002 03:44:15 -0000 1.18
***************
*** 62,66 ****
return self
! def display(self, WIDTH=60):
from math import sqrt
if self.n > 0:
--- 62,66 ----
return self
! def display(self, WIDTH=61):
from math import sqrt
if self.n > 0:
***************
*** 81,85 ****
ndigits = len(str(biggest))
! format = "%6.2f %" + str(ndigits) + "d"
for i in range(len(self.buckets)):
--- 81,85 ----
ndigits = len(str(biggest))
! format = "%5.1f %" + str(ndigits) + "d"
for i in range(len(self.buckets)):
From tim_one@users.sourceforge.net Sat Sep 28 08:41:16 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 28 Sep 2002 00:41:16 -0700
Subject: [Spambayes-checkins]
spambayes Options.py,1.36,1.37 classifier.py,1.23,1.24
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12459
Modified Files:
Options.py classifier.py
Log Message:
New option
[Classifier]
count_duplicates_only_once_in_training: False
Please try it on your data with True. Because it decreases both ham
and spam mean scores, you'll probably need a smaller spam_cutoff value
too. Various biases in the Graham scheme made this a loser there, but
it may be better under the Robinson scheme. Something I haven't tried:
a smaller value of robinson_probability_s *may* also help when this is
enabled (then again, it may hurt too ...).
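The effect of the new option on training can be sketched with a plain dict standing in for the real WordInfo records; `train` here is illustrative and does not match the Bayes.learn signature:

```python
def train(wordinfo, wordstream, is_spam, dedupe=False):
    """Count one message's words into wordinfo (word -> [ham, spam]).
    With dedupe=True each unique word counts at most once per message,
    mirroring count_duplicates_only_once_in_training."""
    if dedupe:
        wordstream = set(wordstream)   # collapse duplicates first
    idx = 1 if is_spam else 0
    for word in wordstream:
        wordinfo.setdefault(word, [0, 0])[idx] += 1
```

This makes the training-side counting symmetric with scoring, which already ignores duplicate occurrences within a message.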
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** Options.py 28 Sep 2002 03:41:10 -0000 1.36
--- Options.py 28 Sep 2002 07:41:13 -0000 1.37
***************
*** 199,202 ****
--- 199,213 ----
robinson_minimum_prob_strength: 0.1
+ # There's a strange asymmetry in the scheme, where multiple occurrences of
+ # a word in a msg are ignored during scoring, but all add to the spamcount
+ # (or hamcount) during training. This imbalance couldn't be altered without
+ # hurting results under the Graham scheme, but it may well be better to
+ # treat things the same way during training under the Robinson scheme. Set
+ # this to true to try that.
+ # NOTE: In Tim's tests this decreased both the ham and spam mean scores,
+ # the former more than the latter. Therefore you'll probably want a smaller
+ # spam_cutoff value when this is enabled.
+ count_duplicates_only_once_in_training: False
+
###########################################################################
# Speculative options for Gary Robinson's central-limit ideas. These may go
***************
*** 257,260 ****
--- 268,272 ----
'robinson_probability_s': float_cracker,
'robinson_minimum_prob_strength': float_cracker,
+ 'count_duplicates_only_once_in_training': boolean_cracker,
'use_central_limit': boolean_cracker,
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** classifier.py 28 Sep 2002 03:41:10 -0000 1.23
--- classifier.py 28 Sep 2002 07:41:13 -0000 1.24
***************
*** 282,285 ****
--- 282,287 ----
wordinfoget = wordinfo.get
now = time.time()
+ if options.count_duplicates_only_once_in_training:
+ wordstream = Set(wordstream)
for word in wordstream:
record = wordinfoget(word)
***************
*** 304,307 ****
--- 306,311 ----
wordinfoget = self.wordinfo.get
+ if options.count_duplicates_only_once_in_training:
+ wordstream = Set(wordstream)
for word in wordstream:
record = wordinfoget(word)
From gvanrossum@users.sourceforge.net Sat Sep 28 15:39:13 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Sat, 28 Sep 2002 07:39:13 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.28,1.29
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4475
Modified Files:
README.txt
Log Message:
Clarify test data setup.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** README.txt 25 Sep 2002 02:09:52 -0000 1.28
--- README.txt 28 Sep 2002 14:39:11 -0000 1.29
***************
*** 210,213 ****
--- 210,217 ----
reservoir/ (contains "backup ham")
+ Every file at the deepest level is used (not just files with .txt
+ extensions). Every file should have a "Unix From" header before the
+ RFC-822 message (i.e. a line of the form "From ").
+
If you use the same names and structure, huge mounds of the tedious testing
code will work as-is. The more Set directories the merrier, although you
From nascheme@users.sourceforge.net Sat Sep 28 19:48:33 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Sat, 28 Sep 2002 11:48:33 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.37,1.38
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8755
Modified Files:
Options.py
Log Message:
Remove mine_message_ids option since it shouldn't hurt to always have it
enabled.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** Options.py 28 Sep 2002 07:41:13 -0000 1.37
--- Options.py 28 Sep 2002 18:48:31 -0000 1.38
***************
*** 93,99 ****
mine_received_headers: False
- # If set, the Message-Id is broken down into, hopefully, useful evidence.
- mine_message_ids: False
-
[TestDriver]
# These control various displays in class TestDriver.Driver, and Tester.Test.
--- 93,96 ----
***************
*** 238,242 ****
'count_all_header_lines': boolean_cracker,
'mine_received_headers': boolean_cracker,
- 'mine_message_ids': boolean_cracker,
'check_octets': boolean_cracker,
'octet_prefix_size': int_cracker,
--- 235,238 ----
From nascheme@users.sourceforge.net Sat Sep 28 19:48:54 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Sat, 28 Sep 2002 11:48:54 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.41,1.42
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8826
Modified Files:
tokenizer.py
Log Message:
Remove mine_message_ids option since it shouldn't hurt to always have it
enabled.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.41
retrieving revision 1.42
diff -C2 -d -r1.41 -r1.42
*** tokenizer.py 27 Sep 2002 04:06:12 -0000 1.41
--- tokenizer.py 28 Sep 2002 18:48:52 -0000 1.42
***************
*** 984,996 ****
yield 'received:' + tok
! if options.mine_message_ids:
! msgid = msg.get("message-id", "")
! m = message_id_re.match(msgid)
! if not m:
! # might be weird instead of invalid but who cares?
! yield 'message-id:invalid'
! else:
! # looks okay, return the hostname only
! yield 'message-id:@%s' % m.group(1)
# As suggested by Anthony Baxter, merely counting the number of
--- 984,997 ----
yield 'received:' + tok
! # Message-Id: This seems to be a small win and should not
! # adversely affect a mixed-source corpus, so it's always enabled.
! msgid = msg.get("message-id", "")
! m = message_id_re.match(msgid)
! if m:
! # looks okay, return the hostname
! yield 'message-id:@%s' % m.group(1)
! else:
! # might be weird instead of invalid but who cares?
! yield 'message-id:invalid'
# As suggested by Anthony Baxter, merely counting the number of
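The now-unconditional Message-Id mining above boils down to one token per message. A sketch, where the regex is a loose stand-in for tokenizer.py's actual message_id_re (the real pattern may be stricter):

```python
import re

# Loose stand-in for tokenizer.py's message_id_re: capture the host
# part of a Message-Id like "<local@host>".  Illustrative only.
message_id_re = re.compile(r'\s*<[^@>]+@([^>]+)>\s*$')

def message_id_token(msgid):
    """Produce the single token the hunk above emits for a
    Message-Id header value."""
    m = message_id_re.match(msgid)
    if m:
        return 'message-id:@%s' % m.group(1)   # looks okay: hostname only
    return 'message-id:invalid'                # weird or missing
```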
From gvanrossum@users.sourceforge.net Sat Sep 28 19:50:53 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Sat, 28 Sep 2002 11:50:53 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.29,1.30
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv9407
Modified Files:
README.txt
Log Message:
Clarify Unix From lines in test messages -- they're optional.
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.29
retrieving revision 1.30
diff -C2 -d -r1.29 -r1.30
*** README.txt 28 Sep 2002 14:39:11 -0000 1.29
--- README.txt 28 Sep 2002 18:50:51 -0000 1.30
***************
*** 133,137 ****
===================
cleanarch
! A script to repair mbox archives by finding "From" lines that
should have been escaped, and escaping them.
--- 133,137 ----
===================
cleanarch
! A script to repair mbox archives by finding "Unix From" lines that
should have been escaped, and escaping them.
***************
*** 211,216 ****
Every file at the deepest level is used (not just files with .txt
! extensions). Every file should have a "Unix From" header before the
! RFC-822 message (i.e. a line of the form "From ").
If you use the same names and structure, huge mounds of the tedious testing
--- 211,217 ----
Every file at the deepest level is used (not just files with .txt
! extensions). The files may, but don't need to, have a "Unix From"
! header before the RFC-822 message (i.e. a line of the form
! "From ").
If you use the same names and structure, huge mounds of the tedious testing
From richiehindle@users.sourceforge.net Sat Sep 28 23:24:25 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Sat, 28 Sep 2002 15:24:25 -0700
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.4,1.5
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3162
Modified Files:
pop3proxy.py
Log Message:
Improved the timeout code to cope with long delays from the real POP3 server (having an ISP with dodgy POP3 servers is really helping to improve the robustness of pop3proxy.py - I should really add Demon Internet to the credits).
Prevented the self-test code from printing the X-Hammie-Disposition headers, because under the ultra-simple test case they come out as No for both the test ham and the test spam. That doesn't matter because it's only their existence that's being tested for, but a casual observer might think something was broken.
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** pop3proxy.py 23 Sep 2002 21:20:10 -0000 1.4
--- pop3proxy.py 28 Sep 2002 22:24:22 -0000 1.5
***************
*** 114,118 ****
return len(args) == 0
else:
! # Assume that unknown commands will get an error response.
return False
--- 114,119 ----
return len(args) == 0
else:
! # Assume that an unknown command will get a single-line
! # response. This should work for errors and for POP-AUTH.
return False
***************
*** 121,134 ****
(response, isClosing, timedOut). isClosing is True if the
server closes the socket, which tells found_terminator() to
! close when the response has been sent. timedOut is set if the
! request was still arriving after 30 seconds, and tells
! found_terminator() to proxy the remainder of the response.
"""
! isClosing = False
! timedOut = False
startTime = time.time()
isMulti = self.isMultiline(command, args)
! responseLines = []
isFirstLine = True
while True:
line = self.serverFile.readline()
--- 122,136 ----
(response, isClosing, timedOut). isClosing is True if the
server closes the socket, which tells found_terminator() to
! close when the response has been sent. timedOut is set if a
! TOP or RETR request was still arriving after 30 seconds, and
! tells found_terminator() to proxy the remainder of the response.
"""
! responseLines = []
startTime = time.time()
isMulti = self.isMultiline(command, args)
! isClosing = False
! timedOut = False
isFirstLine = True
+ seenAllHeaders = False
while True:
line = self.serverFile.readline()
***************
*** 148,155 ****
# A normal line - append it to the response and carry on.
responseLines.append(line)
! # Time out after 30 seconds - found_terminator() knows how
# to deal with this.
! if time.time() > startTime + 30:
timedOut = True
break
--- 150,160 ----
# A normal line - append it to the response and carry on.
responseLines.append(line)
+ seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n']
! # Time out after 30 seconds for message-retrieval commands
! # if all the headers are down - found_terminator() knows how
# to deal with this.
! if command in ['TOP', 'RETR'] and \
! seenAllHeaders and time.time() > startTime + 30:
timedOut = True
break
***************
*** 544,548 ****
response = proxy.recv(100)
count, totalSize = map(int, response.split()[1:3])
- print "%d messages in test mailbox" % count
assert count == 2
--- 549,552 ----
***************
*** 554,562 ****
while response.find('\n.\r\n') == -1:
response = response + proxy.recv(1000)
! headerOffset = response.find(hammie.DISPHEADER)
! assert headerOffset != -1
! headerEnd = headerOffset + len(HEADER_EXAMPLE)
! header = response[headerOffset:headerEnd].strip()
! print "Message %d: %s" % (i, header)
# Kill the proxy and the test server.
--- 558,562 ----
while response.find('\n.\r\n') == -1:
response = response + proxy.recv(1000)
! assert response.find(hammie.DISPHEADER) != -1
# Kill the proxy and the test server.
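For reference, the timeout logic this checkin introduces can be sketched
standalone (function and parameter names here are illustrative; the real
code lives inside pop3proxy.py's response reader):

```python
import time

def read_response(readline, command, timeout=30):
    """Collect a multiline POP3 response, timing out long TOP/RETR
    transfers once all the headers have arrived.  A sketch of the
    logic in pop3proxy.py 1.5; returns (response, timed_out)."""
    start = time.time()
    lines = []
    seen_all_headers = False
    timed_out = False
    while True:
        line = readline()
        if not line:                 # server closed the connection
            break
        lines.append(line)
        # A blank line separates the headers from the body.
        seen_all_headers = seen_all_headers or line in ('\r\n', '\n')
        if line == '.\r\n':          # end of a multiline response
            break
        if (command in ('TOP', 'RETR') and seen_all_headers
                and time.time() > start + timeout):
            timed_out = True         # caller proxies the rest verbatim
            break
    return ''.join(lines), timed_out
```

Gating the timeout on `seen_all_headers` means a slow server can never
truncate the headers the classifier needs; only the body transfer is
handed back to plain proxying.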
From nascheme@users.sourceforge.net Sun Sep 29 05:14:39 2002
From: nascheme@users.sourceforge.net (Neil Schemenauer)
Date: Sat, 28 Sep 2002 21:14:39 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,1.42,1.43
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11632
Modified Files:
tokenizer.py
Log Message:
Mine the To and Cc headers. This is another definite win for me. I'm not
sure about the log2 trick but it seems to work okay.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.42
retrieving revision 1.43
diff -C2 -d -r1.42 -r1.43
*** tokenizer.py 28 Sep 2002 18:48:52 -0000 1.42
--- tokenizer.py 29 Sep 2002 04:14:36 -0000 1.43
***************
*** 8,11 ****
--- 8,12 ----
import email.Errors
import re
+ import math
from sets import Set
***************
*** 771,774 ****
--- 772,778 ----
yield '.'.join(parts[:i])
+ def log2(n, log=math.log, c=math.log(2)):
+ return log(n)/c
+
uuencode_begin_re = re.compile(r"""
^begin \s+
***************
*** 963,966 ****
--- 967,980 ----
for t in tokenize_word(w):
yield prefix + t
+
+ # To:
+ # Cc:
+ # Count the number of addresses in each of the recipient headers.
+ for field in ('to', 'cc'):
+ count = 0
+ for addrs in msg.get_all(field, []):
+ count += len(addrs.split(','))
+ if count > 0:
+ yield '%s:2**%d' % (field, round(log2(count)))
# These headers seem to work best if they're not tokenized: just
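The recipient-count mining above can be sketched standalone (assuming a
msg object with an email.Message-style get_all(); token format matches
the checkin):

```python
import math

def log2(n, log=math.log, c=math.log(2)):
    return log(n) / c

def recipient_tokens(msg):
    """Yield one bucketed token per recipient header, e.g. 'to:2**2'
    for roughly four To: addresses.  A sketch of tokenizer.py 1.43."""
    for field in ('to', 'cc'):
        count = 0
        for addrs in msg.get_all(field, []):
            count += len(addrs.split(','))
        if count > 0:
            # Bucket by the nearest power of two so rare exact counts
            # don't produce a forest of one-off tokens.
            yield '%s:2**%d' % (field, round(log2(count)))
```

The log2 bucketing is the whole trick: a message to 3 recipients and one
to 5 land in the same '2**2' bucket, so the classifier sees coarse
"how many recipients" evidence instead of sparse exact counts.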
From tim.one@comcast.net Sun Sep 29 18:00:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 29 Sep 2002 13:00:05 -0400
Subject: [Spambayes-checkins] Checkin notification is hosed
Message-ID:
SourceForge apparently can't connect to python.org:
Checking in rebal.py;
/cvsroot/spambayes/spambayes/rebal.py,v <-- rebal.py
new revision: 1.8; previous revision: 1.7
done
Mailing spambayes-checkins@python.org...
Generating notification message...
Generating notification message... done.
Mailing spambayes-checkins@python.org...
Generating notification message...
Traceback (innermost last):
File "/cvsroot/spambayes/CVSROOT/syncmail", line 336, in ?
main()
File "/cvsroot/spambayes/CVSROOT/syncmail", line 329, in main
blast_mail(subject, people, specs[1:], contextlines, fromhost)
File "/cvsroot/spambayes/CVSROOT/syncmail", line 227, in blast_mail
conn.connect(MAILHOST, MAILPORT)
File "/usr/lib/python1.5/smtplib.py", line 216, in connect
self.sock.connect(host, port)
socket.error: (111, 'Connection refused')
rebal.py now has a -d (dry run) option: If you specify -d, rebal will
display how many files it's going to move, from where and to where, but
won't actually move anything.
From tim.one@comcast.net Sun Sep 29 19:08:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 29 Sep 2002 14:08:05 -0400
Subject: [Spambayes-checkins] RE: [Spambayes] On counting words more than once
In-Reply-To: <200209291437.g8TEbf809551@pcp02138704pcs.reston01.va.comcast.net>
Message-ID:
SF still isn't able to mail checkin notifications.
Because Neil, Guido and I all reported improvement via counting duplicate
words (within a message) only once during training, I removed the recent
option for trying this, and we do this all the time now. The checkin
comment is below. Note that you may need to change spam_cutoff!
"""
Removed option count_duplicates_only_once_in_training: this is always
done now. Counting duplicate words in a msg more than once during
training appears to have been helpful under the Graham scheme only because
it acted to counteract other biases.
Under Robinson's unbiased scheme, results improve by counting duplicates
only once during training (just as duplicates are counted only once during
scoring), the ham score mean decreases significantly and consistently,
likewise ham score variance, the spam score mean decreases consistently
(but less than the ham mean decreased, so the spread increases), and spam
score variance increases. That implies there's *some* value to be gotten
out of knowing how often a word appears in a msg, but that distorting
spamprob isn't the right way to exploit it.
WordInfo.hamcount now has a different meaning: it's the number of hams in
which the word appears, instead of the number of times the word appears
across all ham. Likewise for WordInfo.spamcount.
Note that because both mean scores decreased, you'll probably want a
smaller spam_cutoff value now. The default spam_cutoff has been changed
from 0.57 to 0.56. But this is corpus-dependent, so be sure to tune your
value for your corpus.
"""
From tim.one@comcast.net Sun Sep 29 21:34:58 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 29 Sep 2002 16:34:58 -0400
Subject: [Spambayes-checkins] Another change
In-Reply-To:
Message-ID:
Change checked in to tokenizer.py:
tokenize_headers(): Based on a silly experiment that *only* tokenized
Subject lines, added a gimmick here to generate tokens for runs of
punctuation characters (\W+) in subject lines.
-> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto 19 times]
false positive percentages
0.050 0.000 won -100.00%
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
won 1 times
tied 9 times
lost 0 times
total unique fp went from 3 to 2 won -33.33%
mean fp % went from 0.015 to 0.01 won -33.33%
false negative percentages
0.071 0.071 tied
0.071 0.071 tied
0.000 0.000 tied
0.143 0.143 tied
0.143 0.143 tied
0.214 0.214 tied
0.143 0.143 tied
0.143 0.143 tied
0.214 0.214 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fn went from 16 to 16 tied
mean fn % went from 0.114285714286 to 0.114285714286 tied
ham mean ham sdev
25.74 25.65 -0.35% 5.74 5.67 -1.22%
25.69 25.61 -0.31% 5.56 5.50 -1.08%
25.64 25.57 -0.27% 5.74 5.67 -1.22%
25.74 25.66 -0.31% 5.61 5.54 -1.25%
25.50 25.42 -0.31% 5.78 5.72 -1.04%
25.58 25.51 -0.27% 5.44 5.39 -0.92%
25.73 25.65 -0.31% 5.63 5.59 -0.71%
25.69 25.61 -0.31% 5.47 5.41 -1.10%
25.92 25.84 -0.31% 5.54 5.48 -1.08%
25.90 25.81 -0.35% 5.88 5.81 -1.19%
ham mean and sdev for all runs
25.71 25.63 -0.31% 5.64 5.58 -1.06%
spam mean spam sdev
84.07 83.86 -0.25% 7.10 7.09 -0.14%
83.83 83.64 -0.23% 6.84 6.83 -0.15%
83.46 83.27 -0.23% 6.80 6.81 +0.15%
84.03 83.82 -0.25% 6.88 6.88 +0.00%
84.08 83.89 -0.23% 6.68 6.65 -0.45%
83.96 83.78 -0.21% 6.99 6.96 -0.43%
83.62 83.42 -0.24% 6.84 6.82 -0.29%
84.04 83.86 -0.21% 6.71 6.71 +0.00%
84.08 83.88 -0.24% 7.01 6.98 -0.43%
83.97 83.75 -0.26% 6.65 6.65 +0.00%
spam mean and sdev for all runs
83.91 83.72 -0.23% 6.85 6.84 -0.15%
ham/spam mean difference: 58.20 58.09 -0.11
This is consistent but weak. Staring at the false negatives shows
that it's moving them "in the right direction", though, and histogram
analysis says something stronger:
-> best cutoff for all runs: 0.55
-> with weighted total 10*2 fp + 11 fn = 31
-> fp rate 0.01% fn rate 0.0786%
That is, if I had run at spam_cutoff 0.55 instead of 0.56, it would
have been a pure win, leaving f-p alone but dropping 5(!) of the f-n.
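The punctuation-run gimmick can be sketched as follows (a hypothetical
standalone version; the exact token prefix used in tokenizer.py may
differ):

```python
import re

def punctuation_run_tokens(subject):
    """Yield a token for each run of non-word characters (\\W+) in a
    Subject line, e.g. 'subject:!!!'.  A sketch of the change
    described above; the token format is illustrative."""
    for run in re.findall(r'\W+', subject):
        yield 'subject:' + run
```

Runs like '!!!' or '$$$' are common in spam subjects but rare in ham,
which is why even this weak signal nudges the false negatives in the
right direction.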
From anthonybaxter@users.sourceforge.net Mon Sep 30 05:02:33 2002
From: anthonybaxter@users.sourceforge.net (Anthony Baxter)
Date: Sun, 29 Sep 2002 21:02:33 -0700
Subject: [Spambayes-checkins] website related.ht,1.1.1.1,1.2
Message-ID:
Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv4280
Modified Files:
related.ht
Log Message:
added PASP.
Index: related.ht
===================================================================
RCS file: /cvsroot/spambayes/website/related.ht,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** related.ht 19 Sep 2002 08:40:55 -0000 1.1.1.1
--- related.ht 30 Sep 2002 04:02:31 -0000 1.2
***************
*** 11,14 ****
--- 11,15 ----
- Eric Raymond's bogofilter, a C code bayesian filter.
- ifile, a Naive Bayes classification system.
+
- PASP, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification.
- ...
From richiehindle@users.sourceforge.net Mon Sep 30 21:13:42 2002
From: richiehindle@users.sourceforge.net (Richie Hindle)
Date: Mon, 30 Sep 2002 13:13:42 -0700
Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.5,1.6
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32034
Modified Files:
pop3proxy.py
Log Message:
Use options.spam_cutoff instead of hammie.SPAM_THRESHOLD - the
latter is far too high under the new default scoring scheme (I've sent a
separate heads-up to Neale about this).
Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** pop3proxy.py 28 Sep 2002 22:24:22 -0000 1.5
--- pop3proxy.py 30 Sep 2002 20:13:39 -0000 1.6
***************
*** 37,40 ****
--- 37,41 ----
import socket, asyncore, asynchat
import classifier, tokenizer, hammie
+ from Options import options
HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER
***************
*** 344,348 ****
# it's been classified.
prob = self.bayes.spamprob(tokenizer.tokenize(message))
! if prob >= hammie.SPAM_THRESHOLD:
disposition = "Yes"
else:
--- 345,349 ----
# it's been classified.
prob = self.bayes.spamprob(tokenizer.tokenize(message))
! if prob > options.spam_cutoff:
disposition = "Yes"
else:
From montanaro@users.sourceforge.net Mon Sep 30 22:56:29 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Mon, 30 Sep 2002 14:56:29 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.39,1.40
tokenizer.py,1.45,1.46
Message-ID:
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8971
Modified Files:
Options.py tokenizer.py
Log Message:
allow users to disable the long word skip tokens (e.g. "skip:c 70") under the
assumption that people who do receive mail which contains attachments will be
penalized.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** Options.py 29 Sep 2002 18:03:39 -0000 1.39
--- Options.py 30 Sep 2002 21:56:27 -0000 1.40
***************
*** 93,96 ****
--- 93,102 ----
mine_received_headers: False
+ # If your ham corpus is generated from sources which contain few, if any
+ # attachments you probably want to leave this alone. If you have many
+ # legitimate correspondents who send you attachments (Excel spreadsheets,
+ # etc), you might want to set this to False.
+ generate_long_skips: True
+
[TestDriver]
# These control various displays in class TestDriver.Driver, and Tester.Test.
***************
*** 223,226 ****
--- 229,233 ----
'safe_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
+ 'generate_long_skips': boolean_cracker,
'mine_received_headers': boolean_cracker,
'check_octets': boolean_cracker,
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.45
retrieving revision 1.46
diff -C2 -d -r1.45 -r1.46
*** tokenizer.py 29 Sep 2002 20:20:57 -0000 1.45
--- tokenizer.py 30 Sep 2002 21:56:27 -0000 1.46
***************
*** 645,649 ****
# XXX Figure out why, and/or see if some other way of summarizing
# XXX this info has greater benefit.
! yield "skip:%c %d" % (word[0], n // 10 * 10)
if has_highbit_char(word):
hicount = 0
--- 645,650 ----
# XXX Figure out why, and/or see if some other way of summarizing
# XXX this info has greater benefit.
! if options.generate_long_skips:
! yield "skip:%c %d" % (word[0], n // 10 * 10)
if has_highbit_char(word):
hicount = 0
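The skip-token generation being gated here can be sketched standalone
(hypothetical function name; the real code is inline in tokenizer.py's
word tokenizer):

```python
def skip_token(word, generate_long_skips=True):
    """Summarize an over-long 'word' (e.g. a base64 line) as a token
    like 'skip:c 70': its first character plus its length rounded
    down to a multiple of ten.  Returns None when the option is off.
    A sketch of tokenizer.py 1.46."""
    if generate_long_skips:
        n = len(word)
        return 'skip:%c %d' % (word[0], n // 10 * 10)
    return None
```

Rounding the length down to a multiple of ten buckets similar monsters
together, so "a 73-character blob starting with c" and "a 78-character
blob starting with c" yield the same token.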
From tim.one@comcast.net Mon Sep 30 23:07:04 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 30 Sep 2002 18:07:04 -0400
Subject: [Spambayes-checkins] spambayes
Options.py,1.39,1.40tokenizer.py,1.45,1.46
In-Reply-To:
Message-ID:
[Skip Montanaro]
> allow users to disable the long word skip tokens (e.g. "skip:c
> 70") under the assumption that people who do receive mail which
> contains attachments will be penalized.
Skip, what is your reasoning here? We ignore attachments entirely unless
they have text/* type. I don't see what skip tokens have to do with this.
Besides, I named those tokens after you.