From tim_one@users.sourceforge.net  Thu Sep  5 21:17:34 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 13:17:34 -0700
Subject: [Spambayes-checkins] spambayes README.txt,NONE,1.1
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25267

Added Files:
	README.txt
Log Message:
Some sorely needed clues.

--- NEW FILE: README.txt ---
Assorted clues.


What's Here?
============

Lots of mondo cool undocumented code.  What else could there be?

The focus of this project so far has not been to produce the fastest or
smallest filters, but to set up a flexible pure-Python implementation
for doing algorithm research.  Lots of people are making fast/small
implementations, and it takes an entirely different kind of effort to
make genuine algorithm improvements.  I think we've done quite well at
that so far.  The focus of this codebase may change to small/fast later
-- as is, the false positive rate has gotten too small to measure
reliably across test sets with 4000 hams + 2750 spams, but the false
negative rate is still over 1%.


Primary Files
=============

classifier.py
    An implementation of a Graham-like classifier.

Tester.py
    A test-driver class that feeds streams of msgs to a classifier
    instance, and keeps track of right/wrong percentages, and lists of
    false positives and false negatives.

timtest.py
    A concrete test driver and tokenizer that uses Tester and classifier
    (above).  This assumes "a standard" test data setup (see below).
    Could stand massive refactoring.

GBayes.py
    A number of tokenizers and a partial test driver.  This assumes an
    mbox format.  Could stand massive refactoring.  I don't think it's
    been kept up to date.


Test Data Utilities
===================

rebal.py
    Evens out the number of messages in "standard" test data folders
    (see below).

cleanarch
    A script to repair mbox archives by finding "From" lines that
    should have been escaped, and escaping them.

mboxcount.py
    Count the number of messages (both parseable and unparseable) in
    mbox archives.

split.py
splitn.py
    Split an mbox into random pieces in various ways.  Tim recommends
    using "the standard" test data setup instead (see below).


Standard Test Data Setup
========================

Barry gave me mboxes, but the spam corpus I got off the web had one spam
per file, and it only took two days of extreme pain to realize that one
msg per file is enormously easier to work with when testing:  you want
to split these at random into random collections, you may need to
replace some at random when testing reveals spam mistakenly called ham
(and vice versa), etc -- even pasting examples into email is much easier
when it's one msg per file (and the test driver makes it easy to print a
msg's file path).

The directory structure under my spambayes directory looks like so:

Data/
    Spam/
        Set1/ (contains 2750 spam .txt files)
        Set2/    ""
        Set3/    ""
        Set4/    ""
        Set5/    ""
    Ham/
        Set1/ (contains 4000 ham .txt files)
        Set2/    ""
        Set3/    ""
        Set4/    ""
        Set5/    ""
        reservoir/ (contains "backup ham")

If you use the same names and structure, huge mounds of the tedious
testing code will work as-is.  The more Set directories the merrier,
although you'll hit a point of diminishing returns if you exceed 10.

The "reservoir" directory contains a few thousand other random hams.
When a ham is found that's really spam, I delete it, and then the
rebal.py utility moves in a message at random from the reservoir to
replace it.  If I had it to do over again, I think I'd move such spam
into a Spam set (chosen at random), instead of deleting it.
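
For concreteness, that replacement step amounts to something like the
following minimal sketch (this is an illustration, not rebal.py itself;
the directory names follow the layout above, and the function name is
mine):

    import os
    import random

    def replace_deleted_ham(setdir, reservoir='Data/Ham/reservoir'):
        # Pick a backup ham at random and move it into the Set
        # directory that just lost a message, keeping set sizes even.
        name = random.choice(os.listdir(reservoir))
        os.rename(os.path.join(reservoir, name),
                  os.path.join(setdir, name))
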
The hams are 20,000 msgs selected at random from a python-list archive.
The spams are essentially all of Bruce Guenter's 2002 spam archive:

The sets are grouped into 5 pairs in the obvious way:  Spam/Set1 with
Ham/Set1, and so on.  For each such pair, timtest trains a classifier
on that pair, then runs predictions on each of the other 4 pairs.  In
effect, it's a 5x5 test grid, skipping the diagonal.  There's no
particular reason to avoid predicting against the same set trained on,
except that it takes more time and seems the least interesting thing
to try.

From tim_one@users.sourceforge.net  Thu Sep  5 21:55:04 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 13:55:04 -0700
Subject: [Spambayes-checkins] spambayes TESTING.txt,NONE,1.1
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7401

Added Files:
	TESTING.txt
Log Message:
Adapted more python-dev msgs into clue form.

--- NEW FILE: TESTING.txt ---
[Clues about the practice of statistical testing, adapted from Tim's
comments on python-dev.]

Combining pairs of words is called "word bigrams".  My intuition at the
start was that it would do better.  OTOH, my intuition also was that
character n-grams for a relatively large n would do better still.  The
latter may be so for "foreign" languages, but for this particular task,
using Graham's scheme on the c.l.py tests, it turns out they sucked.  A
comment block in timtest.py explains why.  I didn't try word bigrams
because the f-p rate is already supernaturally low, so there doesn't
seem anything left to be gained there.  This echoes what Graham sez on
his web page:

    One idea that I haven't tried yet is to filter based on word
    pairs, or even triples, rather than individual words.  This
    should yield a much sharper estimate of the probability.

My comment with benefit of hindsight:  it doesn't.  Because the scoring
scheme throws away everything except about a dozen extremes, the
"probabilities" that come out are almost always very near 0 or very
near 1; only very short or very bland msgs (or, especially, both) come
out in between.  This outcome is largely independent of the tokenization
scheme -- the scoring scheme forces it, provided only that the
tokenization scheme produces stuff *some* of which *does* vary in
frequency between spam and ham.

    For example, in my current database, the word "offers" has a
    probability of .96.  If you based the probabilities on word
    pairs, you'd end up with "special offers" and "valuable offers"
    having probabilities of .99 and, say, "approach offers" (as in
    "this approach offers") having a probability of .1 or less.

The theory is indeed appealing.

    The reason I haven't done this is that filtering based on
    individual words already works so well.

Which is also the reason I didn't pursue it.

    But it does mean that there is room to tighten the filters if
    spam gets harder to detect.

I expect it would also need a different scoring scheme then.

OK, I ran a full test using word bigrams.  It gets one strike against
it at the start because the database size grows by a factor between 2
and 3.  That's only justified if the results are better.
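
Before the numbers, here's a minimal sketch of what word bigrams amount
to (not the tokenizer actually used in this test run -- plain
whitespace splitting and the artificial 'BOM' start token are
simplifications here; a BOM token of this kind shows up again in the
examples below):

    def word_bigrams(text):
        # Yield adjacent word pairs; 'BOM' stands in for the start of
        # the message so the first real word has a partner.
        prev = 'BOM'
        for word in text.lower().split():
            yield prev + ' ' + word
            prev = word
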
Before-and-after f-p (false positive) percentages:

   before  bigrams
    0.000   0.025
    0.000   0.025
    0.050   0.050
    0.000   0.025
    0.025   0.050
    0.025   0.100
    0.050   0.075
    0.025   0.025
    0.025   0.050
    0.000   0.025
    0.075   0.050
    0.050   0.000
    0.025   0.050
    0.000   0.025
    0.050   0.075
    0.025   0.025
    0.025   0.025
    0.000   0.000
    0.025   0.050
    0.050   0.025

Lost on 12 runs
Tied on  5 runs
Won on   3 runs

total # of unique fps across all runs rose from 8 to 17

The f-n percentages on the same runs:

   before  bigrams
    1.236   1.091
    1.164   1.091
    1.454   1.708
    1.599   1.563
    1.527   1.491
    1.236   1.127
    1.163   1.345
    1.309   1.309
    1.891   1.927
    1.418   1.382
    1.745   1.927
    1.708   1.963
    1.491   1.782
    0.836   0.800
    1.091   1.127
    1.309   1.309
    1.491   1.709
    1.127   1.018
    1.309   1.018
    1.636   1.672

Lost on 9 runs
Tied on 2 runs
Won on  9 runs

total # of unique fns across all runs rose from 336 to 350

This doesn't need deep analysis:  it costs more, and on the face of it
either doesn't help, or helps so little it's not worth the cost.

Now I'll tell you in confidence that the way to make a scheme like this
excellent is to keep your ego out of it and let the data *tell* you
what works:  getting the best test setup you can is the most important
thing you can possibly do.  It must include multiple training and test
corpora (e.g., if I had used only one pair, I would have had a 3/20
chance of erroneously concluding that bigrams might help the f-p rate,
when running across 20 pairs shows that they almost certainly do it
harm; while I would have had an even chance of drawing a wrong
conclusion -- in either direction -- about the effect on the f-n rate).
The second most important thing is to run a fat test all the way to the
end before concluding anything.

A subtler point is that you should never keep a change that doesn't
*prove* itself a winner:  neutral changes bloat your code with proven
irrelevancies that will come back to make your life harder later, in
part because they'll randomly interfere with future changes in ways
that make it harder to recognize a significant change when you stumble
into one.

Most things you try won't help -- indeed, many of them will deliver
worse results.  I dare say my intuition for this kind of classification
task is better than most programmers' (in part because I had years of
professional experience in a related field), and most of the things I
tried I had to throw away.  BFD -- then you try something else.  When I
find something that works I can rationalize it, but when I try
something that doesn't, no amount of argument can change that the data
said it sucked.

Two things about *this* task have fooled me repeatedly:

1. The "only look at smoking guns" nature of the scoring step makes
   many kinds of "on average" intuitions worthless:  "on average"
   almost everything is thrown away!  For example, you're not going to
   find bad results reported for n-grams (neither character- nor
   word-based) in the literature, in part because most scoring schemes
   throw much less away.  Graham's scheme strikes me as brilliant in
   this specific respect:  it's worth enduring the ego humiliation to
   get such a spectacularly low f-p rate from such simple and fast
   code.  Graham's assumption that the spam-vs-ham distinction should
   be *easy* pays off big.

2. Most mailing-list messages are much shorter than this one.  This
   systematically frustrates "well, averaged over enough words"
   intuitions too.

Cute:  In particular, word bigrams systematically hate conference
announcements.  The current word one-gram scheme hated them too, until
I started folding case.  Then their SCREAMING stopped acting against
them.
But they're still using the language of advertisement, and word bigrams
can't help but notice that more strongly than individual words do.

Here from the TOOLS Europe '99 announcement:

    prob('more information') = 0.916003
    prob('web site') = 0.895518
    prob('please write') = 0.99
    prob('you wish') = 0.984494
    prob('our web') = 0.985578
    prob('visit our') = 0.99

Here from the XP2001 - FINAL CALL FOR PAPERS:

    prob('web site:') = 0.926174
    prob('receive this') = 0.945813
    prob('you receive') = 0.987542
    prob('most exciting') = 0.99
    prob('alberta, canada') = 0.99
    prob('e-mail to:') = 0.99

Here from the XP2002 - CALL FOR PRACTITIONER'S REPORTS ('BOM' is an
artificial token I made up for "beginning of message", to give
something for the first word in the message to pair up with):

    prob('web site:') = 0.926174
    prob('this announcement') = 0.94359
    prob('receive this') = 0.945813
    prob('forward this') = 0.99
    prob('e-mail to:') = 0.99
    prob('BOM *****') = 0.99
    prob('you receive') = 0.987542

Here from the TOOLS Europe 2000 announcement:

    prob('visit the') = 0.96
    prob('you receive') = 0.967805
    prob('accept our') = 0.99
    prob('our apologies') = 0.99
    prob('quality and') = 0.99
    prob('receive more') = 0.99
    prob('asia and') = 0.99

A vanilla f-p showing where bigrams can hurt was a short msg about
setting up a Python user's group.  Bigrams gave it large penalties for
phrases like "fully functional" (most often seen in spams for bootleg
software, but here applied to the proposed user group's web site -- and
"web site" is also a strong spam indicator!).  OTOH, the poster also
said "Aahz rocks".  As a bigram, that neither helped nor hurt (that
2-word phrase is unique in the corpus); but as an individual word,
"Aahz" is a strong non-spam indicator on c.l.py (and will probably
remain so until he starts spamming).

It did find one spam hiding in a ham corpus:

"""
NNTP-Posting-Host: 212.64.45.236
Newsgroups: comp.lang.python,comp.lang.rexx
Date: Thu, 21 Oct 1999 10:18:52 -0700
Message-ID: <67821AB23987D311ADB100A0241979E5396955@news.ykm.com>
From: znblrn@hetronet.com
Subject: Rudolph The Rednose Hooters Here
Lines: 4
Path: news!uunet!ffx.uu.net!newsfeed.fast.net!howland.erols.net!newsfeed.cwix.com!news.cfw.com!paxfeed.eni.net!DAIPUB.DataAssociatesInc..com
Xref: news comp.lang.python:74468 comp.lang.rexx:31946
To: python-list@python.org

THis IS it: The site where they talk about when you are 50 years old.

http://huizen.dds.nl/~jansen20
"""

there's-no-substitute-for-experiment-except-drugs-ly y'rs  - tim

Other points:

+ Something I didn't do but should have:  keep a detailed log of every
  experiment run, and of the results you got.  The only clues about
  dozens of experiments with the current code are in brief "XXX"
  comment blocks, and a bunch of test results were lost when we dropped
  the old checkin comments on the way to moving this code to
  SourceForge.

+ Every time you check in an algorithmic change that proved to be a
  winner, in theory you should also reconsider every previous change.
  You really can't guess whether, e.g., tokenization changes are all
  independent of each other, or whether some reinforce others in
  helpful ways.  In practice there's not enough time to reconsider
  everything every time, but do make a habit of reconsidering
  *something* each time you've had a success.  Nothing is sacred except
  the results in the end, and heresy can pay; every decision remains
  suspect forever.

+ Any sufficiently general scheme with enough free parameters can
  eventually be trained to recognize any specific dataset exactly.
  It's wonderful if other people test your changes against other
  datasets too.  That's hard to arrange, so at least change your own
  data periodically.  I'm suspicious that some of the weirder "proven
  winner" changes I've made are really specific to statistical
  anomalies in my test data; and as the error rates get closer to 0%,
  the chance that a winning change helped only a few specific msgs
  zooms (of course sometimes that's intentional!  I haven't been shy
  about adding changes specifically geared toward squashing very narrow
  classes of false positives).

From tim_one@users.sourceforge.net  Fri Sep  6 00:34:43 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 16:34:43 -0700
Subject: [Spambayes-checkins] spambayes rates.py,NONE,1.1 README.txt,1.1,1.2
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv8648

Modified Files:
	README.txt
Added Files:
	rates.py
Log Message:
Checking in one of the helper scripts I use to analyze test output.

--- NEW FILE: rates.py ---
"""
rates.py basename

Assuming that file basename + '.txt' contains output from timtest.py,
scans that file for summary statistics, displays them to stdout, and
also writes them to file basename + 's.txt' (where the 's' means
'summary').

This doesn't need a full output file, and will display stuff for as far
as the output file has gotten so far.

Two of these summary files can later be fed to cmp.py.
"""

import re
import sys

"""
Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
    false positive: 0.025
    false negative: 1.34545454545
    new false positives: ['Data/Ham/Set2/66645.txt']
"""

pat1 = re.compile(r'\s*Training on Data/').match
pat2 = re.compile(r'\s+false (positive|negative): (.*)').match
pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match

def doit(basename):
    ifile = file(basename + '.txt')
    oname = basename + 's.txt'
    ofile = file(oname, 'w')
    print basename, '->', oname

    def dump(*stuff):
        msg = ' '.join(map(str, stuff))
        print msg
        print >> ofile, msg

    nfn = nfp = 0
    ntrainedham = ntrainedspam = 0
    for line in ifile:
        "Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams"
        m = pat1(line)
        if m:
            dump(line[:-1])
            fields = line.split()
            ntrainedham += int(fields[-5])
            ntrainedspam += int(fields[-2])
            continue

        "false positive: 0.025"
        "false negative: 1.34545454545"
        m = pat2(line)
        if m:
            kind, guts = m.groups()
            guts = float(guts)
            if kind == 'positive':
                lastval = guts
            else:
                dump('    %7.3f %7.3f' % (lastval, guts))
            continue

        "new false positives: ['Data/Ham/Set2/66645.txt']"
        m = pat3(line)
        if m:
            # note that it doesn't match at all if the list is "[]"
            kind, guts = m.groups()
            n = len(guts.split())
            if kind == 'positives':
                nfp += n
            else:
                nfn += n

    dump('total false pos', nfp, nfp * 1e2 / ntrainedham)
    dump('total false neg', nfn, nfn * 1e2 / ntrainedspam)

for name in sys.argv[1:]:
    doit(name)

Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** README.txt	5 Sep 2002 20:17:31 -0000	1.1
--- README.txt	5 Sep 2002 23:34:41 -0000	1.2
***************
*** 38,41 ****
--- 38,48 ----
+ Test Utilities
+ ==============
+ rates.py
+     Scans the output (so far) from timtest.py, and captures summary
+     statistics.
+ 
+ 
  Test Data Utilities
  ===================

From tim_one@users.sourceforge.net  Fri Sep  6 00:42:55 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 16:42:55 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,NONE,1.1 README.txt,1.2,1.3
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11162

Modified Files:
	README.txt
Added Files:
	cmp.py
Log Message:
Checking in the script I use to produce listings of changes in f-p and
f-n rates between two test runs.

--- NEW FILE: cmp.py ---
"""
cmp.py sbase1 sbase2

Combines output from sbase1.txt and sbase2.txt, which are created by
rates.py from timtest.py output, and displays comparison statistics to
stdout.
"""

import sys
f1n, f2n = sys.argv[1:3]

NSETS = 5

# Return
#  (list of all f-p rates,
#   list of all f-n rates,
#   total f-p,
#   total f-n)
# from summary file f.
def suck(f):
    fns = []
    fps = []
    for block in range(NSETS):
        # Skip, e.g.,
        # Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
        f.readline()
        for inner in range(NSETS - 1):
            # A line with an f-p rate and an f-n rate.
            p, n = map(float, f.readline().split())
            fps.append(p)
            fns.append(n)

    # "total false pos 8 0.04"
    # "total false neg 249 1.81090909091"
    fptot = int(f.readline().split()[-2])
    fntot = int(f.readline().split()[-2])
    return fps, fns, fptot, fntot

def dump(p1s, p2s):
    alltags = ""
    for p1, p2 in zip(p1s, p2s):
        if p1 < p2:
            tag = "lost"
        elif p1 > p2:
            tag = "won"
        else:
            tag = "tied"
        print " %5.3f %5.3f %s" % (p1, p2, tag)
        alltags += tag + " "
    print
    for tag in "won", "tied", "lost":
        print "%-4s %2d %s" % (tag, alltags.count(tag), "times")
    print

fp1, fn1, fptot1, fntot1 = suck(file(f1n + '.txt'))
fp2, fn2, fptot2, fntot2 = suck(file(f2n + '.txt'))

print f1n, '->', f2n

print
print "false positive percentages"
dump(fp1, fp2)
print "total unique fp went from", fptot1, "to", fptot2

print
print "false negative percentages"
dump(fn1, fn2)
print "total unique fn went from", fntot1, "to", fntot2

Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** README.txt	5 Sep 2002 23:34:41 -0000	1.2
--- README.txt	5 Sep 2002 23:42:52 -0000	1.3
***************
*** 44,47 ****
--- 44,52 ----
      statistics.
  
+ cmp.py
+     Given two summary files produced by rates.py, displays an account
+     of all the f-p and f-n rates side-by-side, along with who won
+     which (etc), and the change in total # of f-ps and f-ns.
+ 
  Test Data Utilities

From tim_one@users.sourceforge.net  Fri Sep  6 00:51:34 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 16:51:34 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.1,1.2
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13855

Modified Files:
	timtest.py
Log Message:
Pure win for the f-n rate:  take X-Mailer into account.
false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.075  lost
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   0 times
tied 19 times
lost  1 times

total unique fp went from 8 to 8

false negative percentages
    0.691  0.582  won
    0.655  0.618  won
    0.945  0.836  won
    1.309  1.236  won
    1.164  1.018  won
    0.800  0.764  won
    0.763  0.691  won
    1.163  1.054  won
    1.345  1.236  won
    1.127  1.018  won
    1.345  1.236  won
    1.490  1.418  won
    0.909  0.764  won
    0.582  0.473  won
    0.691  0.509  won
    1.163  0.945  won
    1.018  0.945  won
    0.873  0.727  won
    0.909  0.764  won
    1.127  0.981  won

won  20 times
tied  0 times
lost  0 times

total unique fn went from 249 to 226

Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** timtest.py	5 Sep 2002 16:16:43 -0000	1.1
--- timtest.py	5 Sep 2002 23:51:32 -0000	1.2
***************
*** 508,513 ****
      # From:
      # Reply-To:
!     # X-Mailer:
!     for field in ('from',):#  'reply-to', 'x-mailer',):
          prefix = field + ':'
          subj = msg.get(field, '-None-')
--- 508,512 ----
      # From:
      # Reply-To:
!     for field in ('from',):#  'reply-to',):
          prefix = field + ':'
          subj = msg.get(field, '-None-')
***************
*** 515,518 ****
--- 514,526 ----
          for t in tokenize_word(w):
              yield prefix + t
+ 
+     # These headers seem to work best if they're not tokenized:  just
+     # normalize case and whitespace.
+     # X-Mailer:  This is a pure and significant win for the f-n rate;
+     #            f-p rate isn't affected.
+     for field in ('x-mailer',):
+         prefix = field + ':'
+         subj = msg.get(field, '-None-')
+         yield prefix + ' '.join(subj.lower().split())
  
      # Organization:

From tim_one@users.sourceforge.net  Fri Sep  6 01:10:53 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 17:10:53 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.2,1.3
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv21652

Modified Files:
	timtest.py
Log Message:
Added a note about why User-Agent is skipped.

Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** timtest.py	5 Sep 2002 23:51:32 -0000	1.2
--- timtest.py	6 Sep 2002 00:10:51 -0000	1.3
***************
*** 519,522 ****
--- 519,527 ----
      # X-Mailer:  This is a pure and significant win for the f-n rate;
      #            f-p rate isn't affected.
+     # User-Agent:  Skipping it, as it made no difference.  Very few
+     #              spams had a User-Agent field, but lots of hams
+     #              didn't either, and the spam probability of
+     #              User-Agent was very close to 0.5 (== not a valuable
+     #              discriminator) across all training sets.
      for field in ('x-mailer',):
          prefix = field + ':'

From tim_one@users.sourceforge.net  Fri Sep  6 05:25:47 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 21:25:47 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.1,1.2
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29991

Modified Files:
	cmp.py
Log Message:
Added a %-changed column.
Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** cmp.py	5 Sep 2002 23:42:52 -0000	1.1
--- cmp.py	6 Sep 2002 04:25:45 -0000	1.2
***************
*** 37,54 ****
      return fps, fns, fptot, fntot
  
  def dump(p1s, p2s):
      alltags = ""
      for p1, p2 in zip(p1s, p2s):
!         if p1 < p2:
!             tag = "lost"
!         elif p1 > p2:
!             tag = "won"
!         else:
!             tag = "tied"
!         print " %5.3f %5.3f %s" % (p1, p2, tag)
!         alltags += tag + " "
      print
!     for tag in "won", "tied", "lost":
!         print "%-4s %2d %s" % (tag, alltags.count(tag), "times")
      print
--- 37,61 ----
      return fps, fns, fptot, fntot
  
+ def tag(p1, p2):
+     if p1 == p2:
+         t = "tied"
+     else:
+         t = p1 < p2 and "lost " or "won "
+         if p1:
+             p = (p2 - p1) * 100.0 / p1
+             t += " %+7.2f%%" % p
+         else:
+             t += " +(was 0)"
+     return t
+ 
  def dump(p1s, p2s):
      alltags = ""
      for p1, p2 in zip(p1s, p2s):
!         t = tag(p1, p2)
!         print " %5.3f %5.3f %s" % (p1, p2, t)
!         alltags += t + " "
      print
!     for t in "won", "tied", "lost":
!         print "%-4s %2d %s" % (t, alltags.count(t), "times")
      print

From tim_one@users.sourceforge.net  Fri Sep  6 05:41:16 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Thu, 05 Sep 2002 21:41:16 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.3,1.4
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv685

Modified Files:
	timtest.py
Log Message:
Generated tokens for:
    Content-Type
        and its type= param
    Content-Disposition
        and its filename= param
    Content-Transfer-Encoding
    all the charsets

This has huge benefit for the f-n rate, and virtually none on the f-p
rate, although it does reduce the variance of the f-p rate across
different training sets (really marginal msgs, like a brief HTML msg
saying just "unsubscribe me", are almost always tagged as spam now;
before they were right on the edge, and now the multipart/alternative
pushes them over it more consistently).

XXX I put all of this in as one chunk.  I don't know which parts are
XXX most effective; it could be that some parts don't help at all.  But
XXX given the nature of the c.l.py tests, it's not surprising that the
XXX     'content-type:text/html'
XXX token is now the single most powerful spam indicator (== makes it
XXX into the nbest list most often).  What *is* a little surprising is
XXX that this doesn't push more mixed-type msgs into the f-p camp --
XXX unlike looking at *all* HTML tags, this is just one spam indicator
XXX instead of dozens, so relevant msg content can cancel it out.
false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.075  0.100  lost    +33.33%
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.100  lost   +100.00%
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.075  0.100  lost    +33.33%
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.100  lost   +100.00%

won   0 times
tied 16 times
lost  4 times

total unique fp went from 8 to 9

false negative percentages
    0.582  0.364  won     -37.46%
    0.618  0.400  won     -35.28%
    0.836  0.400  won     -52.15%
    1.236  0.909  won     -26.46%
    1.018  0.836  won     -17.88%
    0.764  0.618  won     -19.11%
    0.691  0.291  won     -57.89%
    1.054  1.018  won      -3.42%
    1.236  0.982  won     -20.55%
    1.018  0.727  won     -28.59%
    1.236  0.800  won     -35.28%
    1.418  1.163  won     -17.98%
    0.764  0.764  tied
    0.473  0.473  tied
    0.509  0.473  won      -7.07%
    0.945  0.727  won     -23.07%
    0.945  0.655  won     -30.69%
    0.727  0.509  won     -29.99%
    0.764  0.545  won     -28.66%
    0.981  0.509  won     -48.11%

won  18 times
tied  2 times
lost  0 times

total unique fn went from 226 to 168

Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** timtest.py	6 Sep 2002 00:10:51 -0000	1.3
--- timtest.py	6 Sep 2002 04:41:13 -0000	1.4
***************
*** 477,480 ****
--- 477,531 ----
          yield "skip:%c %d" % (word[0], n // 10 * 10)
  
+ # Generate tokens for:
+ #    Content-Type
+ #        and its type= param
+ #    Content-Disposition
+ #        and its filename= param
+ #    Content-Transfer-Encoding
+ #    all the charsets
+ #
+ # This has huge benefit for the f-n rate, and virtually none on the
+ # f-p rate, although it does reduce the variance of the f-p rate
+ # across different training sets (really marginal msgs, like a brief
+ # HTML msg saying just "unsubscribe me", are almost always tagged as
+ # spam now; before they were right on the edge, and now the
+ # multipart/alternative pushes them over it more consistently).
+ #
+ # XXX I put all of this in as one chunk.  I don't know which parts are
+ # XXX most effective; it could be that some parts don't help at all.
+ # XXX But given the nature of the c.l.py tests, it's not surprising
+ # XXX that the
+ # XXX     'content-type:text/html'
+ # XXX token is now the single most powerful spam indicator (== makes
+ # XXX it into the nbest list most often).  What *is* a little
+ # XXX surprising is that this doesn't push more mixed-type msgs into
+ # XXX the f-p camp -- unlike looking at *all* HTML tags, this is just
+ # XXX one spam indicator instead of dozens, so relevant msg content
+ # XXX can cancel it out.
+ 
+ def crack_content_xyz(msg):
+     x = msg.get_type()
+     if x is not None:
+         yield 'content-type:' + x.lower()
+ 
+     x = msg.get_param('type')
+     if x is not None:
+         yield 'content-type/type:' + x.lower()
+ 
+     for x in msg.get_charsets(None):
+         if x is not None:
+             yield 'charset:' + x.lower()
+ 
+     x = msg.get('content-disposition')
+     if x is not None:
+         yield 'content-disposition:' + x.lower()
+ 
+     fname = msg.get_filename()
+     if fname is not None:
+         for x in fname.lower().split('/'):
+             for y in x.split('.'):
+                 yield 'filename:' + y
+ 
+     x = msg.get('content-transfer-encoding:')
+     if x is not None:
+         yield 'content-transfer-encoding:' + x.lower()
+ 
  def tokenize(string):
      # Create an email Message object.
***************
*** 493,502 ****
  
      # XXX where "safe" is specific to my sorry corpora.
  
      # Subject:
      # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
      # especially significant in this context.  Experiment showed a small
      # but real benefit to keeping case intact in this specific context.
!     subj = msg.get('subject', '')
!     for w in subject_word_re.findall(subj):
          for t in tokenize_word(w):
              yield 'subject:' + t
--- 544,560 ----
  
      # XXX where "safe" is specific to my sorry corpora.
  
+     # Content-{Transfer-Encoding, Type, Disposition} and their params.
+     t = ''
+     for x in msg.walk():
+         for w in crack_content_xyz(x):
+             yield t + w
+         t = '>'
+ 
      # Subject:
      # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
      # especially significant in this context.  Experiment showed a small
      # but real benefit to keeping case intact in this specific context.
!     x = msg.get('subject', '')
!     for w in subject_word_re.findall(x):
          for t in tokenize_word(w):
              yield 'subject:' + t
***************
*** 510,515 ****
      for field in ('from',):#  'reply-to',):
          prefix = field + ':'
!         subj = msg.get(field, '-None-')
!         for w in subj.lower().split():
              for t in tokenize_word(w):
                  yield prefix + t
--- 568,573 ----
      for field in ('from',):#  'reply-to',):
          prefix = field + ':'
!         x = msg.get(field, 'none').lower()
!         for w in x.split():
              for t in tokenize_word(w):
                  yield prefix + t
***************
*** 526,531 ****
      for field in ('x-mailer',):
          prefix = field + ':'
!         subj = msg.get(field, '-None-')
!         yield prefix + ' '.join(subj.lower().split())
  
      # Organization:
--- 584,589 ----
      for field in ('x-mailer',):
          prefix = field + ':'
!         x = msg.get(field, 'none').lower()
!         yield prefix + ' '.join(x.split())
  
      # Organization:

From tim_one@users.sourceforge.net  Fri Sep  6 18:12:51 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 10:12:51 -0700
Subject: [Spambayes-checkins] spambayes timtest.py,1.4,1.5
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4345

Modified Files:
	timtest.py
Log Message:
Included commented-out code for Anthony Baxter's mondo cool "count the
# of headers" idea.

Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** timtest.py	6 Sep 2002 04:41:13 -0000	1.4
--- timtest.py	6 Sep 2002 17:12:49 -0000	1.5
***************
*** 595,598 ****
--- 595,618 ----
          yield "bool:noorg"
  
+     # XXX Following is a great idea due to Anthony Baxter.  I can't
+     # XXX use it on my test data because the header lines are so
+     # XXX different between my ham and spam that it makes a large
+     # XXX improvement for bogus reasons.  So it's commented out.  But
+     # XXX it's clearly a good thing to do on "normal" data, and
+     # XXX subsumes the Organization trick above in a much more general
+     # XXX way, yet at comparable cost.
+     ### X-UIDL:
+     ### Anthony Baxter's idea.  This has spamprob 0.99!  The value is
+     ### clearly irrelevant, just the presence or absence matters.
+     ### However, it's extremely rare in my spam sets, so doesn't have
+     ### much value.
+     ###
+     ### As also suggested by Anthony, we can capture all such header
+     ### oddities just by generating tags for the count of how many
+     ### times each header field appears.
+     ##x2n = {}
+     ##for x in msg.keys():
+     ##    x2n[x] = x2n.get(x, 0) + 1
+     ##for x in x2n.items():
+     ##    yield "header:%s:%d" % x
+ 
      # Find, decode (base64, qp), and tokenize the textual parts of the body.
      for part in textparts(msg):

From tim_one@users.sourceforge.net  Fri Sep  6 18:33:28 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 10:33:28 -0700
Subject: [Spambayes-checkins] spambayes timtoken.py,NONE,1.1 README.txt,1.3,1.4 timtest.py,1.5,1.6
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10727

Modified Files:
	README.txt timtest.py
Added Files:
	timtoken.py
Log Message:
Split all knowledge of tokenization out of timtest.py and into a new
timtoken.py.  You can use any tokenize() function you like now.

--- NEW FILE: timtoken.py ---
import re
import email
from email import message_from_string
from sets import Set

__all__ = ['tokenize']

# Find all the text components of the msg.  There's no point decoding
# binary blobs (like images).  If a multipart/alternative has both plain
# text and HTML versions of a msg, ignore the HTML part:  HTML
# decorations have monster-high spam probabilities, and innocent
# newbies often post using HTML.
def textparts(msg):
    text = Set()
    redundant_html = Set()
    for part in msg.walk():
        if part.get_content_type() == 'multipart/alternative':
            # Descend this part of the tree, adding any redundant HTML
            # text part to redundant_html.
            htmlpart = textpart = None
            stack = part.get_payload()
            while stack:
                subpart = stack.pop()
                ctype = subpart.get_content_type()
                if ctype == 'text/plain':
                    textpart = subpart
                elif ctype == 'text/html':
                    htmlpart = subpart
                elif ctype == 'multipart/related':
                    stack.extend(subpart.get_payload())

            if textpart is not None:
                text.add(textpart)
                if htmlpart is not None:
                    redundant_html.add(htmlpart)
            elif htmlpart is not None:
                text.add(htmlpart)

        elif part.get_content_maintype() == 'text':
            text.add(part)

    return text - redundant_html

##############################################################################
# To fold case or not to fold case?  I didn't want to fold case, because
# it hides information in English, and I have no idea what .lower() does
# to other languages; and, indeed, 'FREE' (all caps) turned out to be one
# of the strongest spam indicators in my content-only tests (== one with
# prob 0.99 *and* made it into spamprob's nbest list very often).
#
# Against preserving case, it makes the database size larger, and requires
# more training data to get enough "representative" mixed-case examples.
#
# Running my c.l.py tests didn't support my intuition that case was
# valuable, so it's getting folded away now.  Folding or not made no
# significant difference to the false positive rate, and folding made a
# small (but statistically significant all the same) reduction in the
# false negative rate.  There is one obvious difference:  after folding
# case, conference announcements no longer got high spam scores.  Their
# content was usually fine, but they were highly penalized for VISIT OUR
# WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING.  That is
# indeed the language of advertising, and I halfway regret that folding
# away case no longer picks on them.
#
# Since the f-p rate didn't change, but conference announcements escaped
# that category, something else took their place.  It seems to be highly
# off-topic messages, like debates about Microsoft's place in the world.
# Talk about "money" and "lucrative" is indistinguishable now from talk
# about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.


##############################################################################
# Character n-grams or words?
#
# With careful multiple-corpora c.l.py tests sticking to case-folded decoded
# text-only portions, and ignoring headers, and with identical special
# parsing & tagging of embedded URLs:
#
# Character 3-grams gave 5x as many false positives as split-on-whitespace
# (s-o-w).  The f-n rate was also significantly worse, but within a factor
# of 2.  So character 3-grams lost across the board.
#
# Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
# s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
# difference between 23 and 34 f-ps.  There aren't enough there to say that's
# significantly more with killer-high confidence.  There were plenty of f-ns,
# though, and the f-n rate with character 5-grams was substantially *worse*
# than with character 3-grams (which in turn was substantially worse than
# with s-o-w).
#
# Training on character 5-grams creates many more unique tokens than s-o-w:
# a typical run bloated to 150MB process size.  It also ran a lot slower than
# s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
# dict.  I rarely noticed disk activity when running s-o-w, so rarely
# bothered to look at process size; it was under 30MB last time I looked.
#
# Figuring out *why* a msg scored as it did proved much more mysterious when
# working with character n-grams:  they often had no obvious "meaning".  In
# contrast, it was always easy to figure out what s-o-w was picking up on.
# 5-grams flagged a msg from Christian Tismer as spam, where he was
# discussing the speed of tasklets under his new implementation of
# stackless:
#
#     prob = 0.99999998959
#     prob('ed sw') = 0.01
#     prob('http0:pgp') = 0.01
#     prob('http0:python') = 0.01
#     prob('hlon ') = 0.99
#     prob('http0:wwwkeys') = 0.01
#     prob('http0:starship') = 0.01
#     prob('http0:stackless') = 0.01
#     prob('n xp ') = 0.99
#     prob('on xp') = 0.99
#     prob('p 150') = 0.99
#     prob('lon x') = 0.99
#     prob(' amd ') = 0.99
#     prob(' xp 1') = 0.99
#     prob(' athl') = 0.99
#     prob('1500+') = 0.99
#     prob('xp 15') = 0.99
#
# The spam decision was baffling until I realized that *all* the high-
# probability spam 5-grams there came out of a single phrase:
#
#     AMD Athlon XP 1500+
#
# So Christian was punished for using a machine lots of spam tries to
# sell.  In a classic Bayesian classifier, this probably wouldn't have
# mattered, but Graham's throws away almost all the 5-grams from a msg,
# saving only the about-a-dozen farthest from a neutral 0.5.  So one bad
# phrase can kill you!  This appears to happen very rarely, but happened
# more than once.
#
# The conclusion is that character n-grams have almost nothing to recommend
# them under Graham's scheme:  harder to work with, slower, much larger
# database, worse results, and prone to rare mysterious disasters.
#
# There's one area they won hands-down:  detecting spam in what I assume are
# Asian languages.  The s-o-w scheme sometimes finds only line-ends to split
# on then, and then a "hey, this 'word' is way too big!  let's ignore it"
# gimmick kicks in, and produces no tokens at all.
#
# [Later:  we produce character 5-grams then under the s-o-w scheme, instead
# of ignoring the blob, but only if there are high-bit characters in the
# blob; e.g., there's no point 5-gramming uuencoded lines, and doing so
# would bloat the database size.]
#
# Interesting:  despite that odd example above, the *kinds* of f-p mistakes
# 5-grams made were very much like s-o-w made -- I recognized almost all of
# the 5-gram f-p messages from previous s-o-w runs.  For example, both
# schemes have a particular hatred for conference announcements, although
# s-o-w stopped hating them after folding case.  But 5-grams still hate
# them.  Both schemes also hate msgs discussing HTML with examples, with
# about equal passion.  Both schemes hate brief "please subscribe
# [unsubscribe] me" msgs, although 5-grams seems to hate them more.


##############################################################################
# How to tokenize?
#
# I started with string.split() merely for speed.  Over time I realized it
# was making interesting context distinctions qualitatively akin to n-gram
# schemes; e.g., "free!!" is a much stronger spam indicator than "free".
# But unlike n-grams (whether word- or character-based) under Graham's
# scoring scheme, this mild context dependence never seems to go over the
# edge in giving "too much" credence to an unlucky phrase.
#
# OTOH, compared to "searching for words", it increases the size of the
# database substantially, less than but close to a factor of 2.  This is
# very much less than a word bigram scheme bloats it, but as always an
# increase isn't justified unless the results are better.
#
# Following are stats comparing
#
#     for token in text.split():  # left column
#
# to
#
#     for token in re.findall(r"[\w$\-\x80-\xff]+", text):  # right column
#
# text is case-normalized (text.lower()) in both cases, and the runs were
# identical in all other respects.  The results clearly favor the split()
# gimmick, although they vaguely suggest that some sort of compromise
# may do as well with less database burden; e.g., *perhaps* folding runs of
# "punctuation" characters into a canonical representative could do that.
# But the database size is reasonable without that, and plain split()
# avoids having to worry about how to "fold punctuation" in languages
# other than English.
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.050  lost
#     0.050  0.150  lost
#     0.000  0.025  lost
#     0.025  0.050  lost
#     0.025  0.075  lost
#     0.050  0.150  lost
#     0.025  0.000  won
#     0.025  0.075  lost
#     0.000  0.025  lost
#     0.075  0.150  lost
#     0.050  0.050  tied
#     0.025  0.050  lost
#     0.000  0.025  lost
#     0.050  0.025  won
#     0.025  0.000  won
#     0.025  0.025  tied
#     0.000  0.025  lost
#     0.025  0.075  lost
#     0.050  0.175  lost
#
# won   3 times
# tied  3 times
# lost 14 times
#
# total unique fp went from 8 to 20
#
# false negative percentages
#     0.945  1.200  lost
#     0.836  1.018  lost
#     1.200  1.200  tied
#     1.418  1.636  lost
#     1.455  1.418  won
#     1.091  1.309  lost
#     1.091  1.272  lost
#     1.236  1.563  lost
#     1.564  1.855  lost
#     1.236  1.491  lost
#     1.563  1.599  lost
#     1.563  1.781  lost
#     1.236  1.709  lost
#     0.836  0.982  lost
#     0.873  1.382  lost
#     1.236  1.527  lost
#     1.273  1.418  lost
#     1.018  1.273  lost
#     1.091  1.091  tied
#     1.490  1.454  won
#
# won   2 times
# tied  2 times
# lost 16 times
#
# total unique fn went from 292 to 302


##############################################################################
# What about HTML?
#
# Computer geeks seem to view use of HTML in mailing lists and newsgroups
# as a mortal sin.  Normal people don't, but so it goes:  in a technical
# list/group, every HTML decoration has spamprob 0.99, there are lots of
# unique HTML decorations, and lots of them appear at the very start of
# the message so that Graham's scoring scheme latches on to them tight.
# As a result, any plain text message just containing an HTML example is
# likely to be judged spam (every HTML decoration is an extreme).
#
# So if a message is multipart/alternative with both text/plain and
# text/html branches, we ignore the latter, else newbies would never get a
# message through.  If a message is just HTML, it has virtually no chance
# of getting through.
#
# In an effort to let normal people use mailing lists too, and to
# alleviate the woes of messages merely *discussing* HTML practice, I
# added a gimmick to strip HTML tags after case-normalization and after
# special tagging of embedded URLs.  This consisted of a regexp sub
# pattern, where instances got replaced by single blanks:
#
#     html_re = re.compile(r"""
#         <
#         [^\s<>]     # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
#         [^>]{0,128} # search for the end '>', but don't chew up the world
#         >
#     """, re.VERBOSE)
#
# and then
#
#     text = html_re.sub(' ', text)
#
# Alas, little good came of this:
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.000  tied
#     0.050  0.075  lost
#     0.000  0.000  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.000  0.050  lost
#     0.075  0.100  lost
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.000  0.025  lost
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.000  0.000  tied
#     0.025  0.050  lost
#     0.050  0.050  tied
#
# won   0 times
# tied 15 times
# lost  5 times
#
# total unique fp went from 8 to 12
#
# false negative percentages
#     0.945  1.164  lost
#     0.836  1.418  lost
#     1.200  1.272  lost
#     1.418  1.272  won
#     1.455  1.273  won
#     1.091  1.382  lost
#     1.091  1.309  lost
#     1.236  1.381  lost
#     1.564  1.745  lost
#     1.236  1.564  lost
#     1.563  1.781  lost
#     1.563  1.745  lost
#     1.236  1.455  lost
#     0.836  0.982  lost
#     0.873  1.309  lost
#     1.236  1.381  lost
#     1.273  1.273  tied
#     1.018  1.273  lost
#     1.091  1.200  lost
#     1.490  1.599  lost
#
# won   2 times
# tied  1 times
# lost 17 times
#
# total unique fn went from 292 to 327
#
# The messages merely discussing HTML were no longer fps, so it did what
# it intended there.  But the f-n rate nearly doubled on at least one run
# -- so strong a set of spam indicators is the mere presence of HTML.  The
# increase in the number of fps despite that the HTML-discussing msgs left
# that category remains mysterious to me, but it wasn't a significant
# increase so I let it drop.
#
# Later:  If I simply give up on making mailing lists friendly to my
# sisters (they're not nerds, and create wonderfully attractive HTML
# msgs), a compromise is to strip HTML tags from only text/plain msgs.
# That's principled enough so far as it goes, and eliminates the
# HTML-discussing false positives.  It remains disturbing that the f-n
# rate on pure HTML msgs increases significantly when stripping tags, so
# the code here doesn't do that part.  However, even after stripping tags,
# the rates above show that at least 98% of spams are still correctly
# identified as spam.
# XXX So, if another way is found to slash the f-n rate, the decision here
# XXX not to strip HTML from HTML-only msgs should be revisited.

url_re = re.compile(r"""
    (https? | ftp)  # capture the protocol
    ://             # skip the boilerplate
    # Do a reasonable attempt at detecting the end.  It may or may not
    # be in HTML, may or may not be in quotes, etc.  If it's full of %
    # escapes, cool -- that's a clue too.
    ([^\s<>'"\x7f-\xff]+)  # capture the guts
""", re.VERBOSE)

urlsep_re = re.compile(r"[;?:@&=+,$.]")

has_highbit_char = re.compile(r"[\x80-\xff]").search

# Cheap-ass gimmick to probabilistically find HTML/XML tags.
html_re = re.compile(r"""
    <
    [^\s<>]     # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
    [^>]{0,128} # search for the end '>', but don't run wild
    >
""", re.VERBOSE)

# I'm usually just splitting on whitespace, but for subject lines I want to
# break things like "Python/Perl comparison?" up.  OTOH, I don't want to
# break up the unitized numbers in spammish subject phrases like "Increase
# size 79%" or "Now only $29.95!".  Then again, I do want to break up
# "Python-Dev".
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")

def tokenize_word(word, _len=len):
    n = _len(word)

    # XXX How big should "a word" be?
    # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
    # XXX on f-p rate, and did a little better or worse than 12 across
    # XXX runs -- overall, no significant difference.  It's only "common
    # XXX sense" so far driving the exclusion of lengths 1 and 2.

    # Make sure this range matches in tokenize().
    if 3 <= n <= 12:
        yield word

    elif n >= 3:
        # A long word.

        # Don't want to skip embedded email addresses.
        if n < 40 and '.' in word and word.count('@') == 1:
            p1, p2 = word.split('@')
            yield 'email name:' + p1
            for piece in p2.split('.'):
                yield 'email addr:' + piece

        # If there are any high-bit chars,
        # tokenize it as byte 5-grams.
        # XXX This really won't work for high-bit languages -- the scoring
        # XXX scheme throws almost everything away, and one bad phrase can
        # XXX generate enough bad 5-grams to dominate the final score.
        # XXX This also increases the database size substantially.
        elif has_highbit_char(word):
            for i in xrange(n-4):
                yield "5g:" + word[i : i+5]

        else:
            # It's a long string of "normal" chars.  Ignore it.
            # For example, it may be an embedded URL (which we already
            # tagged), or a uuencoded line.
            # There's value in generating a token indicating roughly how
            # many chars were skipped.  This has real benefit for the f-n
            # rate, but is neutral for the f-p rate.  I don't know why!
            # XXX Figure out why, and/or see if some other way of
            # XXX summarizing this info has greater benefit.
            yield "skip:%c %d" % (word[0], n // 10 * 10)

# Generate tokens for:
#    Content-Type
#        and its type= param
#    Content-Disposition
#        and its filename= param
#    Content-Transfer-Encoding
#    all the charsets
#
# This has huge benefit for the f-n rate, and virtually none on the f-p
# rate, although it does reduce the variance of the f-p rate across
# different training sets (really marginal msgs, like a brief HTML msg
# saying just "unsubscribe me", are almost always tagged as spam now;
# before they were right on the edge, and now the multipart/alternative
# pushes them over it more consistently).
#
# XXX I put all of this in as one chunk.  I don't know which parts are
# XXX most effective; it could be that some parts don't help at all.  But
# XXX given the nature of the c.l.py tests, it's not surprising that the
# XXX     'content-type:text/html'
# XXX token is now the single most powerful spam indicator (== makes it
# XXX into the nbest list most often).  What *is* a little surprising is
# XXX that this doesn't push more mixed-type msgs into the f-p camp --
# XXX unlike looking at *all* HTML tags, this is just one spam indicator
# XXX instead of dozens, so relevant msg content can cancel it out.
def crack_content_xyz(msg):
    x = msg.get_type()
    if x is not None:
        yield 'content-type:' + x.lower()

    x = msg.get_param('type')
    if x is not None:
        yield 'content-type/type:' + x.lower()

    for x in msg.get_charsets(None):
        if x is not None:
            yield 'charset:' + x.lower()

    x = msg.get('content-disposition')
    if x is not None:
        yield 'content-disposition:' + x.lower()

    fname = msg.get_filename()
    if fname is not None:
        for x in fname.lower().split('/'):
            for y in x.split('.'):
                yield 'filename:' + y

    x = msg.get('content-transfer-encoding:')
    if x is not None:
        yield 'content-transfer-encoding:' + x.lower()

def tokenize(string):
    # Create an email Message object.
    try:
        msg = message_from_string(string)
    except email.Errors.MessageParseError:
        yield 'control: MessageParseError'
        # XXX Fall back to the raw body text?
        return

    # Special tagging of header lines.
    # XXX TODO Neil Schemenauer has gotten a good start on this (pvt email).
    # XXX The headers in my spam and ham corpora are so different (they came
    # XXX from different sources) that if I include them the classifier's
    # XXX job is trivial.  Only some "safe" header lines are included here,
    # XXX where "safe" is specific to my sorry corpora.

    # Content-{Transfer-Encoding, Type, Disposition} and their params.
    t = ''
    for x in msg.walk():
        for w in crack_content_xyz(x):
            yield t + w
        t = '>'

    # Subject:
    # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
    # especially significant in this context.  Experiment showed a small
    # but real benefit to keeping case intact in this specific context.
    x = msg.get('subject', '')
    for w in subject_word_re.findall(x):
        for t in tokenize_word(w):
            yield 'subject:' + t

    # Dang -- I can't use Sender:.  If I do,
    #     'sender:email name:python-list-admin'
    # becomes the most powerful indicator in the whole database.
    #
    # From:
    # Reply-To:
    for field in ('from',):#  'reply-to',):
        prefix = field + ':'
        x = msg.get(field, 'none').lower()
        for w in x.split():
            for t in tokenize_word(w):
                yield prefix + t

    # These headers seem to work best if they're not tokenized:  just
    # normalize case and whitespace.
    # X-Mailer:  This is a pure and significant win for the f-n rate;
    #            f-p rate isn't affected.
    # User-Agent:  Skipping it, as it made no difference.  Very few spams
    #              had a User-Agent field, but lots of hams didn't either,
    #              and the spam probability of User-Agent was very close
    #              to 0.5 (== not a valuable discriminator) across all
    #              training sets.
    for field in ('x-mailer',):
        prefix = field + ':'
        x = msg.get(field, 'none').lower()
        yield prefix + ' '.join(x.split())

    # Organization:
    # Oddly enough, tokenizing this doesn't make any difference to
    # results.  However, noting its mere absence is strong enough to give
    # a tiny improvement in the f-n rate, and since recording that
    # requires only one token across the whole database, the cost is also
    # tiny.
    if msg.get('organization', None) is None:
        yield "bool:noorg"

    # XXX Following is a great idea due to Anthony Baxter.  I can't use it
    # XXX on my test data because the header lines are so different between
    # XXX my ham and spam that it makes a large improvement for bogus
    # XXX reasons.  So it's commented out.  But it's clearly a good thing
    # XXX to do on "normal" data, and subsumes the Organization trick above
    # XXX in a much more general way, yet at comparable cost.
    ### X-UIDL:
    ### Anthony Baxter's idea.  This has spamprob 0.99!  The value is
    ### clearly irrelevant, just the presence or absence matters.
    ### However, it's extremely rare in my spam sets, so doesn't have
    ### much value.
    ###
    ### As also suggested by Anthony, we can capture all such header
    ### oddities just by generating tags for the count of how many times
    ### each header field appears.
    ##x2n = {}
    ##for x in msg.keys():
    ##    x2n[x] = x2n.get(x, 0) + 1
    ##for x in x2n.items():
    ##    yield "header:%s:%d" % x

    # Find, decode (base64, qp), and tokenize the textual parts of the body.
    for part in textparts(msg):
        # Decode, or take it as-is if decoding fails.
        try:
            text = part.get_payload(decode=True)
        except:
            yield "control: couldn't decode"
            text = part.get_payload(decode=False)

        if text is None:
            yield 'control: payload is None'
            continue

        # Normalize case.
        text = text.lower()

        # Special tagging of embedded URLs.
        for proto, guts in url_re.findall(text):
            yield "proto:" + proto
            # Lose the trailing punctuation for casual embedding, like:
            #     The code is at http://mystuff.org/here?  Didn't resolve.
            # or
            #     I found it at http://mystuff.org/there/.  Thanks!
            assert guts
            while guts and guts[-1] in '.:?!/':
                guts = guts[:-1]
            for i, piece in enumerate(guts.split('/')):
                prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
                for chunk in urlsep_re.split(piece):
                    yield prefix + chunk

        # Remove HTML/XML tags if it's a plain text message.
        if part.get_content_type() == "text/plain":
            text = html_re.sub(' ', text)

        # Tokenize everything.
        for w in text.split():
            n = len(w)
            # Make sure this range matches in tokenize_word().
            if 3 <= n <= 12:
                yield w
            elif n >= 3:
                for t in tokenize_word(w):
                    yield t

Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** README.txt	5 Sep 2002 23:42:52 -0000	1.3
--- README.txt	6 Sep 2002 17:33:25 -0000	1.4
***************
*** 27,34 ****
      of false positives and false negatives.
  
  timtest.py
!     A concrete test driver and tokenizer that uses Tester and
!     classifier (above).  This assumes "a standard" test data setup
!     (see below).  Could stand massive refactoring.
  
  GBayes.py
--- 27,39 ----
      of false positives and false negatives.
  
+ timtoken.py
+     An implementation of tokenize() that Tim can't seem to help but
+     keep working on.
+ 
  timtest.py
!     A concrete test driver that uses Tester and classifier (above).
!     This assumes "a standard" test data setup (see below).  Could stand
!     massive refactoring.  You need to fiddle a line near the top to
!     import a tokenize() function of your choosing.
  
  GBayes.py

Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** timtest.py	6 Sep 2002 17:12:49 -0000	1.5
--- timtest.py	6 Sep 2002 17:33:26 -0000	1.6
***************
*** 7,18 ****
  
  import os
- import re
  from sets import Set
- import email
- from email import message_from_string
  import cPickle as pickle
  
  import Tester
  import classifier
  
  class Hist:
--- 7,16 ----
  
  import os
  from sets import Set
  import cPickle as pickle
  
  import Tester
  import classifier
+ from timtoken import tokenize
  
  class Hist:
***************
*** 57,663 ****
      print "Spam distribution for", tag
      spam.display()
- 
- # Find all the text components of the msg.  There's no point decoding
- # binary blobs (like images).  If a multipart/alternative has both plain
- # text and HTML versions of a msg, ignore the HTML part:  HTML
- # decorations have monster-high spam probabilities, and innocent
- # newbies often post using HTML.
- def textparts(msg):
-     text = Set()
-     redundant_html = Set()
-     for part in msg.walk():
-         if part.get_content_type() == 'multipart/alternative':
-             # Descend this part of the tree, adding any redundant HTML text
-             # part to redundant_html.
-             htmlpart = textpart = None
-             stack = part.get_payload()
-             while stack:
-                 subpart = stack.pop()
-                 ctype = subpart.get_content_type()
-                 if ctype == 'text/plain':
-                     textpart = subpart
-                 elif ctype == 'text/html':
-                     htmlpart = subpart
-                 elif ctype == 'multipart/related':
-                     stack.extend(subpart.get_payload())
- 
-             if textpart is not None:
-                 text.add(textpart)
-                 if htmlpart is not None:
-                     redundant_html.add(htmlpart)
-             elif htmlpart is not None:
-                 text.add(htmlpart)
- 
-         elif part.get_content_maintype() == 'text':
-             text.add(part)
- 
-     return text - redundant_html
- 
- ##############################################################################
- # To fold case or not to fold case? I didn't want to fold case, because
- # it hides information in English, and I have no idea what .lower() does
- # to other languages; and, indeed, 'FREE' (all caps) turned out to be one
- # of the strongest spam indicators in my content-only tests (== one with
- # prob 0.99 *and* made it into spamprob's nbest list very often).
- #
- # Against preserving case, it makes the database size larger, and requires
- # more training data to get enough "representative" mixed-case examples.
- #
- # Running my c.l.py tests didn't support my intuition that case was
- # valuable, so it's getting folded away now. Folding or not made no
- # significant difference to the false positive rate, and folding made a
- # small (but statistically significant all the same) reduction in the
- # false negative rate. There is one obvious difference: after folding
- # case, conference announcements no longer got high spam scores. Their
- # content was usually fine, but they were highly penalized for VISIT OUR
- # WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING. That is
- # indeed the language of advertising, and I halfway regret that folding
- # away case no longer picks on them.
- #
- # Since the f-p rate didn't change, but conference announcements escaped
- # that category, something else took their place. It seems to be highly
- # off-topic messages, like debates about Microsoft's place in the world.
- # Talk about "money" and "lucrative" is indistinguishable now from talk
- # about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.
- 
- 
- ##############################################################################
- # Character n-grams or words?
- #
- # With careful multiple-corpora c.l.py tests sticking to case-folded decoded
- # text-only portions, and ignoring headers, and with identical special
- # parsing & tagging of embedded URLs:
- #
- # Character 3-grams gave 5x as many false positives as split-on-whitespace
- # (s-o-w). The f-n rate was also significantly worse, but within a factor
- # of 2. So character 3-grams lost across the board.
- #
- # Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
- # s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
- # difference between 23 and 34 f-ps. There aren't enough there to say that's
- # significantly more with killer-high confidence. There were plenty of f-ns,
- # though, and the f-n rate with character 5-grams was substantially *worse*
- # than with character 3-grams (which in turn was substantially worse than
- # with s-o-w).
- #
- # Training on character 5-grams creates many more unique tokens than s-o-w:
- # a typical run bloated to 150MB process size. It also ran a lot slower than
- # s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
- # dict. I rarely noticed disk activity when running s-o-w, so rarely bothered
- # to look at process size; it was under 30MB last time I looked.
- #
- # Figuring out *why* a msg scored as it did proved much more mysterious when
- # working with character n-grams: they often had no obvious "meaning". In
- # contrast, it was always easy to figure out what s-o-w was picking up on.
- # 5-grams flagged a msg from Christian Tismer as spam, where he was discussing
- # the speed of tasklets under his new implementation of stackless:
- #
- #     prob = 0.99999998959
- #     prob('ed sw') = 0.01
- #     prob('http0:pgp') = 0.01
- #     prob('http0:python') = 0.01
- #     prob('hlon ') = 0.99
- #     prob('http0:wwwkeys') = 0.01
- #     prob('http0:starship') = 0.01
- #     prob('http0:stackless') = 0.01
- #     prob('n xp ') = 0.99
- #     prob('on xp') = 0.99
- #     prob('p 150') = 0.99
- #     prob('lon x') = 0.99
- #     prob(' amd ') = 0.99
- #     prob(' xp 1') = 0.99
- #     prob(' athl') = 0.99
- #     prob('1500+') = 0.99
- #     prob('xp 15') = 0.99
- #
- # The spam decision was baffling until I realized that *all* the high-
- # probability spam 5-grams there came out of a single phrase:
- #
- #     AMD Athlon XP 1500+
- #
- # So Christian was punished for using a machine lots of spam tries to
- # sell. In a classic Bayesian classifier, this probably wouldn't have
- # mattered, but Graham's throws away almost all the 5-grams from a msg,
- # saving only the about-a-dozen farthest from a neutral 0.5. So one bad
- # phrase can kill you! This appears to happen very rarely, but happened
- # more than once.
- #
- # The conclusion is that character n-grams have almost nothing to recommend
- # them under Graham's scheme: harder to work with, slower, much larger
- # database, worse results, and prone to rare mysterious disasters.
- #
- # There's one area they won hands-down: detecting spam in what I assume are
- # Asian languages. The s-o-w scheme sometimes finds only line-ends to split
- # on then, and then a "hey, this 'word' is way too big! let's ignore it"
- # gimmick kicks in, and produces no tokens at all.
- #
- # [Later: we produce character 5-grams then under the s-o-w scheme, instead
- # ignoring the blob, but only if there are high-bit characters in the blob;
- # e.g., there's no point 5-gramming uuencoded lines, and doing so would
- # bloat the database size.]
- #
- # Interesting: despite that odd example above, the *kinds* of f-p mistakes
- # 5-grams made were very much like s-o-w made -- I recognized almost all of
- # the 5-gram f-p messages from previous s-o-w runs. For example, both
- # schemes have a particular hatred for conference announcements, although
- # s-o-w stopped hating them after folding case. But 5-grams still hate them.
- # Both schemes also hate msgs discussing HTML with examples, with about equal
- # passion. Both schemes hate brief "please subscribe [unsubscribe] me"
- # msgs, although 5-grams seems to hate them more.
- 
- 
- ##############################################################################
- # How to tokenize?
- #
- # I started with string.split() merely for speed. Over time I realized it
- # was making interesting context distinctions qualitatively akin to n-gram
- # schemes; e.g., "free!!" is a much stronger spam indicator than "free". But
- # unlike n-grams (whether word- or character- based) under Graham's scoring
- # scheme, this mild context dependence never seems to go over the edge in
- # giving "too much" credence to an unlucky phrase.
- #
- # OTOH, compared to "searching for words", it increases the size of the
- # database substantially, less than but close to a factor of 2. This is very
- # much less than a word bigram scheme bloats it, but as always an increase
- # isn't justified unless the results are better.
- #
- # Following are stats comparing
- #
- #     for token in text.split():  # left column
- #
- # to
- #
- #     for token in re.findall(r"[\w$\-\x80-\xff]+", text):  # right column
- #
- # text is case-normalized (text.lower()) in both cases, and the runs were
- # identical in all other respects. The results clearly favor the split()
- # gimmick, although they vaguely suggest that some sort of compromise
- # may do as well with less database burden; e.g., *perhaps* folding runs of
- # "punctuation" characters into a canonical representative could do that.
- # But the database size is reasonable without that, and plain split() avoids
- # having to worry about how to "fold punctuation" in languages other than
- # English.
- #
- # false positive percentages
- #     0.000  0.000  tied
- #     0.000  0.050  lost
- #     0.050  0.150  lost
- #     0.000  0.025  lost
- #     0.025  0.050  lost
- #     0.025  0.075  lost
- #     0.050  0.150  lost
- #     0.025  0.000  won
- #     0.025  0.075  lost
- #     0.000  0.025  lost
- #     0.075  0.150  lost
- #     0.050  0.050  tied
- #     0.025  0.050  lost
- #     0.000  0.025  lost
- #     0.050  0.025  won
- #     0.025  0.000  won
- #     0.025  0.025  tied
- #     0.000  0.025  lost
- #     0.025  0.075  lost
- #     0.050  0.175  lost
- #
- # won   3 times
- # tied  3 times
- # lost 14 times
- #
- # total unique fp went from 8 to 20
- #
- # false negative percentages
- #     0.945  1.200  lost
- #     0.836  1.018  lost
- #     1.200  1.200  tied
- #     1.418  1.636  lost
- #     1.455  1.418  won
- #     1.091  1.309  lost
- #     1.091  1.272  lost
- #     1.236  1.563  lost
- #     1.564  1.855  lost
- #     1.236  1.491  lost
- #     1.563  1.599  lost
- #     1.563  1.781  lost
- #     1.236  1.709  lost
- #     0.836  0.982  lost
- #     0.873  1.382  lost
- #     1.236  1.527  lost
- #     1.273  1.418  lost
- #     1.018  1.273  lost
- #     1.091  1.091  tied
- #     1.490  1.454  won
- #
- # won   2 times
- # tied  2 times
- # lost 16 times
- #
- # total unique fn went from 292 to 302
- 
- 
- ##############################################################################
- # What about HTML?
- #
- # Computer geeks seem to view use of HTML in mailing lists and newsgroups as
- # a mortal sin. Normal people don't, but so it goes: in a technical list/
- # group, every HTML decoration has spamprob 0.99, there are lots of unique
- # HTML decorations, and lots of them appear at the very start of the message
- # so that Graham's scoring scheme latches on to them tight. As a result,
- # any plain text message just containing an HTML example is likely to be
- # judged spam (every HTML decoration is an extreme).
- #
- # So if a message is multipart/alternative with both text/plain and text/html
- # branches, we ignore the latter, else newbies would never get a message
- # through. If a message is just HTML, it has virtually no chance of getting
- # through.
- #
- # In an effort to let normal people use mailing lists too, and to
- # alleviate the woes of messages merely *discussing* HTML practice, I
- # added a gimmick to strip HTML tags after case-normalization and after
- # special tagging of embedded URLs.
This consisted of a regexp sub pattern, - # where instances got replaced by single blanks: - # - # html_re = re.compile(r""" - # < - # [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b' - # [^>]{0,128} # search for the end '>', but don't chew up the world - # > - # """, re.VERBOSE) - # - # and then - # - # text = html_re.sub(' ', text) - # - # Alas, little good came of this: - # - # false positive percentages - # 0.000 0.000 tied - # 0.000 0.000 tied - # 0.050 0.075 lost - # 0.000 0.000 tied - # 0.025 0.025 tied - # 0.025 0.025 tied - # 0.050 0.050 tied - # 0.025 0.025 tied - # 0.025 0.025 tied - # 0.000 0.050 lost - # 0.075 0.100 lost - # 0.050 0.050 tied - # 0.025 0.025 tied - # 0.000 0.025 lost - # 0.050 0.050 tied - # 0.025 0.025 tied - # 0.025 0.025 tied - # 0.000 0.000 tied - # 0.025 0.050 lost - # 0.050 0.050 tied - # - # won 0 times - # tied 15 times - # lost 5 times - # - # total unique fp went from 8 to 12 - # - # false negative percentages - # 0.945 1.164 lost - # 0.836 1.418 lost - # 1.200 1.272 lost - # 1.418 1.272 won - # 1.455 1.273 won - # 1.091 1.382 lost - # 1.091 1.309 lost - # 1.236 1.381 lost - # 1.564 1.745 lost - # 1.236 1.564 lost - # 1.563 1.781 lost - # 1.563 1.745 lost - # 1.236 1.455 lost - # 0.836 0.982 lost - # 0.873 1.309 lost - # 1.236 1.381 lost - # 1.273 1.273 tied - # 1.018 1.273 lost - # 1.091 1.200 lost - # 1.490 1.599 lost - # - # won 2 times - # tied 1 times - # lost 17 times - # - # total unique fn went from 292 to 327 - # - # The messages merely discussing HTML were no longer fps, so it did what it - # intended there. But the f-n rate nearly doubled on at least one run -- so - # strong a set of spam indicators is the mere presence of HTML. The increase - # in the number of fps despite that the HTML-discussing msgs left that - # category remains mysterious to me, but it wasn't a significant increase - # so I let it drop. - # - # Later: If I simply give up on making mailing lists friendly to my sisters - # (they're not nerds, and create wonderfully attractive HTML msgs), a - # compromise is to strip HTML tags from only text/plain msgs. That's - # principled enough so far as it goes, and eliminates the HTML-discussing - # false positives. It remains disturbing that the f-n rate on pure HTML - # msgs increases significantly when stripping tags, so the code here doesn't - # do that part. However, even after stripping tags, the rates above show that - # at least 98% of spams are still correctly identified as spam. - # XXX So, if another way is found to slash the f-n rate, the decision here - # XXX not to strip HTML from HTML-only msgs should be revisited. - - url_re = re.compile(r""" - (https? | ftp) # capture the protocol - :// # skip the boilerplate - # Do a reasonable attempt at detecting the end. It may or may not - # be in HTML, may or may not be in quotes, etc. If it's full of % - # escapes, cool -- that's a clue too. - ([^\s<>'"\x7f-\xff]+) # capture the guts - """, re.VERBOSE) - - urlsep_re = re.compile(r"[;?:@&=+,$.]") - - has_highbit_char = re.compile(r"[\x80-\xff]").search - - # Cheap-ass gimmick to probabilistically find HTML/XML tags. - html_re = re.compile(r""" - < - [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b' - [^>]{0,128} # search for the end '>', but don't run wild - > - """, re.VERBOSE) - - # I'm usually just splitting on whitespace, but for subject lines I want to - # break things like "Python/Perl comparison?" up. 
OTOH, I don't want to
- # break up the unitized numbers in spammish subject phrases like "Increase
- # size 79%" or "Now only $29.95!". Then again, I do want to break up
- # "Python-Dev".
- subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
- 
- def tokenize_word(word, _len=len):
-     n = _len(word)
- 
-     # XXX How big should "a word" be?
-     # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
-     # XXX on f-p rate, and did a little better or worse than 12 across
-     # XXX runs -- overall, no significant difference. It's only "common
-     # XXX sense" so far driving the exclusion of lengths 1 and 2.
- 
-     # Make sure this range matches in tokenize().
-     if 3 <= n <= 12:
-         yield word
- 
-     elif n >= 3:
-         # A long word.
- 
-         # Don't want to skip embedded email addresses.
-         if n < 40 and '.' in word and word.count('@') == 1:
-             p1, p2 = word.split('@')
-             yield 'email name:' + p1
-             for piece in p2.split('.'):
-                 yield 'email addr:' + piece
- 
-         # If there are any high-bit chars,
-         # tokenize it as byte 5-grams.
-         # XXX This really won't work for high-bit languages -- the scoring
-         # XXX scheme throws almost everything away, and one bad phrase can
-         # XXX generate enough bad 5-grams to dominate the final score.
-         # XXX This also increases the database size substantially.
-         elif has_highbit_char(word):
-             for i in xrange(n-4):
-                 yield "5g:" + word[i : i+5]
- 
-         else:
-             # It's a long string of "normal" chars. Ignore it.
-             # For example, it may be an embedded URL (which we already
-             # tagged), or a uuencoded line.
-             # There's value in generating a token indicating roughly how
-             # many chars were skipped. This has real benefit for the f-n
-             # rate, but is neutral for the f-p rate. I don't know why!
-             # XXX Figure out why, and/or see if some other way of summarizing
-             # XXX this info has greater benefit.
-             yield "skip:%c %d" % (word[0], n // 10 * 10)
- 
- # Generate tokens for:
- #    Content-Type
- #        and its type= param
- #    Content-Disposition
- #        and its filename= param
- #    Content-Transfer-Encoding
- #    all the charsets
- #
- # This has huge benefit for the f-n rate, and virtually none on the f-p rate,
- # although it does reduce the variance of the f-p rate across different
- # training sets (really marginal msgs, like a brief HTML msg saying just
- # "unsubscribe me", are almost always tagged as spam now; before they were
- # right on the edge, and now the multipart/alternative pushes them over it
- # more consistently).
- #
- # XXX I put all of this in as one chunk. I don't know which parts are
- # XXX most effective; it could be that some parts don't help at all. But
- # XXX given the nature of the c.l.py tests, it's not surprising that the
- # XXX     'content-type:text/html'
- # XXX token is now the single most powerful spam indicator (== makes it
- # XXX into the nbest list most often). What *is* a little surprising is
- # XXX that this doesn't push more mixed-type msgs into the f-p camp --
- # XXX unlike looking at *all* HTML tags, this is just one spam indicator
- # XXX instead of dozens, so relevant msg content can cancel it out.
- def crack_content_xyz(msg): - x = msg.get_type() - if x is not None: - yield 'content-type:' + x.lower() - - x = msg.get_param('type') - if x is not None: - yield 'content-type/type:' + x.lower() - - for x in msg.get_charsets(None): - if x is not None: - yield 'charset:' + x.lower() - - x = msg.get('content-disposition') - if x is not None: - yield 'content-disposition:' + x.lower() - - fname = msg.get_filename() - if fname is not None: - for x in fname.lower().split('/'): - for y in x.split('.'): - yield 'filename:' + y - - x = msg.get('content-transfer-encoding:') - if x is not None: - yield 'content-transfer-encoding:' + x.lower() - - def tokenize(string): - # Create an email Message object. - try: - msg = message_from_string(string) - except email.Errors.MessageParseError: - yield 'control: MessageParseError' - # XXX Fall back to the raw body text? - return - - # Special tagging of header lines. - # XXX TODO Neil Schemenauer has gotten a good start on this (pvt email). - # XXX The headers in my spam and ham corpora are so different (they came - # XXX from different sources) that if I include them the classifier's - # XXX job is trivial. Only some "safe" header lines are included here, - # XXX where "safe" is specific to my sorry corpora. - - # Content-{Transfer-Encoding, Type, Disposition} and their params. - t = '' - for x in msg.walk(): - for w in crack_content_xyz(x): - yield t + w - t = '>' - - # Subject: - # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is - # especially significant in this context. Experiment showed a small - # but real benefit to keeping case intact in this specific context. - x = msg.get('subject', '') - for w in subject_word_re.findall(x): - for t in tokenize_word(w): - yield 'subject:' + t - - # Dang -- I can't use Sender:. If I do, - # 'sender:email name:python-list-admin' - # becomes the most powerful indicator in the whole database. - # - # From: - # Reply-To: - for field in ('from',):# 'reply-to',): - prefix = field + ':' - x = msg.get(field, 'none').lower() - for w in x.split(): - for t in tokenize_word(w): - yield prefix + t - - # These headers seem to work best if they're not tokenized: just - # normalize case and whitespace. - # X-Mailer: This is a pure and significant win for the f-n rate; f-p - # rate isn't affected. - # User-Agent: Skipping it, as it made no difference. Very few spams - # had a User-Agent field, but lots of hams didn't either, - # and the spam probability of User-Agent was very close to - # 0.5 (== not a valuable discriminator) across all training - # sets. - for field in ('x-mailer',): - prefix = field + ':' - x = msg.get(field, 'none').lower() - yield prefix + ' '.join(x.split()) - - # Organization: - # Oddly enough, tokenizing this doesn't make any difference to results. - # However, noting its mere absence is strong enough to give a tiny - # improvement in the f-n rate, and since recording that requires only - # one token across the whole database, the cost is also tiny. - if msg.get('organization', None) is None: - yield "bool:noorg" - - # XXX Following is a great idea due to Anthony Baxter. I can't use it - # XXX on my test data because the header lines are so different between - # XXX my ham and spam that it makes a large improvement for bogus - # XXX reasons. So it's commented out. But it's clearly a good thing - # XXX to do on "normal" data, and subsumes the Organization trick above - # XXX in a much more general way, yet at comparable cost. - ### X-UIDL: - ### Anthony Baxter's idea. 
This has spamprob 0.99! The value is clearly - ### irrelevant, just the presence or absence matters. However, it's - ### extremely rare in my spam sets, so doesn't have much value. - ### - ### As also suggested by Anthony, we can capture all such header oddities - ### just by generating tags for the count of how many times each header - ### field appears. - ##x2n = {} - ##for x in msg.keys(): - ## x2n[x] = x2n.get(x, 0) + 1 - ##for x in x2n.items(): - ## yield "header:%s:%d" % x - - # Find, decode (base64, qp), and tokenize the textual parts of the body. - for part in textparts(msg): - # Decode, or take it as-is if decoding fails. - try: - text = part.get_payload(decode=True) - except: - yield "control: couldn't decode" - text = part.get_payload(decode=False) - - if text is None: - yield 'control: payload is None' - continue - - # Normalize case. - text = text.lower() - - # Special tagging of embedded URLs. - for proto, guts in url_re.findall(text): - yield "proto:" + proto - # Lose the trailing punctuation for casual embedding, like: - # The code is at http://mystuff.org/here? Didn't resolve. - # or - # I found it at http://mystuff.org/there/. Thanks! - assert guts - while guts and guts[-1] in '.:?!/': - guts = guts[:-1] - for i, piece in enumerate(guts.split('/')): - prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1') - for chunk in urlsep_re.split(piece): - yield prefix + chunk - - # Remove HTML/XML tags if it's a plain text message. - if part.get_content_type() == "text/plain": - text = html_re.sub(' ', text) - - # Tokenize everything. - for w in text.split(): - n = len(w) - # Make sure this range matches in tokenize_word(). - if 3 <= n <= 12: - yield w - - elif n >= 3: - for t in tokenize_word(w): - yield t class Msg(object): --- 55,58 ---- From tim_one@users.sourceforge.net Fri Sep 6 20:13:02 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 06 Sep 2002 12:13:02 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.6,1.7 timtoken.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11785 Modified Files: timtest.py timtoken.py Log Message: crack_content_xyz(): A bug prevented Content-Transfer-Encoding from getting picked up. Fixed the bug, and then experiment showed it didn't help, so disabled the corrected code and added a comment block explaining why it's disabled. Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** timtest.py 6 Sep 2002 17:33:26 -0000 1.6 --- timtest.py 6 Sep 2002 19:12:59 -0000 1.7 *************** *** 103,109 **** trained_spam_hist = Hist(nbuckets) ! #fp = file('w.pik', 'wb') ! #pickle.dump(c, fp, 1) ! #fp.close() for sd2, hd2 in SPAMHAMDIRS: --- 103,109 ---- trained_spam_hist = Hist(nbuckets) ! fp = file('w.pik', 'wb') ! pickle.dump(c, fp, 1) ! 
fp.close()
 
      for sd2, hd2 in SPAMHAMDIRS:

Index: timtoken.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** timtoken.py	6 Sep 2002 17:33:26 -0000	1.1
--- timtoken.py	6 Sep 2002 19:12:59 -0000	1.2
***************
*** 433,437 ****
  # Content-Disposition
  #     and its filename= param
- # Content-Transfer-Encoding
  #     all the charsets
  #
--- 433,436 ----
***************
*** 452,455 ****
--- 451,516 ----
  # XXX unlike looking at *all* HTML tags, this is just one spam indicator
  # XXX instead of dozens, so relevant msg content can cancel it out.
+ #
+ # A bug in this code prevented Content-Transfer-Encoding from getting
+ # picked up. Fixing that bug showed that it didn't help, so the corrected
+ # code is disabled now (left column without Content-Transfer-Encoding,
+ # right column with it):
+ #
+ # false positive percentages
+ #     0.000  0.000  tied
+ #     0.000  0.000  tied
+ #     0.100  0.100  tied
+ #     0.000  0.000  tied
+ #     0.025  0.025  tied
+ #     0.025  0.025  tied
+ #     0.100  0.100  tied
+ #     0.025  0.025  tied
+ #     0.025  0.025  tied
+ #     0.050  0.050  tied
+ #     0.100  0.100  tied
+ #     0.025  0.025  tied
+ #     0.025  0.025  tied
+ #     0.025  0.025  tied
+ #     0.025  0.025  tied
+ #     0.025  0.025  tied
+ #     0.025  0.025  tied
+ #     0.000  0.025  lost  +(was 0)
+ #     0.025  0.025  tied
+ #     0.100  0.100  tied
+ #
+ # won   0 times
+ # tied 19 times
+ # lost  1 times
+ #
+ # total unique fp went from 9 to 10
+ #
+ # false negative percentages
+ #     0.364  0.400  lost    +9.89%
+ #     0.400  0.364  won     -9.00%
+ #     0.400  0.436  lost    +9.00%
+ #     0.909  0.872  won     -4.07%
+ #     0.836  0.836  tied
+ #     0.618  0.618  tied
+ #     0.291  0.291  tied
+ #     1.018  0.981  won     -3.63%
+ #     0.982  0.982  tied
+ #     0.727  0.727  tied
+ #     0.800  0.800  tied
+ #     1.163  1.127  won     -3.10%
+ #     0.764  0.836  lost    +9.42%
+ #     0.473  0.473  tied
+ #     0.473  0.618  lost   +30.66%
+ #     0.727  0.763  lost    +4.95%
+ #     0.655  0.618  won     -5.65%
+ #     0.509  0.473  won     -7.07%
+ #     0.545  0.582  lost    +6.79%
+ #     0.509  0.509  tied
+ #
+ # won   6 times
+ # tied  8 times
+ # lost  6 times
+ #
+ # total unique fn went from 168 to 169
+ 
  def crack_content_xyz(msg):
      x = msg.get_type()
***************
*** 475,481 ****
      yield 'filename:' + y
 
!     x = msg.get('content-transfer-encoding:')
!     if x is not None:
!         yield 'content-transfer-encoding:' + x.lower()
 
  def tokenize(string):
--- 536,543 ----
      yield 'filename:' + y
 
!     if 0:  # disabled; see comment before function
!         x = msg.get('content-transfer-encoding')
!         if x is not None:
!             yield 'content-transfer-encoding:' + x.lower()
 
  def tokenize(string):
***************
*** 495,499 ****
  # XXX where "safe" is specific to my sorry corpora.
 
!     # Content-{Transfer-Encoding, Type, Disposition} and their params.
      t = ''
      for x in msg.walk():
--- 557,561 ----
  # XXX where "safe" is specific to my sorry corpora.
 
!     # Content-{Type, Disposition} and their params, and charsets.
      t = ''
      for x in msg.walk():
***************
*** 601,605 ****
      text = html_re.sub(' ', text)
 
!     # Tokenize everything.
      for w in text.split():
          n = len(w)
--- 663,667 ----
      text = html_re.sub(' ', text)
 
!     # Tokenize everything in the body.
for w in text.split():
          n = len(w)

From jhylton@users.sourceforge.net  Fri Sep  6 20:26:36 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Fri, 06 Sep 2002 12:26:36 -0700
Subject: [Spambayes-checkins] spambayes mboxtest.py,NONE,1.1 timtest.py,1.7,1.8
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv16790

Modified Files:
	timtest.py
Added Files:
	mboxtest.py
Log Message:
Add a test driver that works with mboxes. This is similar in spirit
to timtest, but it works with any old kind of mailbox recognized by
the Python mailbox module.

One non-trivial difference from timtest: Rather than requiring that
the user split the mailbox into separate parts, it selects NSETS
different subsets of the mailbox to use for testing. It chooses an
arbitrary subset because my mailboxes are sorted by date, and I didn't
want to bias tests by choosing training data from a small period of
time.

The timtest module has grown a Driver() class that is intended to work
just like the drive() function, but with a bit more flexibility. The
jdrive() function might be able to replace drive(), but I can't test
it so I'm not going to replace it. Maybe Tim will try jdrive() and
report if it works correctly.

I didn't find the MsgStream() class useful outside of timtest, but
mailboxes are represented by the mbox class, which is an iterable
collection of Msg objects.

Renamed the path attribute of Msg to tag, since path doesn't make
sense with an mbox. The path was getting used as a human-readable tag
for messages, so I synthesized one for mbox messages.

--- NEW FILE: mboxtest.py ---
#! /usr/bin/env python

from timtoken import tokenize
from classifier import GrahamBayes
from Tester import Test
from timtest import Driver, Msg

import getopt
import mailbox
import random
from sets import Set
import sys

mbox_fmts = {"unix": mailbox.PortableUnixMailbox,
             "mmdf": mailbox.MmdfMailbox,
             "mh": mailbox.MHMailbox,
             "qmail": mailbox.Maildir,
             }

class MboxMsg(Msg):

    def __init__(self, fp, path, index):
        self.guts = fp.read()
        self.tag = "%s:%s %s" % (path, index, subject(self.guts))

class mbox(object):

    def __init__(self, path, indices=None):
        self.path = path
        self.indices = {}
        self.key = ''
        if indices is not None:
            self.key = " %s" % indices[0]
            for i in indices:
                self.indices[i] = 1

    def __repr__(self):
        return "<mbox: %s%s>" % (self.path, self.key)

    def __iter__(self):
        # Use a simple factory that just produces a string.
mbox = mbox_fmts[FMT](open(self.path, "rb"), lambda f: MboxMsg(f, self.path, i)) i = 0 while 1: msg = mbox.next() if msg is None: return i += 1 if self.indices.get(i-1) or not self.indices: yield msg def subject(buf): buf = buf.lower() i = buf.find('subject:') j = buf.find("\n", i) return buf[i:j] def randindices(nelts, nresults): L = range(nelts) random.shuffle(L) chunk = nelts / nresults for i in range(nresults): yield Set(L[:chunk]) del L[:chunk] def sort(seq): L = list(seq) L.sort() return L def main(args): global FMT FMT = "unix" NSETS = 5 SEED = 101 LIMIT = None opts, args = getopt.getopt(args, "f:n:s:l:") for k, v in opts: if k == '-f': FMT = v if k == '-n': NSETS = int(v) if k == '-s': SEED = int(v) if k == '-l': LIMIT = int(v) ham, spam = args random.seed(SEED) nham = len(list(mbox(ham))) nspam = len(list(mbox(spam))) if LIMIT: nham = min(nham, LIMIT) nspam = min(nspam, LIMIT) print "ham", ham, nham print "spam", spam, nspam testsets = [] for iham in randindices(nham, NSETS): for ispam in randindices(nspam, NSETS): testsets.append((sort(iham), sort(ispam))) driver = Driver() for iham, ispam in testsets: driver.train(mbox(ham, iham), mbox(spam, ispam)) for ihtest, istest in testsets: if (iham, ispam) == (ihtest, istest): continue driver.test(mbox(ham, ihtest), mbox(spam, istest)) driver.finish() driver.alldone() if __name__ == "__main__": sys.exit(main(sys.argv[1:])) Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** timtest.py 6 Sep 2002 19:12:59 -0000 1.7 --- timtest.py 6 Sep 2002 19:26:34 -0000 1.8 *************** *** 59,63 **** def __init__(self, dir, name): path = dir + "/" + name ! self.path = path f = open(path, 'rb') guts = f.read() --- 59,63 ---- def __init__(self, dir, name): path = dir + "/" + name ! self.tag = path f = open(path, 'rb') guts = f.read() *************** *** 69,76 **** def __hash__(self): ! return hash(self.path) def __eq__(self, other): ! return self.path == other.path class MsgStream(object): --- 69,76 ---- def __hash__(self): ! return hash(self.tag) def __eq__(self, other): ! return self.tag == other.tag class MsgStream(object): *************** *** 86,89 **** --- 86,198 ---- return self.produce() + class Driver: + + def __init__(self): + self.nbuckets = 40 + self.falsepos = Set() + self.falseneg = Set() + self.global_ham_hist = Hist(self.nbuckets) + self.global_spam_hist = Hist(self.nbuckets) + + def train(self, ham, spam): + self.classifier = classifier.GrahamBayes() + self.tester = Tester.Test(self.classifier) + print "Training on", ham, "&", spam, "..." 
+ self.tester.train(ham, spam) + + self.trained_ham_hist = Hist(self.nbuckets) + self.trained_spam_hist = Hist(self.nbuckets) + + def finish(self): + printhist("all in this set:", + self.trained_ham_hist, self.trained_spam_hist) + self.global_ham_hist += self.trained_ham_hist + self.global_spam_hist += self.trained_spam_hist + + def alldone(self): + printhist("all runs:", self.global_ham_hist, self.global_spam_hist) + + def test(self, ham, spam): + c = self.classifier + t = self.tester + local_ham_hist = Hist(self.nbuckets) + local_spam_hist = Hist(self.nbuckets) + + def new_ham(msg, prob): + local_ham_hist.add(prob) + + def new_spam(msg, prob): + local_spam_hist.add(prob) + if prob < 0.1: + print + print "Low prob spam!", prob + print msg.tag + prob, clues = c.spamprob(msg, True) + for clue in clues: + print "prob(%r) = %g" % clue + print + print msg.guts + + t.reset_test_results() + print " testing against", ham, "&", spam, "...", + t.predict(spam, True, new_spam) + t.predict(ham, False, new_ham) + print t.nham_tested, "hams &", t.nspam_tested, "spams" + + print " false positive:", t.false_positive_rate() + print " false negative:", t.false_negative_rate() + + newfpos = Set(t.false_positives()) - self.falsepos + self.falsepos |= newfpos + print " new false positives:", [e.tag for e in newfpos] + for e in newfpos: + print '*' * 78 + print e.tag + prob, clues = c.spamprob(e, True) + print "prob =", prob + for clue in clues: + print "prob(%r) = %g" % clue + print + print e.guts + + newfneg = Set(t.false_negatives()) - self.falseneg + self.falseneg |= newfneg + print " new false negatives:", [e.tag for e in newfneg] + for e in []:#newfneg: + print '*' * 78 + print e.tag + prob, clues = c.spamprob(e, True) + print "prob =", prob + for clue in clues: + print "prob(%r) = %g" % clue + print + print e.guts[:1000] + + print + print " best discriminators:" + stats = [(r.killcount, w) for w, r in c.wordinfo.iteritems()] + stats.sort() + del stats[:-30] + for count, w in stats: + r = c.wordinfo[w] + print " %r %d %g" % (w, r.killcount, r.spamprob) + + + printhist("this pair:", local_ham_hist, local_spam_hist) + + self.trained_ham_hist += local_ham_hist + self.trained_spam_hist += local_spam_hist + + def jdrive(): + d = Driver() + + for spamdir, hamdir in SPAMHAMDIRS: + d.train(MsgStream(hamdir), MsgStream(spamdir)) + for sd2, hd2 in SPAMHAMDIRS: + if (sd2, hd2) == (spamdir, hamdir): + continue + d.test(MsgStream(hd2), MsgStream(sd2)) + d.finish() + d.alldone() def drive(): *************** *** 185,187 **** printhist("all runs:", global_ham_hist, global_spam_hist) ! drive() --- 294,297 ---- printhist("all runs:", global_ham_hist, global_spam_hist) ! if __name__ == "__main__": ! drive() From rubiconx@users.sourceforge.net Fri Sep 6 20:29:58 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Fri, 06 Sep 2002 12:29:58 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,NONE,1.1 classifier.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16269 Modified Files: classifier.py Added Files: hammie.py Log Message: First stab at a procmail-ready application of all this great code. The dbm method doesn't work for me yet, but you can use it as-is with the pickle method, just invoke it from procmail with the -f option. I had to make a minor change to the classifier so it would write back modified values to the database. 
I suppose I could have done this by subclassing WordInfo's __setattr__
with a callback to the containing PersistentGrahamBayes class, but this
way is cleaner and should incur only a negligible penalty for the
original GrahamBayes class.

I hope this is okay with Tim :^)

--- NEW FILE: hammie.py ---
#! /usr/bin/env python

# A driver for the classifier module. Currently mostly a wrapper around
# existing stuff.

"""Usage: %(program)s [options]

Where:
    -h
        show usage and exit
    -g PATH
        mbox or directory of known good messages (non-spam)
    -s PATH
        mbox or directory of known spam messages
    -p FILE
        use file as the persistent store. loads data from this file if it
        exists, and saves data to this file at the end. Default: hammie.db
    -d
        use the DBM store instead of cPickle. The file is larger and
        creating it is slower, but checking against it is much faster,
        especially for large word databases.
    -f
        run as a filter: read a single message from stdin, add an
        X-Spam-Disposition header, and write it to stdout.
"""

import sys
import os
import stat
import getopt
import mailbox
import email
import classifier
import errno
import anydbm
import cPickle as pickle

program = sys.argv[0]

# Tim's tokenizer kicks far more booty than anything I would have
# written. Score one for analysis ;)
from timtoken import tokenize

class DBDict:

    """Database Dictionary

    This wraps an anydbm to make it look even more like a dictionary.

    Call it with the name of your database file. Optionally, you can
    specify a list of keys to skip when iterating. This only affects
    iterators; things like .keys() still list everything. For instance:

    >>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo'))
    >>> d['skipme'] = 'booga'
    >>> d['countme'] = 'wakka'
    >>> print d.keys()
    ['skipme', 'countme']
    >>> for k in d.iterkeys():
    ...     print k
    countme

    """

    def __init__(self, dbname, iterskip=()):
        self.hash = anydbm.open(dbname, 'c')
        self.iterskip = iterskip

    def __getitem__(self, key):
        if self.hash.has_key(key):
            return pickle.loads(self.hash[key])
        else:
            raise KeyError(key)

    def __setitem__(self, key, val):
        v = pickle.dumps(val, 1)
        self.hash[key] = v

    def __delitem__(self, key):
        del(self.hash[key])

    def __iter__(self, fn=None):
        k = self.hash.first()
        while k != None:
            key = k[0]
            val = pickle.loads(k[1])
            if key not in self.iterskip:
                if fn:
                    yield fn((key, val))
                else:
                    yield (key, val)
            try:
                k = self.hash.next()
            except KeyError:
                break

    def __contains__(self, name):
        return self.has_key(name)

    def __getattr__(self, name):
        # Pass the buck
        return getattr(self.hash, name)

    def get(self, key, dfl=None):
        if self.has_key(key):
            return self[key]
        else:
            return dfl

    def iteritems(self):
        return self.__iter__()

    def iterkeys(self):
        return self.__iter__(lambda k: k[0])

    def itervalues(self):
        return self.__iter__(lambda k: k[1])

class PersistentGrahamBayes(classifier.GrahamBayes):

    """A persistent GrahamBayes classifier

    This is just like classifier.GrahamBayes, except that the dictionary
    is a database. You take less disk this way, I think, and you can
    pretend it's persistent. It's much slower training, but much faster
    checking, and takes less memory all around.

    On destruction, an instantiation of this class will write its state
    to a special key. When you instantiate a new one, it will attempt to
    read these values out of that key again, so you can pick up where you
    left off.

    """

    # XXX: Would it be even faster to remember (in a list) which keys
    # had been modified, and only recalculate those keys? No sense in
    # going over the entire word database if only 100 words are
    # affected.
# XXX: Another idea: cache stuff in memory. But by then maybe we # should just use ZODB. def __init__(self, dbname): classifier.GrahamBayes.__init__(self) self.statekey = "saved state" self.wordinfo = DBDict(dbname, (self.statekey,)) self.restore_state() def __del__(self): #super.__del__(self) self.save_state() def save_state(self): self.wordinfo[self.statekey] = (self.nham, self.nspam) def restore_state(self): if self.wordinfo.has_key(self.statekey): self.nham, self.nspam = self.wordinfo[self.statekey] def train(bayes, msgs, is_spam): """Train bayes with a message""" def _factory(fp): try: return email.message_from_file(fp) except email.Errors.MessageParseError: return '' if stat.S_ISDIR(os.stat(msgs)[stat.ST_MODE]): mbox = mailbox.MHMailbox(msgs, _factory) else: fp = open(msgs) mbox = mailbox.PortableUnixMailbox(fp, _factory) i = 0 for msg in mbox: i += 1 # XXX: Is the \r a Unixism? I seem to recall it working in DOS # back in the day. Maybe it's a line-printer-ism ;) sys.stdout.write("\r%6d" % i) sys.stdout.flush() bayes.learn(tokenize(str(msg)), is_spam, False) print def filter(bayes, input, output): """Filter (judge) a message""" msg = email.message_from_file(input) prob, clues = bayes.spamprob(tokenize(str(msg)), True) if prob < 0.9: disp = "No" else: disp = "Yes" disp += "; %.2f" % prob disp += "; " + "; ".join(map(lambda x: "%s: %.2f" % (`x[0]`, x[1]), clues)) msg.add_header("X-Spam-Disposition", disp) output.write(str(msg)) def usage(code, msg=''): if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def main(): try: opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:') except getopt.error, msg: usage(1, msg) if not opts: usage(0, "No options given") pck = "hammie.db" good = spam = None do_filter = usedb = False for opt, arg in opts: if opt == '-h': usage(0) elif opt == '-g': good = arg elif opt == '-s': spam = arg elif opt == '-p': pck = arg elif opt == "-d": usedb = True elif opt == "-f": do_filter = True if args: usage(1) save = False if usedb: bayes = PersistentGrahamBayes(pck) else: bayes = None try: fp = open(pck, 'rb') except IOError, e: if e.errno <> errno.ENOENT: raise else: bayes = pickle.load(fp) fp.close() if bayes is None: bayes = classifier.GrahamBayes() if good: print "Training ham:" train(bayes, good, False) save = True if spam: print "Training spam:" train(bayes, spam, True) save = True if save: bayes.update_probabilities() if not usedb and pck: fp = open(pck, 'wb') pickle.dump(bayes, fp, 1) fp.close() if do_filter: filter(bayes, sys.stdin, sys.stdout) if __name__ == "__main__": main() Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** classifier.py 5 Sep 2002 16:16:43 -0000 1.1 --- classifier.py 6 Sep 2002 19:29:56 -0000 1.2 *************** *** 473,477 **** nham = float(self.nham or 1) nspam = float(self.nspam or 1) ! for record in self.wordinfo.itervalues(): # Compute prob(msg is spam | msg contains word). hamcount = HAMBIAS * record.hamcount --- 473,477 ---- nham = float(self.nham or 1) nspam = float(self.nspam or 1) ! for word,record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). hamcount = HAMBIAS * record.hamcount *************** *** 488,493 **** elif prob > MAX_SPAMPROB: prob = MAX_SPAMPROB ! ! record.spamprob = prob if self.DEBUG: --- 488,494 ---- elif prob > MAX_SPAMPROB: prob = MAX_SPAMPROB ! 
if record.spamprob != prob:
!             record.spamprob = prob
!             self.wordinfo[word] = record
 
          if self.DEBUG:

From tim.one@comcast.net  Fri Sep  6 21:00:53 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 16:00:53 -0400
Subject: [Spambayes-checkins] spambayes hammie.py,NONE,1.1 classifier.py,1.1,1.2
In-Reply-To: 
Message-ID: 

> Modified Files:
> 	classifier.py
> Added Files:
> 	hammie.py

Please add a short blurb about new files to README.txt.

From tim.one@comcast.net  Fri Sep  6 21:01:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 16:01:20 -0400
Subject: [Spambayes-checkins] spambayes mboxtest.py,NONE,1.1 timtest.py,1.7,1.8
In-Reply-To: 
Message-ID: 

> Added Files:
> 	mboxtest.py

Please add a short blurb about new files to README.txt.

From jhylton@users.sourceforge.net  Fri Sep  6 21:08:16 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Fri, 06 Sep 2002 13:08:16 -0700
Subject: [Spambayes-checkins] spambayes README.txt,1.4,1.5
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv32684

Modified Files:
	README.txt
Log Message:
Add entry for mboxtest.py.

Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** README.txt	6 Sep 2002 17:33:25 -0000	1.4
--- README.txt	6 Sep 2002 20:08:14 -0000	1.5
***************
*** 27,32 ****
      of false positives and false negatives.
 
  timtoken.py
!     Am implementation of tokenize() that Tim can't seem to help but keep
      working on .
 
--- 27,38 ----
      of false positives and false negatives.
 
+ mboxtest.py
+     A concrete test driver like timtest.py (see below), but working
+     with a pair of mailbox files rather than the specialized timtest
+     setup. Note that the validity of results from mboxtest.py has
+     yet to be confirmed.
+ 
  timtoken.py
!     An implementation of tokenize() that Tim can't seem to help but keep
      working on .

From gvanrossum@users.sourceforge.net  Fri Sep  6 21:12:07 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 13:12:07 -0700
Subject: [Spambayes-checkins] spambayes hammie.py,1.1,1.2
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1852

Modified Files:
	hammie.py
Log Message:
Use os.path.isdir() to test for directory-ness.

Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** hammie.py	6 Sep 2002 19:29:56 -0000	1.1
--- hammie.py	6 Sep 2002 20:12:05 -0000	1.2
***************
*** 27,31 ****
  import sys
  import os
- import stat
  import getopt
  import mailbox
--- 27,30 ----
***************
*** 167,171 ****
      return ''
 
!     if stat.S_ISDIR(os.stat(msgs)[stat.ST_MODE]):
          mbox = mailbox.MHMailbox(msgs, _factory)
      else:
--- 166,170 ----
      return ''
 
!
if os.path.isdir(msgs): mbox = mailbox.MHMailbox(msgs, _factory) else: From rubiconx@users.sourceforge.net Fri Sep 6 21:13:34 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Fri, 06 Sep 2002 13:13:34 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv1676 Modified Files: README.txt Log Message: Add short blurb about hammie.py Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** README.txt 6 Sep 2002 20:08:14 -0000 1.5 --- README.txt 6 Sep 2002 20:13:31 -0000 1.6 *************** *** 27,30 **** --- 27,34 ---- of false positives and false negatives. + hammie.py + A spamassassin-like filter which uses timtoken (below) and + classifier (above). Needs to be made faster, especially for writes. + mboxtest.py A concrete test driver like timtest.py (see below), but working From gvanrossum@users.sourceforge.net Fri Sep 6 21:23:18 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 13:23:18 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5367 Modified Files: hammie.py Log Message: Add a hack to train directly on a mailbox full of .txt files, like Bruce Guenter's spam archive at http://www.em.ca/~bruceg/spam/. Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** hammie.py 6 Sep 2002 20:12:05 -0000 1.2 --- hammie.py 6 Sep 2002 20:23:16 -0000 1.3 *************** *** 25,32 **** --- 25,35 ---- """ + from __future__ import generators + import sys import os import getopt import mailbox + import glob import email import classifier *************** *** 158,161 **** --- 161,182 ---- + class DirOfTxtFileMailbox: + + """Mailbox directory consisting of .txt files.""" + + def __init__(self, dirname, factory): + self.names = glob.glob(os.path.join(dirname, "*.txt")) + self.factory = factory + + def __iter__(self): + for name in self.names: + try: + f = open(name) + except IOError: + continue + yield self.factory(f) + f.close() + + def train(bayes, msgs, is_spam): """Train bayes with a message""" *************** *** 167,171 **** if os.path.isdir(msgs): ! mbox = mailbox.MHMailbox(msgs, _factory) else: fp = open(msgs) --- 188,197 ---- if os.path.isdir(msgs): ! # XXX This is bogus: use an MHMailbox if the pathname contains /Mail/ ! # XXX Should really use '+foo' MH folder styles. Later. ! if msgs.find("/Mail/") >= 0: ! mbox = mailbox.MHMailbox(msgs, _factory) ! else: ! mbox = DirOfTxtFileMailbox(msgs, _factory) else: fp = open(msgs) From gvanrossum@users.sourceforge.net Fri Sep 6 21:42:47 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 13:42:47 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11926 Modified Files: hammie.py Log Message: train(): recognize '+foo' as the name of MH folder 'foo'. 
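For concreteness, the '+foo' shorthand resolves through the user's MH
profile; a minimal sketch using the standard mhlib and mailbox modules
(the '+spam' folder name here is invented for illustration, not part of
the checkin):

    import mhlib, mailbox, os

    name = "+spam"
    mh = mhlib.MH()          # consults the MH profile for the mail directory
    path = os.path.join(mh.getpath(), name[1:])
    mbox = mailbox.MHMailbox(path)

With this in place a training run can name MH folders directly, along
the lines of "hammie.py -g +ham -s +spam" (flags as in hammie.py's usage
text above; the folder names are again invented).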
Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** hammie.py 6 Sep 2002 20:23:16 -0000 1.3 --- hammie.py 6 Sep 2002 20:42:44 -0000 1.4 *************** *** 187,191 **** return '' ! if os.path.isdir(msgs): # XXX This is bogus: use an MHMailbox if the pathname contains /Mail/ # XXX Should really use '+foo' MH folder styles. Later. --- 187,195 ---- return '' ! if msgs.startswith("+"): ! import mhlib ! mh = mhlib.MH() ! mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:])) ! elif os.path.isdir(msgs): # XXX This is bogus: use an MHMailbox if the pathname contains /Mail/ # XXX Should really use '+foo' MH folder styles. Later. From tim_one@users.sourceforge.net Fri Sep 6 21:42:42 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 06 Sep 2002 13:42:42 -0700 Subject: [Spambayes-checkins] spambayes timtoken.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11651 Modified Files: timtoken.py Log Message: Added a note about an experiment with no lower limit on the length of words we'll look at. Didn't matter to f-p, but hurt f-n. Index: timtoken.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** timtoken.py 6 Sep 2002 19:12:59 -0000 1.2 --- timtoken.py 6 Sep 2002 20:42:40 -0000 1.3 *************** *** 392,395 **** --- 392,397 ---- # XXX runs -- overall, no significant difference. It's only "common # XXX sense" so far driving the exclusion of lengths 1 and 2. + # XXX Later: A test with no lower bound showed a significant increase + # XXX in the f-n rate. Curious! # Make sure this range matches in tokenize(). From gvanrossum@users.sourceforge.net Fri Sep 6 21:48:32 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 13:48:32 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13867 Modified Files: hammie.py Log Message: Fix comments. Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** hammie.py 6 Sep 2002 20:42:44 -0000 1.4 --- hammie.py 6 Sep 2002 20:48:29 -0000 1.5 *************** *** 180,184 **** def train(bayes, msgs, is_spam): ! """Train bayes with a message""" def _factory(fp): try: --- 180,184 ---- def train(bayes, msgs, is_spam): ! """Train bayes with all messages from a mailbox.""" def _factory(fp): try: *************** *** 192,197 **** mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:])) elif os.path.isdir(msgs): ! # XXX This is bogus: use an MHMailbox if the pathname contains /Mail/ ! # XXX Should really use '+foo' MH folder styles. Later. if msgs.find("/Mail/") >= 0: mbox = mailbox.MHMailbox(msgs, _factory) --- 192,197 ---- mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:])) elif os.path.isdir(msgs): ! # XXX Bogus: use an MHMailbox if the pathname contains /Mail/, ! # else a DirOfTxtFileMailbox. 
if msgs.find("/Mail/") >= 0: mbox = mailbox.MHMailbox(msgs, _factory) From tim_one@users.sourceforge.net Fri Sep 6 23:47:50 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 06 Sep 2002 15:47:50 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.8,1.9 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv19153 Modified Files: timtest.py Log Message: Moved this along toward being more pluggable. Nuked the drive() function and renamed Jeremy's jdrive() to drive(). Factored out code for displaying a msg. Repaired some output so that rates.py can find the output it's looking for. Sped the determination of the best discriminators via using an nbest heap instead of materializing the whole wordinfo dict into a list and sorting it. Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** timtest.py 6 Sep 2002 19:26:34 -0000 1.8 --- timtest.py 6 Sep 2002 22:47:48 -0000 1.9 *************** *** 9,12 **** --- 9,13 ---- from sets import Set import cPickle as pickle + from heapq import heapreplace import Tester *************** *** 56,59 **** --- 57,71 ---- spam.display() + def printmsg(msg, prob, clues, charlimit=None): + print msg.tag + print "prob =", prob + for clue in clues: + print "prob(%r) = %g" % clue + print + guts = msg.guts + if charlimit is not None: + guts = guts[:charlimit] + print guts + class Msg(object): def __init__(self, dir, name): *************** *** 78,81 **** --- 90,96 ---- self.directory = directory + def __str__(self): + return self.directory + def produce(self): directory = self.directory *************** *** 86,93 **** return self.produce() class Driver: ! def __init__(self): ! self.nbuckets = 40 self.falsepos = Set() self.falseneg = Set() --- 101,116 ---- return self.produce() + + # Loop: + # train() # on ham and spam + # Loop: + # test() # on presumably new ham and spam + # finishtest() # display stats against all runs on training set + # alldone() # display stats against all runs + class Driver: ! def __init__(self, nbuckets=40): ! self.nbuckets = nbuckets self.falsepos = Set() self.falseneg = Set() *************** *** 97,109 **** def train(self, ham, spam): self.classifier = classifier.GrahamBayes() ! self.tester = Tester.Test(self.classifier) ! print "Training on", ham, "&", spam, "..." ! self.tester.train(ham, spam) self.trained_ham_hist = Hist(self.nbuckets) self.trained_spam_hist = Hist(self.nbuckets) ! def finish(self): ! printhist("all in this set:", self.trained_ham_hist, self.trained_spam_hist) self.global_ham_hist += self.trained_ham_hist --- 120,134 ---- def train(self, ham, spam): self.classifier = classifier.GrahamBayes() ! t = self.tester = Tester.Test(self.classifier) ! ! print "Training on", ham, "&", spam, "...", ! t.train(ham, spam) ! print t.nham, "hams &", t.nspam, "spams" self.trained_ham_hist = Hist(self.nbuckets) self.trained_spam_hist = Hist(self.nbuckets) ! def finishtest(self): ! printhist("all in this training set:", self.trained_ham_hist, self.trained_spam_hist) self.global_ham_hist += self.trained_ham_hist *************** *** 127,136 **** print print "Low prob spam!", prob - print msg.tag prob, clues = c.spamprob(msg, True) ! for clue in clues: ! print "prob(%r) = %g" % clue ! print ! print msg.guts t.reset_test_results() --- 152,157 ---- print print "Low prob spam!", prob prob, clues = c.spamprob(msg, True) ! 
printmsg(msg, prob, clues) t.reset_test_results() *************** *** 148,158 **** for e in newfpos: print '*' * 78 - print e.tag prob, clues = c.spamprob(e, True) ! print "prob =", prob ! for clue in clues: ! print "prob(%r) = %g" % clue ! print ! print e.guts newfneg = Set(t.false_negatives()) - self.falseneg --- 169,174 ---- for e in newfpos: print '*' * 78 prob, clues = c.spamprob(e, True) ! printmsg(e, prob, clues) newfneg = Set(t.false_negatives()) - self.falseneg *************** *** 161,188 **** for e in []:#newfneg: print '*' * 78 - print e.tag prob, clues = c.spamprob(e, True) ! print "prob =", prob ! for clue in clues: ! print "prob(%r) = %g" % clue ! print ! print e.guts[:1000] print print " best discriminators:" ! stats = [(r.killcount, w) for w, r in c.wordinfo.iteritems()] stats.sort() - del stats[:-30] for count, w in stats: r = c.wordinfo[w] print " %r %d %g" % (w, r.killcount, r.spamprob) - printhist("this pair:", local_ham_hist, local_spam_hist) - self.trained_ham_hist += local_ham_hist self.trained_spam_hist += local_spam_hist ! def jdrive(): d = Driver() --- 177,203 ---- for e in []:#newfneg: print '*' * 78 prob, clues = c.spamprob(e, True) ! printmsg(e, prob, clues, 1000) print print " best discriminators:" ! stats = [(-1, None) for i in range(30)] ! smallest_killcount = -1 ! for w, r in c.wordinfo.iteritems(): ! if r.killcount > smallest_killcount: ! heapreplace(stats, (r.killcount, w)) ! smallest_killcount = stats[0][0] stats.sort() for count, w in stats: + if count < 0: + continue r = c.wordinfo[w] print " %r %d %g" % (w, r.killcount, r.spamprob) printhist("this pair:", local_ham_hist, local_spam_hist) self.trained_ham_hist += local_ham_hist self.trained_spam_hist += local_spam_hist ! def drive(): d = Driver() *************** *** 193,296 **** continue d.test(MsgStream(hd2), MsgStream(sd2)) ! 
d.finish() d.alldone() - - def drive(): - nbuckets = 40 - falsepos = Set() - falseneg = Set() - global_ham_hist = Hist(nbuckets) - global_spam_hist = Hist(nbuckets) - for spamdir, hamdir in SPAMHAMDIRS: - c = classifier.GrahamBayes() - t = Tester.Test(c) - print "Training on", hamdir, "&", spamdir, "...", - t.train(MsgStream(hamdir), MsgStream(spamdir)) - print t.nham, "hams &", t.nspam, "spams" - - trained_ham_hist = Hist(nbuckets) - trained_spam_hist = Hist(nbuckets) - - fp = file('w.pik', 'wb') - pickle.dump(c, fp, 1) - fp.close() - - for sd2, hd2 in SPAMHAMDIRS: - if (sd2, hd2) == (spamdir, hamdir): - continue - - local_ham_hist = Hist(nbuckets) - local_spam_hist = Hist(nbuckets) - - def new_ham(msg, prob): - local_ham_hist.add(prob) - - def new_spam(msg, prob): - local_spam_hist.add(prob) - if prob < 0.1: - print - print "Low prob spam!", prob - print msg.path - prob, clues = c.spamprob(msg, True) - for clue in clues: - print "prob(%r) = %g" % clue - print - print msg.guts - - t.reset_test_results() - print " testing against", hd2, "&", sd2, "...", - t.predict(MsgStream(sd2), True, new_spam) - t.predict(MsgStream(hd2), False, new_ham) - print t.nham_tested, "hams &", t.nspam_tested, "spams" - - print " false positive:", t.false_positive_rate() - print " false negative:", t.false_negative_rate() - - newfpos = Set(t.false_positives()) - falsepos - falsepos |= newfpos - print " new false positives:", [e.path for e in newfpos] - for e in newfpos: - print '*' * 78 - print e.path - prob, clues = c.spamprob(e, True) - print "prob =", prob - for clue in clues: - print "prob(%r) = %g" % clue - print - print e.guts - - newfneg = Set(t.false_negatives()) - falseneg - falseneg |= newfneg - print " new false negatives:", [e.path for e in newfneg] - for e in []:#newfneg: - print '*' * 78 - print e.path - prob, clues = c.spamprob(e, True) - print "prob =", prob - for clue in clues: - print "prob(%r) = %g" % clue - print - print e.guts[:1000] - - print - print " best discriminators:" - stats = [(r.killcount, w) for w, r in c.wordinfo.iteritems()] - stats.sort() - del stats[:-30] - for count, w in stats: - r = c.wordinfo[w] - print " %r %d %g" % (w, r.killcount, r.spamprob) - - - printhist("this pair:", local_ham_hist, local_spam_hist) - - trained_ham_hist += local_ham_hist - trained_spam_hist += local_spam_hist - - printhist("all in this set:", trained_ham_hist, trained_spam_hist) - global_ham_hist += trained_ham_hist - global_spam_hist += trained_spam_hist - - printhist("all runs:", global_ham_hist, global_spam_hist) if __name__ == "__main__": --- 208,213 ---- continue d.test(MsgStream(hd2), MsgStream(sd2)) ! d.finishtest() d.alldone() if __name__ == "__main__": From rubiconx@users.sourceforge.net Fri Sep 6 23:53:51 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Fri, 06 Sep 2002 15:53:51 -0700 Subject: [Spambayes-checkins] spambayes classifier.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv22277 Modified Files: classifier.py Log Message: Another hack to get classifier to work with the database back-end. This makes hammie work with the -d option. 
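Why the one-line change below matters: dbm/shelve-style mappings only write a value back when a key is assigned, so mutating a record fetched from the store changes an in-memory copy only. A minimal sketch of the idiom (illustrative only, not part of the checkin; shelve and the file name are stand-ins for hammie's -d database back-end):

    import shelve

    db = shelve.open("demo.db")        # stand-in for the -d store
    db["word"] = {"spamcount": 0}

    record = db["word"]                # fetch makes an in-memory copy
    record["spamcount"] += 1           # mutates only the copy
    db["word"] = record                # reassign so the store sees the change
    db.close()

A plain in-memory dict never shows the problem, which is why the classifier worked fine until a database back-end was plugged in.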
Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** classifier.py 6 Sep 2002 19:29:56 -0000 1.2 --- classifier.py 6 Sep 2002 22:53:49 -0000 1.3 *************** *** 538,541 **** --- 538,542 ---- else: record.hamcount += 1 + wordinfo[word] = record if self.DEBUG: From tim_one@users.sourceforge.net Sat Sep 7 01:31:58 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 06 Sep 2002 17:31:58 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.9,1.10 timtoken.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13865 Modified Files: timtest.py timtoken.py Log Message: Added note about boosting the lower limit on word length to 4. Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** timtest.py 6 Sep 2002 22:47:48 -0000 1.9 --- timtest.py 7 Sep 2002 00:31:56 -0000 1.10 *************** *** 129,132 **** --- 129,138 ---- self.trained_spam_hist = Hist(self.nbuckets) + #f = file('w.pik', 'wb') + #pickle.dump(self.classifier, f, 1) + #f.close() + #import sys + #sys.exit(0) + def finishtest(self): printhist("all in this training set:", Index: timtoken.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** timtoken.py 6 Sep 2002 20:42:40 -0000 1.3 --- timtoken.py 7 Sep 2002 00:31:56 -0000 1.4 *************** *** 394,397 **** --- 394,399 ---- # XXX Later: A test with no lower bound showed a significant increase # XXX in the f-n rate. Curious! + # XXX Later: Boosting the lower bound to 4 is a Bad Idea too: f-p and + # XXX f-n rates both suffered then. # Make sure this range matches in tokenize(). From tim_one@users.sourceforge.net Sat Sep 7 02:39:57 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 06 Sep 2002 18:39:57 -0700 Subject: [Spambayes-checkins] spambayes timtoken.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv27364 Modified Files: timtoken.py Log Message: Comments about how long a word should be; the current values are the best. Index: timtoken.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** timtoken.py 7 Sep 2002 00:31:56 -0000 1.4 --- timtoken.py 7 Sep 2002 01:39:55 -0000 1.5 *************** *** 356,359 **** --- 356,379 ---- # XXX not to strip HTML from HTML-only msgs should be revisited. + ############################################################################## + # How big should "a word" be? + # + # As I write this, words less than 3 chars are ignored completely, and words + # with more than 12 are special-cased, replaced with a summary "I skipped + # about so-and-so many chars starting with such-and-such a letter" token. + # This makes sense for English if most of the info is in "regular size" + # words. + # + # A test run boosting to 13 had no effect on f-p rate, and did a little + # better or worse than 12 across runs -- overall, no significant difference. + # The database size is smaller at 12, so there's nothing in favor of 13. 
+ # A test at 11 showed a slight but consistent bad effect on the f-n rate + # (lost 12 times, won once, tied 7 times). + # + # A test with no lower bound showed a significant increase in the f-n rate. + # Curious, but not worth digging into. Boosting the lower bound to 4 is a + # worse idea: f-p and f-n rates both suffered significantly then. I didn't + # try testing with lower bound 2. + url_re = re.compile(r""" (https? | ftp) # capture the protocol *************** *** 386,399 **** def tokenize_word(word, _len=len): n = _len(word) - - # XXX How big should "a word" be? - # XXX I expect 12 is fine -- a test run boosting to 13 had no effect - # XXX on f-p rate, and did a little better or worse than 12 across - # XXX runs -- overall, no significant difference. It's only "common - # XXX sense" so far driving the exclusion of lengths 1 and 2. - # XXX Later: A test with no lower bound showed a significant increase - # XXX in the f-n rate. Curious! - # XXX Later: Boosting the lower bound to 4 is a Bad Idea too: f-p and - # XXX f-n rates both suffered then. # Make sure this range matches in tokenize(). --- 406,409 ---- From tim_one@users.sourceforge.net Sat Sep 7 02:41:30 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 06 Sep 2002 18:41:30 -0700 Subject: [Spambayes-checkins] spambayes timtoken.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv27775 Modified Files: timtoken.py Log Message: Fixed typo in comment. Index: timtoken.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** timtoken.py 7 Sep 2002 01:39:55 -0000 1.5 --- timtoken.py 7 Sep 2002 01:41:28 -0000 1.6 *************** *** 467,471 **** # # A bug in this code prevented Content-Transfer-Encoding from getting ! # picked up. Fixing that bug showed that it didn't helpe, so the corrected # code is disabled now (left column without Content-Transfer-Encoding, # right column with it); --- 467,471 ---- # # A bug in this code prevented Content-Transfer-Encoding from getting ! # picked up. Fixing that bug showed that it didn't help, so the corrected # code is disabled now (left column without Content-Transfer-Encoding, # right column with it); From gvanrossum@users.sourceforge.net Sat Sep 7 05:20:45 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 21:20:45 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv4871 Modified Files: hammie.py Log Message: Fixed a bug in the opening of a folder given with "+foo" (wasn't using _factory). Add a -u option similar to that of GBayes.py. For this, factored the opening of the mbox out of train() into a separate function getmbox(), and the formatting of the clues out of filter(). (The -u option needs work; it currently doesn't report the message number in a very useful way.) Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** hammie.py 6 Sep 2002 20:48:29 -0000 1.5 --- hammie.py 7 Sep 2002 04:20:43 -0000 1.6 *************** *** 10,16 **** show usage and exit -g PATH ! mbox or directory of known good messages (non-spam) -s PATH ! 
mbox or directory of known spam messages -p FILE use file as the persistent store. loads data from this file if it --- 10,18 ---- show usage and exit -g PATH ! mbox or directory of known good messages (non-spam) to train on. -s PATH ! mbox or directory of known spam messages to train on. ! -u PATH ! mbox of unknown messages. A ham/spam decision is reported for each. -p FILE use file as the persistent store. loads data from this file if it *************** *** 179,184 **** ! def train(bayes, msgs, is_spam): ! """Train bayes with all messages from a mailbox.""" def _factory(fp): try: --- 181,186 ---- ! def getmbox(msgs): ! """Return an iterable mbox object given a file/directory/folder name.""" def _factory(fp): try: *************** *** 190,194 **** import mhlib mh = mhlib.MH() ! mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:])) elif os.path.isdir(msgs): # XXX Bogus: use an MHMailbox if the pathname contains /Mail/, --- 192,197 ---- import mhlib mh = mhlib.MH() ! mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]), ! _factory) elif os.path.isdir(msgs): # XXX Bogus: use an MHMailbox if the pathname contains /Mail/, *************** *** 201,205 **** --- 204,212 ---- fp = open(msgs) mbox = mailbox.PortableUnixMailbox(fp, _factory) + return mbox + def train(bayes, msgs, is_spam): + """Train bayes with all messages from a mailbox.""" + mbox = getmbox(msgs) i = 0 for msg in mbox: *************** *** 212,215 **** --- 219,227 ---- print + def formatclues(clues, sep="; "): + """Format the clues into something readable.""" + # XXX Maybe sort by prob first? + return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) + def filter(bayes, input, output): """Filter (judge) a message""" *************** *** 221,228 **** disp = "Yes" disp += "; %.2f" % prob ! disp += "; " + "; ".join(map(lambda x: "%s: %.2f" % (`x[0]`, x[1]), clues)) msg.add_header("X-Spam-Disposition", disp) output.write(str(msg)) def usage(code, msg=''): if msg: --- 233,259 ---- disp = "Yes" disp += "; %.2f" % prob ! disp += "; " + formatclues(clues) msg.add_header("X-Spam-Disposition", disp) output.write(str(msg)) + def score(bayes, msgs): + """Score (judge) all messages from a mailbox.""" + # XXX The reporting needs work! + mbox = getmbox(msgs) + i = 0 + spams = hams = 0 + for msg in mbox: + i += 1 + prob, clues = bayes.spamprob(tokenize(str(msg)), True) + isspam = prob >= 0.9 + print "%6d %4.2f %1s" % (i, prob, isspam and "S" or "."), + if isspam: + spams += 1 + print formatclues(clues) + else: + hams += 1 + print + print "Total %d spam, %d ham" % (spams, hams) + def usage(code, msg=''): if msg: *************** *** 234,238 **** def main(): try: ! opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:') except getopt.error, msg: usage(1, msg) --- 265,269 ---- def main(): try: ! opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:') except getopt.error, msg: usage(1, msg) *************** *** 242,246 **** pck = "hammie.db" ! good = spam = None do_filter = usedb = False for opt, arg in opts: --- 273,277 ---- pck = "hammie.db" ! 
good = spam = unknown = None do_filter = usedb = False for opt, arg in opts: *************** *** 257,260 **** --- 288,293 ---- elif opt == "-f": do_filter = True + elif opt == '-u': + unknown = arg if args: usage(1) *************** *** 294,297 **** --- 327,333 ---- if do_filter: filter(bayes, sys.stdin, sys.stdout) + + if unknown: + score(bayes, unknown) if __name__ == "__main__": From gvanrossum@users.sourceforge.net Sat Sep 7 05:23:18 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 21:23:18 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5312 Modified Files: hammie.py Log Message: Sort the clues before formatting. I definitely like this better. Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** hammie.py 7 Sep 2002 04:20:43 -0000 1.6 --- hammie.py 7 Sep 2002 04:23:15 -0000 1.7 *************** *** 221,226 **** def formatclues(clues, sep="; "): """Format the clues into something readable.""" ! # XXX Maybe sort by prob first? ! return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) def filter(bayes, input, output): --- 221,227 ---- def formatclues(clues, sep="; "): """Format the clues into something readable.""" ! lst = [(prob, word) for word, prob in clues] ! lst.sort() ! return sep.join(["%r: %.2f" % (word, prob) for prob, word in lst]) def filter(bayes, input, output): From gvanrossum@users.sourceforge.net Sat Sep 7 05:28:16 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 21:28:16 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv6070 Modified Files: README.txt Log Message: Add a clue about the Python version. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** README.txt 6 Sep 2002 20:13:31 -0000 1.6 --- README.txt 7 Sep 2002 04:28:13 -0000 1.7 *************** *** 16,19 **** --- 16,22 ---- negative rate is still over 1%. + The code here depends in various ways on the latest Python from CVS + (a.k.a. Python 2.3a0 :-). + Primary Files From gvanrossum@users.sourceforge.net Sat Sep 7 05:31:10 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 21:31:10 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.7,1.8 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv6491 Modified Files: hammie.py Log Message: Minor cleanup; standardize exit codes; add some docs/comments. Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** hammie.py 7 Sep 2002 04:23:15 -0000 1.7 --- hammie.py 7 Sep 2002 04:31:08 -0000 1.8 *************** *** 1,3 **** --- 1,4 ---- #! /usr/bin/env python + # At the moment, this requires Python 2.3 from CVS # A driver for the classifier module. Currently mostly a wrapper around *************** *** 27,32 **** """ - from __future__ import generators - import sys import os --- 28,31 ---- *************** *** 40,44 **** import cPickle as pickle ! 
program = sys.argv[0] # Tim's tokenizer kicks far more booty than anything I would have --- 39,43 ---- import cPickle as pickle ! program = sys.argv[0] # For usage(); referenced by docstring above # Tim's tokenizer kicks far more booty than anything I would have *************** *** 258,261 **** --- 257,261 ---- def usage(code, msg=''): + """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg *************** *** 265,275 **** def main(): try: opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:') except getopt.error, msg: ! usage(1, msg) if not opts: ! usage(0, "No options given") pck = "hammie.db" --- 265,276 ---- def main(): + """Main program; parse options and go.""" try: opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:') except getopt.error, msg: ! usage(2, msg) if not opts: ! usage(2, "No options given") pck = "hammie.db" *************** *** 292,296 **** unknown = arg if args: ! usage(1) save = False --- 293,297 ---- unknown = arg if args: ! usage(2, "Positional arguments not allowed") save = False From gvanrossum@users.sourceforge.net Sat Sep 7 05:50:12 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 21:50:12 -0700 Subject: [Spambayes-checkins] spambayes timtoken.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9264 Modified Files: timtoken.py Log Message: Made tokenize() polymorphic. It now accepts an email.Message.Message instance, a file-like object (something with a readline method), or a string (anything else). This is a major speed boost for hammie.py, which has Message objects, but had to convert them to strings before passing to tokenize(), which parsed the string into a Message object again! Index: timtoken.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** timtoken.py 7 Sep 2002 01:41:28 -0000 1.6 --- timtoken.py 7 Sep 2002 04:50:10 -0000 1.7 *************** *** 2,6 **** import email - from email import message_from_string from sets import Set --- 2,5 ---- *************** *** 555,566 **** yield 'content-transfer-encoding:' + x.lower() ! def tokenize(string): # Create an email Message object. ! try: ! msg = message_from_string(string) ! except email.Errors.MessageParseError: ! yield 'control: MessageParseError' ! # XXX Fall back to the raw body text? ! return # Special tagging of header lines. --- 554,570 ---- yield 'content-transfer-encoding:' + x.lower() ! def tokenize(obj): # Create an email Message object. ! if isinstance(obj, email.Message.Message): ! msg = obj ! elif hasattr(obj, "readline"): ! msg = email.message_from_file(obj) ! else: ! try: ! msg = email.message_from_string(obj) ! except email.Errors.MessageParseError: ! yield 'control: MessageParseError' ! # XXX Fall back to the raw body text? ! return # Special tagging of header lines. From gvanrossum@users.sourceforge.net Sat Sep 7 05:50:47 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 21:50:47 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.8,1.9 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9362 Modified Files: hammie.py Log Message: Use the new tokenize(), which accepts our Message objects. 
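The dispatch order tokenize() now uses, as a standalone sketch (get_message is an illustrative name, not the checked-in code; it assumes the 2.2-era email package, and error handling is omitted -- the real code yields a 'control: MessageParseError' token when parsing fails):

    import email
    import email.Message

    def get_message(obj):
        # Most specific test first: an already-parsed Message passes
        # straight through, avoiding a str(msg)/reparse round trip.
        if isinstance(obj, email.Message.Message):
            return obj
        # A file-like object is anything with a readline method.
        if hasattr(obj, "readline"):
            return email.message_from_file(obj)
        # Everything else is treated as a string.
        return email.message_from_string(obj)

Passing a Message through unchanged is where the speedup comes from: hammie already has Message objects, and previously flattened each one with str(msg) only for tokenize() to parse the string right back into a Message.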
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** hammie.py	7 Sep 2002 04:31:08 -0000	1.8
--- hammie.py	7 Sep 2002 04:50:45 -0000	1.9
***************
*** 215,219 ****
          sys.stdout.write("\r%6d" % i)
          sys.stdout.flush()
!         bayes.learn(tokenize(str(msg)), is_spam, False)

      print
--- 215,219 ----
          sys.stdout.write("\r%6d" % i)
          sys.stdout.flush()
!         bayes.learn(tokenize(msg), is_spam, False)

      print
***************
*** 227,231 ****
      """Filter (judge) a message"""
      msg = email.message_from_file(input)
!     prob, clues = bayes.spamprob(tokenize(str(msg)), True)
      if prob < 0.9:
          disp = "No"
--- 227,231 ----
      """Filter (judge) a message"""
      msg = email.message_from_file(input)
!     prob, clues = bayes.spamprob(tokenize(msg), True)
      if prob < 0.9:
          disp = "No"
***************
*** 245,249 ****
      for msg in mbox:
          i += 1
!         prob, clues = bayes.spamprob(tokenize(str(msg)), True)
          isspam = prob >= 0.9
          print "%6d %4.2f %1s" % (i, prob, isspam and "S" or "."),
--- 245,249 ----
      for msg in mbox:
          i += 1
!         prob, clues = bayes.spamprob(tokenize(msg), True)
          isspam = prob >= 0.9
          print "%6d %4.2f %1s" % (i, prob, isspam and "S" or "."),

From gvanrossum@users.sourceforge.net Sat Sep 7 06:02:58 2002
From: gvanrossum@users.sourceforge.net (Guido van Rossum)
Date: Fri, 06 Sep 2002 22:02:58 -0700
Subject: [Spambayes-checkins] spambayes .cvsignore,NONE,1.1
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11644

Added Files:
	.cvsignore
Log Message:
Ignore certain files.

--- NEW FILE: .cvsignore ---
*.pyc
*.pyo
*.db
*.pik
*.zip

From tim_one@users.sourceforge.net Sat Sep 7 06:11:34 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Fri, 06 Sep 2002 22:11:34 -0700
Subject: [Spambayes-checkins] spambayes classifier.py,1.3,1.4 timtest.py,1.10,1.11
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12667

Modified Files:
	classifier.py timtest.py
Log Message:
Shaking things up! MINCOUNT is history. This yields a major
improvement in the f-n rate, but may have knocked the f-p rate out of
a local minimum. I considered this carefully, and expect you'll agree
it's a good change if you read the new comments. There's surely a
better way to get the tiny bit of good that was hiding under
MINCOUNT's bad effects.

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** classifier.py	6 Sep 2002 22:53:49 -0000	1.3
--- classifier.py	7 Sep 2002 05:11:30 -0000	1.4
***************
*** 58,71 ****
  # appropriate bias factor.)
  #
! # XXX Reducing this to 1.0 (effectively not using it at all then) seemed to
! # XXX give a sharp reduction in the f-n rate in a partial test run, while
! # XXX adding a few mysterious f-ps. Then boosting it to 2.0 appeared to
! # XXX give an increase in the f-n rate in a partial test run. This needs
! # XXX deeper investigation. Might also be good to develop a more general
! # XXX concept of confidence: MINCOUNT is a gross gimmick in that direction,
! # XXX effectively saying we have no confidence in probabilities computed
! # XXX from fewer than MINCOUNT instances, but unbounded confidence in
! # XXX probabilities computed from at least MINCOUNT instances.
! MINCOUNT = 5.0

  # The maximum number of words spamprob() pays attention to.
Graham had 15 --- 58,145 ---- # appropriate bias factor.) # ! # Twist: Graham used MINCOUNT=5.0 here. I got rid of it: in effect, ! # given HAMBIAS=2.0, it meant we ignored a possibly perfectly good piece ! # of spam evidence unless it appeared at least 5 times, and ditto for ! # ham evidence unless it appeared at least 3 times. That certainly does ! # bias in favor of ham, but multiple distortions in favor of ham are ! # multiple ways to get confused and trip up. Here are the test results ! # before and after, MINCOUNT=5.0 on the left, no MINCOUNT on the right; ! # ham sets had 4000 msgs (so 0.025% is one msg), and spam sets 2750: ! # ! # false positive percentages ! # 0.000 0.000 tied ! # 0.000 0.000 tied ! # 0.100 0.050 won -50.00% ! # 0.000 0.025 lost +(was 0) ! # 0.025 0.075 lost +200.00% ! # 0.025 0.000 won -100.00% ! # 0.100 0.100 tied ! # 0.025 0.050 lost +100.00% ! # 0.025 0.025 tied ! # 0.050 0.025 won -50.00% ! # 0.100 0.050 won -50.00% ! # 0.025 0.050 lost +100.00% ! # 0.025 0.050 lost +100.00% ! # 0.025 0.000 won -100.00% ! # 0.025 0.000 won -100.00% ! # 0.025 0.075 lost +200.00% ! # 0.025 0.025 tied ! # 0.000 0.000 tied ! # 0.025 0.025 tied ! # 0.100 0.050 won -50.00% ! # ! # won 7 times ! # tied 7 times ! # lost 6 times ! # ! # total unique fp went from 9 to 13 ! # ! # false negative percentages ! # 0.364 0.327 won -10.16% ! # 0.400 0.400 tied ! # 0.400 0.327 won -18.25% ! # 0.909 0.691 won -23.98% ! # 0.836 0.545 won -34.81% ! # 0.618 0.291 won -52.91% ! # 0.291 0.218 won -25.09% ! # 1.018 0.654 won -35.76% ! # 0.982 0.364 won -62.93% ! # 0.727 0.291 won -59.97% ! # 0.800 0.327 won -59.13% ! # 1.163 0.691 won -40.58% ! # 0.764 0.582 won -23.82% ! # 0.473 0.291 won -38.48% ! # 0.473 0.364 won -23.04% ! # 0.727 0.436 won -40.03% ! # 0.655 0.436 won -33.44% ! # 0.509 0.218 won -57.17% ! # 0.545 0.291 won -46.61% ! # 0.509 0.254 won -50.10% ! # ! # won 19 times ! # tied 1 times ! # lost 0 times ! # ! # total unique fn went from 168 to 106 ! # ! # So dropping MINCOUNT was a huge win for the f-n rate, and a mixed bag ! # for the f-p rate (but the f-p rate was so low compared to 4000 msgs that ! # even the losses were barely significant). In addition, dropping MINCOUNT ! # had a larger good effect when using random training subsets of size 500; ! # this makes intuitive sense, as with less training data it was harder to ! # exceed the MINCOUNT threshold. ! # ! # Still, MINCOUNT seemed to be a gross approximation to *something* valuable: ! # a strong clue appearing in 1,000 training msgs is certainly more trustworthy ! # than an equally strong clue appearing in only 1 msg. I'm almost certain it ! # would pay to develop a way to take that into account when scoring. In ! # particular, there was a very specific new class of false positives ! # introduced by dropping MINCOUNT: some c.l.py msgs consisting mostly of ! # Spanish or French. The "high probability" spam clues were innocuous ! # words like "puedo" and "como", that appeared in very rare Spanish and ! # French spam too. There has to be a more principled way to address this ! # than the MINCOUNT hammer, and the test results clearly showed that MINCOUNT ! # did more harm than good overall. ! # The maximum number of words spamprob() pays attention to. Graham had 15 *************** *** 477,493 **** hamcount = HAMBIAS * record.hamcount spamcount = SPAMBIAS * record.spamcount ! if hamcount + spamcount < MINCOUNT: ! prob = UNKNOWN_SPAMPROB ! else: ! hamratio = min(1.0, hamcount / nham) ! 
spamratio = min(1.0, spamcount / nspam) - prob = spamratio / (hamratio + spamratio) - if prob < MIN_SPAMPROB: - prob = MIN_SPAMPROB - elif prob > MAX_SPAMPROB: - prob = MAX_SPAMPROB if record.spamprob != prob: record.spamprob = prob self.wordinfo[word] = record --- 551,567 ---- hamcount = HAMBIAS * record.hamcount spamcount = SPAMBIAS * record.spamcount ! hamratio = min(1.0, hamcount / nham) ! spamratio = min(1.0, spamcount / nspam) ! ! prob = spamratio / (hamratio + spamratio) ! if prob < MIN_SPAMPROB: ! prob = MIN_SPAMPROB ! elif prob > MAX_SPAMPROB: ! prob = MAX_SPAMPROB if record.spamprob != prob: record.spamprob = prob + # The next seemingly pointless line appears to be a hack + # to allow a persistent db to realize the record has changed. self.wordinfo[word] = record *************** *** 497,515 **** print "P(%r) = %g" % (w, r.spamprob) ! def clearjunk(self, oldesttime, mincount=MINCOUNT): """Forget useless wordinfo records. This can shrink the database size. A record for a word will be retained only if the word was accessed ! at or after oldesttime, or appeared at least mincount times in ! messages passed to learn(). mincount is optional, and defaults ! to the value an internal algorithm uses to decide that a word is so ! rare that it has no predictive value. """ wordinfo = self.wordinfo mincount = float(mincount) ! tonuke = [w for w, r in wordinfo.iteritems() ! if r.atime < oldesttime and ! SPAMBIAS*r.spamcount + HAMBIAS*r.hamcount < mincount] for w in tonuke: if self.DEBUG: --- 571,584 ---- print "P(%r) = %g" % (w, r.spamprob) ! def clearjunk(self, oldesttime): """Forget useless wordinfo records. This can shrink the database size. A record for a word will be retained only if the word was accessed ! at or after oldesttime. """ wordinfo = self.wordinfo mincount = float(mincount) ! tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime] for w in tonuke: if self.DEBUG: Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** timtest.py 7 Sep 2002 00:31:56 -0000 1.10 --- timtest.py 7 Sep 2002 05:11:31 -0000 1.11 *************** *** 98,101 **** --- 98,110 ---- yield Msg(directory, fname) + def xproduce(self): + import random + directory = self.directory + all = os.listdir(directory) + random.seed(hash(directory)) + random.shuffle(all) + for fname in all[-500:]: + yield Msg(directory, fname) + def __iter__(self): return self.produce() From tim.one@comcast.net Sat Sep 7 06:20:47 2002 From: tim.one@comcast.net (Tim Peters) Date: Sat, 07 Sep 2002 01:20:47 -0400 Subject: [Spambayes-checkins] spambayes timtoken.py,1.6,1.7 In-Reply-To: Message-ID: [Guido] > Modified Files: > timtoken.py > Log Message: > Made tokenize() polymorphic. It now accepts an email.Message.Message > instance, a file-like object (something with a readline method), or a > string (anything else). Good change. One question/concern: > ... > --- 2,5 ---- > *************** > *** 555,566 **** > yield 'content-transfer-encoding:' + x.lower() > > ! def tokenize(string): > # Create an email Message object. > ! try: > ! msg = message_from_string(string) > ! except email.Errors.MessageParseError: > ! yield 'control: MessageParseError' > ! # XXX Fall back to the raw body text? > ! return > > # Special tagging of header lines. > --- 554,570 ---- > yield 'content-transfer-encoding:' + x.lower() > > ! def tokenize(obj): > # Create an email Message object. > ! 
if isinstance(obj, email.Message.Message): > ! msg = obj > ! elif hasattr(obj, "readline"): > ! msg = email.message_from_file(obj) > ! else: > ! try: > ! msg = email.message_from_string(obj) > ! except email.Errors.MessageParseError: > ! yield 'control: MessageParseError' > ! # XXX Fall back to the raw body text? > ! return > > # Special tagging of header lines. It's a fact of life that some messages can't be parsed by the email package, and the code was careful to catch that when parsing from a string. I don't see anything here to protect the system from dying if a message can't be parsed from file. Barry, when would MessageParseError get raised then? At the time message_from_file() is called (in which case fixing the above is easy), or at some later time when trying to invoke some method of the Message object (in which case I'm not sure what to do)? From guido@python.org Sat Sep 7 06:35:37 2002 From: guido@python.org (Guido van Rossum) Date: Sat, 07 Sep 2002 01:35:37 -0400 Subject: [Spambayes-checkins] spambayes timtoken.py,1.6,1.7 In-Reply-To: Your message of "Sat, 07 Sep 2002 01:20:47 EDT." References: Message-ID: <200209070535.g875Zbm13523@pcp02138704pcs.reston01.va.comcast.net> > > Made tokenize() polymorphic. It now accepts an email.Message.Message > > instance, a file-like object (something with a readline method), or a > > string (anything else). > > Good change. One question/concern: > > > ! def tokenize(obj): > > # Create an email Message object. > > ! if isinstance(obj, email.Message.Message): > > ! msg = obj > > ! elif hasattr(obj, "readline"): > > ! msg = email.message_from_file(obj) > > ! else: > > ! try: > > ! msg = email.message_from_string(obj) > > ! except email.Errors.MessageParseError: > > ! yield 'control: MessageParseError' > > ! # XXX Fall back to the raw body text? > > ! return > > > > # Special tagging of header lines. > > It's a fact of life that some messages can't be parsed by the email package, > and the code was careful to catch that when parsing from a string. I don't > see anything here to protect the system from dying if a message can't be > parsed from file. Barry, when would MessageParseError get raised then? At > the time message_from_file() is called (in which case fixing the above is > easy), or at some later time when trying to invoke some method of the > Message object (in which case I'm not sure what to do)? I'm guessing at the time that message_from_file() is called; message_from_string() is a thin layer on top of that using StringIO, so if the above code works for message_from_string(), it should work for message_from_file(). I'll add it. --Guido van Rossum (home page: http://www.python.org/~guido/) From gvanrossum@users.sourceforge.net Sat Sep 7 06:43:11 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 22:43:11 -0700 Subject: [Spambayes-checkins] spambayes timtoken.py,1.7,1.8 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18320 Modified Files: timtoken.py Log Message: Catch MessageParseError when calling message_from_file() too. Index: timtoken.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtoken.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** timtoken.py 7 Sep 2002 04:50:10 -0000 1.7 --- timtoken.py 7 Sep 2002 05:43:08 -0000 1.8 *************** *** 559,563 **** msg = obj elif hasattr(obj, "readline"): ! 
msg = email.message_from_file(obj) else: try: --- 559,568 ---- msg = obj elif hasattr(obj, "readline"): ! try: ! msg = email.message_from_file(obj) ! except email.Errors.MessageParseError: ! yield 'control: MessageParseError' ! # XXX Fall back to the raw body text? ! return else: try: From montanaro@users.sourceforge.net Sat Sep 7 06:50:44 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Fri, 06 Sep 2002 22:50:44 -0700 Subject: [Spambayes-checkins] spambayes unheader.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv19376 Added Files: unheader.py Log Message: script to remove unwanted headers from mbox files --- NEW FILE: unheader.py --- #!/usr/bin/env python import re import sys import mailbox import email.Parser import email.Message import getopt def unheader(msg, pat): pat = re.compile(pat) for hdr in msg.keys(): if pat.match(hdr): del msg[hdr] class Message(email.Message.Message): def replace_header(self, hdr, newval): """replace first value for hdr with newval""" hdr = hdr.lower() for (i, (k, v)) in enumerate(self._headers): if k.lower() == hdr: self._headers[i] = (k, newval) class Parser(email.Parser.Parser): def __init__(self): email.Parser.Parser.__init__(self, Message) def deSA(msg): if msg['X-Spam-Status']: if msg['X-Spam-Status'].startswith('Yes'): pct = msg['X-Spam-Prev-Content-Type'] if pct: msg['Content-Type'] = pct pcte = msg['X-Spam-Prev-Content-Transfer-Encoding'] if pcte: msg['Content-Transfer-Encoding'] = pcte subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', msg['Subject']) if subj != msg["Subject"]: msg.replace_header("Subject", subj) body = msg.get_payload() newbody = [] at_start = 1 for line in body.splitlines(): if at_start and line.startswith('SPAM: '): continue elif at_start: at_start = 0 else: newbody.append(line) msg.set_payload("\n".join(newbody)) unheader(msg, "X-Spam-") def process_mailbox(f, dosa=1, pats=None): for msg in mailbox.PortableUnixMailbox(f, Parser().parse): if pats is not None: unheader(msg, pats) if dosa: deSA(msg) print msg def usage(): print >> sys.stderr, "usage: unheader.py [ -p pat ... 
] [ -s ]" print >> sys.stderr, "-p pat gives a regex pattern used to eliminate unwanted headers" print >> sys.stderr, "'-p pat' may be given multiple times" print >> sys.stderr, "-s tells not to remove SpamAssassin headers" def main(args): headerpats = [] dosa = 1 try: opts, args = getopt.getopt(args, "p:sh") except getopt.GetoptError: usage() sys.exit(1) else: for opt, arg in opts: if opt == "-h": usage() sys.exit(0) elif opt == "-p": headerpats.append(arg) elif opt == "-s": dosa = 0 pats = headerpats and "|".join(headerpats) or None if not args: f = sys.stdin elif len(args) == 1: f = file(args[0]) else: usage() sys.exit(1) process_mailbox(f, dosa, pats) if __name__ == "__main__": main(sys.argv[1:]) From montanaro@users.sourceforge.net Sat Sep 7 06:51:07 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Fri, 06 Sep 2002 22:51:07 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.7,1.8 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv19440 Modified Files: README.txt Log Message: add blurb about unheader.py Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** README.txt 7 Sep 2002 04:28:13 -0000 1.7 --- README.txt 7 Sep 2002 05:51:05 -0000 1.8 *************** *** 50,53 **** --- 50,57 ---- tokenize() function of your choosing. + unheader.py + A script to remove unwanted headers from an mbox file. This is mostly + useful to delete headers which incorrectly might bias the results. + GBayes.py A number of tokenizers and a partial test driver. This assumes From montanaro@users.sourceforge.net Sat Sep 7 06:52:50 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Fri, 06 Sep 2002 22:52:50 -0700 Subject: [Spambayes-checkins] spambayes setup.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv19640 Modified Files: setup.py Log Message: * handle timtoken.py, unheader.py and hammie.py * zap GBayes.py * should timtoken and classifier go into a spambayes package in site-packages? Index: setup.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/setup.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** setup.py 5 Sep 2002 16:16:43 -0000 1.1 --- setup.py 7 Sep 2002 05:52:48 -0000 1.2 *************** *** 3,8 **** setup( name='spambayes', ! scripts=['GBayes.py'], ! py_modules=['classifier'] ) --- 3,8 ---- setup( name='spambayes', ! scripts=['unheader.py', 'hammie.py'], ! 
py_modules=['classifier', 'timtoken'] ) From montanaro@users.sourceforge.net Sat Sep 7 06:53:15 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Fri, 06 Sep 2002 22:53:15 -0700 Subject: [Spambayes-checkins] spambayes .cvsignore,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv19725 Modified Files: .cvsignore Log Message: ignore the distutils build dir Index: .cvsignore =================================================================== RCS file: /cvsroot/spambayes/spambayes/.cvsignore,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** .cvsignore 7 Sep 2002 05:02:56 -0000 1.1 --- .cvsignore 7 Sep 2002 05:53:12 -0000 1.2 *************** *** 4,5 **** --- 4,6 ---- *.pik *.zip + build From rubiconx@users.sourceforge.net Sat Sep 7 07:11:12 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Fri, 06 Sep 2002 23:11:12 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.9,1.10 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv22737 Modified Files: hammie.py Log Message: Changes X-Spam-Disposition header to X-Hammie-Disposition Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** hammie.py 7 Sep 2002 04:50:45 -0000 1.9 --- hammie.py 7 Sep 2002 06:11:10 -0000 1.10 *************** *** 3,7 **** # A driver for the classifier module. Currently mostly a wrapper around ! # existing stuff. """Usage: %(program)s [options] --- 3,8 ---- # A driver for the classifier module. Currently mostly a wrapper around ! # existing stuff. Neale Pickett is the person to ! # blame for this. """Usage: %(program)s [options] *************** *** 41,44 **** --- 42,48 ---- program = sys.argv[0] # For usage(); referenced by docstring above + # Name of the header to add in filter mode + DISPHEADER = "X-Hammie-Disposition" + # Tim's tokenizer kicks far more booty than anything I would have # written. Score one for analysis ;) *************** *** 75,79 **** raise KeyError(key) ! def __setitem__(self, key, val): v = pickle.dumps(val, 1) self.hash[key] = v --- 79,83 ---- raise KeyError(key) ! def __setitem__(self, key, val): v = pickle.dumps(val, 1) self.hash[key] = v *************** *** 86,90 **** while k != None: key = k[0] ! val = pickle.loads(k[1]) if key not in self.iterskip: if fn: --- 90,94 ---- while k != None: key = k[0] ! val = self.__getitem__(key) if key not in self.iterskip: if fn: *************** *** 234,238 **** disp += "; %.2f" % prob disp += "; " + formatclues(clues) ! msg.add_header("X-Spam-Disposition", disp) output.write(str(msg)) --- 238,242 ---- disp += "; %.2f" % prob disp += "; " + formatclues(clues) ! msg.add_header(DISPHEADER, disp) output.write(str(msg)) From gvanrossum@users.sourceforge.net Sat Sep 7 07:18:05 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 06 Sep 2002 23:18:05 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.10,1.11 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24222 Modified Files: hammie.py Log Message: filter(): output 'unixfrom' line only if it was present on input. 
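For reference, the trick behind hammie's dict-on-a-database wrapper touched in the 1.10 checkin above, reduced to a sketch (PickleDict, anydbm, and the file name are illustrative stand-ins; the real class also tracks keys to skip during iteration): dbm files store only string keys and string values, so values are pickled on the way in and unpickled on the way out.

    import anydbm                    # assumption: any dbm-style module works here
    import cPickle as pickle

    class PickleDict:
        def __init__(self, path):
            self.hash = anydbm.open(path, 'c')

        def __getitem__(self, key):
            # Unpickle the stored string back into the original object.
            return pickle.loads(self.hash[key])

        def __setitem__(self, key, val):
            # Protocol 1 is the binary pickle format; smaller than text.
            self.hash[key] = pickle.dumps(val, 1)

Routing iteration through __getitem__, as the 1.10 checkin does, keeps the unpickling logic in one place.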
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** hammie.py	7 Sep 2002 06:11:10 -0000	1.10
--- hammie.py	7 Sep 2002 06:18:03 -0000	1.11
***************
*** 239,243 ****
      disp += "; " + formatclues(clues)
      msg.add_header(DISPHEADER, disp)
!     output.write(str(msg))

  def score(bayes, msgs):
--- 239,243 ----
      disp += "; " + formatclues(clues)
      msg.add_header(DISPHEADER, disp)
!     output.write(msg.as_string(unixfrom=(msg.get_unixfrom() is not None)))

  def score(bayes, msgs):

From jhylton@users.sourceforge.net Sat Sep 7 17:14:11 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Sat, 07 Sep 2002 09:14:11 -0700
Subject: [Spambayes-checkins] spambayes tokenizer.py,NONE,1.1 README.txt,1.8,1.9
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv17305

Modified Files:
	README.txt
Added Files:
	tokenizer.py
Log Message:
Refactor timtoken into tokenizer.

The Tokenizer class has two methods tokenize_headers() and
tokenize_body() that encapsulate most of timtoken's logic. It is a
little easier to extend this class than timtoken, because you can
override either header or body processing individually.

--- NEW FILE: tokenizer.py ---
"""Module to tokenize email messages for spam filtering."""

import email
import re
from sets import Set

# Find all the text components of the msg. There's no point decoding
# binary blobs (like images). If a multipart/alternative has both plain
# text and HTML versions of a msg, ignore the HTML part: HTML decorations
# have monster-high spam probabilities, and innocent newbies often post
# using HTML.
def textparts(msg):
    text = Set()
    redundant_html = Set()
    for part in msg.walk():
        if part.get_content_type() == 'multipart/alternative':
            # Descend this part of the tree, adding any redundant HTML text
            # part to redundant_html.
            htmlpart = textpart = None
            stack = part.get_payload()
            while stack:
                subpart = stack.pop()
                ctype = subpart.get_content_type()
                if ctype == 'text/plain':
                    textpart = subpart
                elif ctype == 'text/html':
                    htmlpart = subpart
                elif ctype == 'multipart/related':
                    stack.extend(subpart.get_payload())

            if textpart is not None:
                text.add(textpart)
                if htmlpart is not None:
                    redundant_html.add(htmlpart)
            elif htmlpart is not None:
                text.add(htmlpart)

        elif part.get_content_maintype() == 'text':
            text.add(part)

    return text - redundant_html

##############################################################################
# To fold case or not to fold case? I didn't want to fold case, because
# it hides information in English, and I have no idea what .lower() does
# to other languages; and, indeed, 'FREE' (all caps) turned out to be one
# of the strongest spam indicators in my content-only tests (== one with
# prob 0.99 *and* made it into spamprob's nbest list very often).
#
# Against preserving case, it makes the database size larger, and requires
# more training data to get enough "representative" mixed-case examples.
#
# Running my c.l.py tests didn't support my intuition that case was
# valuable, so it's getting folded away now. Folding or not made no
# significant difference to the false positive rate, and folding made a
# small (but statistically significant all the same) reduction in the
# false negative rate. There is one obvious difference: after folding
# case, conference announcements no longer got high spam scores. Their
# content was usually fine, but they were highly penalized for VISIT OUR
# WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING. That is
# indeed the language of advertising, and I halfway regret that folding
# away case no longer picks on them.
#
# Since the f-p rate didn't change, but conference announcements escaped
# that category, something else took their place. It seems to be highly
# off-topic messages, like debates about Microsoft's place in the world.
# Talk about "money" and "lucrative" is indistinguishable now from talk
# about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.

##############################################################################
# Character n-grams or words?
#
# With careful multiple-corpora c.l.py tests sticking to case-folded decoded
# text-only portions, and ignoring headers, and with identical special
# parsing & tagging of embedded URLs:
#
# Character 3-grams gave 5x as many false positives as split-on-whitespace
# (s-o-w). The f-n rate was also significantly worse, but within a factor
# of 2. So character 3-grams lost across the board.
#
# Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
# s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
# difference between 23 and 34 f-ps. There aren't enough there to say that's
# significantly more with killer-high confidence. There were plenty of f-ns,
# though, and the f-n rate with character 5-grams was substantially *worse*
# than with character 3-grams (which in turn was substantially worse than
# with s-o-w).
#
# Training on character 5-grams creates many more unique tokens than s-o-w:
# a typical run bloated to 150MB process size. It also ran a lot slower than
# s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
# dict. I rarely noticed disk activity when running s-o-w, so rarely bothered
# to look at process size; it was under 30MB last time I looked.
#
# Figuring out *why* a msg scored as it did proved much more mysterious when
# working with character n-grams: they often had no obvious "meaning". In
# contrast, it was always easy to figure out what s-o-w was picking up on.
# 5-grams flagged a msg from Christian Tismer as spam, where he was discussing
# the speed of tasklets under his new implementation of stackless:
#
#     prob = 0.99999998959
#     prob('ed sw') = 0.01
#     prob('http0:pgp') = 0.01
#     prob('http0:python') = 0.01
#     prob('hlon ') = 0.99
#     prob('http0:wwwkeys') = 0.01
#     prob('http0:starship') = 0.01
#     prob('http0:stackless') = 0.01
#     prob('n xp ') = 0.99
#     prob('on xp') = 0.99
#     prob('p 150') = 0.99
#     prob('lon x') = 0.99
#     prob(' amd ') = 0.99
#     prob(' xp 1') = 0.99
#     prob(' athl') = 0.99
#     prob('1500+') = 0.99
#     prob('xp 15') = 0.99
#
# The spam decision was baffling until I realized that *all* the high-
# probability spam 5-grams there came out of a single phrase:
#
#     AMD Athlon XP 1500+
#
# So Christian was punished for using a machine lots of spam tries to sell
# <wink>. In a classic Bayesian classifier, this probably wouldn't have
# mattered, but Graham's throws away almost all the 5-grams from a msg,
# saving only the about-a-dozen farthest from a neutral 0.5. So one bad
# phrase can kill you! This appears to happen very rarely, but happened
# more than once.
#
# The conclusion is that character n-grams have almost nothing to recommend
# them under Graham's scheme: harder to work with, slower, much larger
# database, worse results, and prone to rare mysterious disasters.
# # There's one area they won hands-down: detecting spam in what I assume are # Asian languages. The s-o-w scheme sometimes finds only line-ends to split # on then, and then a "hey, this 'word' is way too big! let's ignore it" # gimmick kicks in, and produces no tokens at all. # # [Later: we produce character 5-grams then under the s-o-w scheme, instead # ignoring the blob, but only if there are high-bit characters in the blob; # e.g., there's no point 5-gramming uuencoded lines, and doing so would # bloat the database size.] # # Interesting: despite that odd example above, the *kinds* of f-p mistakes # 5-grams made were very much like s-o-w made -- I recognized almost all of # the 5-gram f-p messages from previous s-o-w runs. For example, both # schemes have a particular hatred for conference announcements, although # s-o-w stopped hating them after folding case. But 5-grams still hate them. # Both schemes also hate msgs discussing HTML with examples, with about equal # passion. Both schemes hate brief "please subscribe [unsubscribe] me" # msgs, although 5-grams seems to hate them more. ############################################################################## # How to tokenize? # # I started with string.split() merely for speed. Over time I realized it # was making interesting context distinctions qualitatively akin to n-gram # schemes; e.g., "free!!" is a much stronger spam indicator than "free". But # unlike n-grams (whether word- or character- based) under Graham's scoring # scheme, this mild context dependence never seems to go over the edge in # giving "too much" credence to an unlucky phrase. # # OTOH, compared to "searching for words", it increases the size of the # database substantially, less than but close to a factor of 2. This is very # much less than a word bigram scheme bloats it, but as always an increase # isn't justified unless the results are better. # # Following are stats comparing # # for token in text.split(): # left column # # to # # for token in re.findall(r"[\w$\-\x80-\xff]+", text): # right column # # text is case-normalized (text.lower()) in both cases, and the runs were # identical in all other respects. The results clearly favor the split() # gimmick, although they vaguely suggest that some sort of compromise # may do as well with less database burden; e.g., *perhaps* folding runs of # "punctuation" characters into a canonical representative could do that. # But the database size is reasonable without that, and plain split() avoids # having to worry about how to "fold punctuation" in languages other than # English. 
# # false positive percentages # 0.000 0.000 tied # 0.000 0.050 lost # 0.050 0.150 lost # 0.000 0.025 lost # 0.025 0.050 lost # 0.025 0.075 lost # 0.050 0.150 lost # 0.025 0.000 won # 0.025 0.075 lost # 0.000 0.025 lost # 0.075 0.150 lost # 0.050 0.050 tied # 0.025 0.050 lost # 0.000 0.025 lost # 0.050 0.025 won # 0.025 0.000 won # 0.025 0.025 tied # 0.000 0.025 lost # 0.025 0.075 lost # 0.050 0.175 lost # # won 3 times # tied 3 times # lost 14 times # # total unique fp went from 8 to 20 # # false negative percentages # 0.945 1.200 lost # 0.836 1.018 lost # 1.200 1.200 tied # 1.418 1.636 lost # 1.455 1.418 won # 1.091 1.309 lost # 1.091 1.272 lost # 1.236 1.563 lost # 1.564 1.855 lost # 1.236 1.491 lost # 1.563 1.599 lost # 1.563 1.781 lost # 1.236 1.709 lost # 0.836 0.982 lost # 0.873 1.382 lost # 1.236 1.527 lost # 1.273 1.418 lost # 1.018 1.273 lost # 1.091 1.091 tied # 1.490 1.454 won # # won 2 times # tied 2 times # lost 16 times # # total unique fn went from 292 to 302 ############################################################################## # What about HTML? # # Computer geeks seem to view use of HTML in mailing lists and newsgroups as # a mortal sin. Normal people don't, but so it goes: in a technical list/ # group, every HTML decoration has spamprob 0.99, there are lots of unique # HTML decorations, and lots of them appear at the very start of the message # so that Graham's scoring scheme latches on to them tight. As a result, # any plain text message just containing an HTML example is likely to be # judged spam (every HTML decoration is an extreme). # # So if a message is multipart/alternative with both text/plain and text/html # branches, we ignore the latter, else newbies would never get a message # through. If a message is just HTML, it has virtually no chance of getting # through. # # In an effort to let normal people use mailing lists too , and to # alleviate the woes of messages merely *discussing* HTML practice, I # added a gimmick to strip HTML tags after case-normalization and after # special tagging of embedded URLs. This consisted of a regexp sub pattern, # where instances got replaced by single blanks: # # html_re = re.compile(r""" # < # [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b' # [^>]{0,128} # search for the end '>', but don't chew up the world # > # """, re.VERBOSE) # # and then # # text = html_re.sub(' ', text) # # Alas, little good came of this: # # false positive percentages # 0.000 0.000 tied # 0.000 0.000 tied # 0.050 0.075 lost # 0.000 0.000 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.050 0.050 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.000 0.050 lost # 0.075 0.100 lost # 0.050 0.050 tied # 0.025 0.025 tied # 0.000 0.025 lost # 0.050 0.050 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.000 0.000 tied # 0.025 0.050 lost # 0.050 0.050 tied # # won 0 times # tied 15 times # lost 5 times # # total unique fp went from 8 to 12 # # false negative percentages # 0.945 1.164 lost # 0.836 1.418 lost # 1.200 1.272 lost # 1.418 1.272 won # 1.455 1.273 won # 1.091 1.382 lost # 1.091 1.309 lost # 1.236 1.381 lost # 1.564 1.745 lost # 1.236 1.564 lost # 1.563 1.781 lost # 1.563 1.745 lost # 1.236 1.455 lost # 0.836 0.982 lost # 0.873 1.309 lost # 1.236 1.381 lost # 1.273 1.273 tied # 1.018 1.273 lost # 1.091 1.200 lost # 1.490 1.599 lost # # won 2 times # tied 1 times # lost 17 times # # total unique fn went from 292 to 327 # # The messages merely discussing HTML were no longer fps, so it did what it # intended there. 
But the f-n rate nearly doubled on at least one run -- so # strong a set of spam indicators is the mere presence of HTML. The increase # in the number of fps despite that the HTML-discussing msgs left that # category remains mysterious to me, but it wasn't a significant increase # so I let it drop. # # Later: If I simply give up on making mailing lists friendly to my sisters # (they're not nerds, and create wonderfully attractive HTML msgs), a # compromise is to strip HTML tags from only text/plain msgs. That's # principled enough so far as it goes, and eliminates the HTML-discussing # false positives. It remains disturbing that the f-n rate on pure HTML # msgs increases significantly when stripping tags, so the code here doesn't # do that part. However, even after stripping tags, the rates above show that # at least 98% of spams are still correctly identified as spam. # XXX So, if another way is found to slash the f-n rate, the decision here # XXX not to strip HTML from HTML-only msgs should be revisited. url_re = re.compile(r""" (https? | ftp) # capture the protocol :// # skip the boilerplate # Do a reasonable attempt at detecting the end. It may or may not # be in HTML, may or may not be in quotes, etc. If it's full of % # escapes, cool -- that's a clue too. ([^\s<>'"\x7f-\xff]+) # capture the guts """, re.VERBOSE) urlsep_re = re.compile(r"[;?:@&=+,$.]") has_highbit_char = re.compile(r"[\x80-\xff]").search # Cheap-ass gimmick to probabilistically find HTML/XML tags. html_re = re.compile(r""" < [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b' [^>]{0,128} # search for the end '>', but don't run wild > """, re.VERBOSE) # I'm usually just splitting on whitespace, but for subject lines I want to # break things like "Python/Perl comparison?" up. OTOH, I don't want to # break up the unitized numbers in spammish subject phrases like "Increase # size 79%" or "Now only $29.95!". Then again, I do want to break up # "Python-Dev". subject_word_re = re.compile(r"[\w\x80-\xff$.%]+") def tokenize_word(word, _len=len): n = _len(word) # XXX How big should "a word" be? # XXX I expect 12 is fine -- a test run boosting to 13 had no effect # XXX on f-p rate, and did a little better or worse than 12 across # XXX runs -- overall, no significant difference. It's only "common # XXX sense" so far driving the exclusion of lengths 1 and 2. # Make sure this range matches in tokenize(). if 3 <= n <= 12: yield word elif n >= 3: # A long word. # Don't want to skip embedded email addresses. if n < 40 and '.' in word and word.count('@') == 1: p1, p2 = word.split('@') yield 'email name:' + p1 for piece in p2.split('.'): yield 'email addr:' + piece # If there are any high-bit chars, # tokenize it as byte 5-grams. # XXX This really won't work for high-bit languages -- the scoring # XXX scheme throws almost everything away, and one bad phrase can # XXX generate enough bad 5-grams to dominate the final score. # XXX This also increases the database size substantially. elif has_highbit_char(word): for i in xrange(n-4): yield "5g:" + word[i : i+5] else: # It's a long string of "normal" chars. Ignore it. # For example, it may be an embedded URL (which we already # tagged), or a uuencoded line. # There's value in generating a token indicating roughly how # many chars were skipped. This has real benefit for the f-n # rate, but is neutral for the f-p rate. I don't know why! # XXX Figure out why, and/or see if some other way of summarizing # XXX this info has greater benefit. 
yield "skip:%c %d" % (word[0], n // 10 * 10) # Generate tokens for: # Content-Type # and its type= param # Content-Dispostion # and its filename= param # all the charsets # # This has huge benefit for the f-n rate, and virtually none on the f-p rate, # although it does reduce the variance of the f-p rate across different # training sets (really marginal msgs, like a brief HTML msg saying just # "unsubscribe me", are almost always tagged as spam now; before they were # right on the edge, and now the multipart/alternative pushes them over it # more consistently). # # XXX I put all of this in as one chunk. I don't know which parts are # XXX most effective; it could be that some parts don't help at all. But # XXX given the nature of the c.l.py tests, it's not surprising that the # XXX 'content-type:text/html' # XXX token is now the single most powerful spam indicator (== makes it # XXX into the nbest list most often). What *is* a little surprising is # XXX that this doesn't push more mixed-type msgs into the f-p camp -- # XXX unlike looking at *all* HTML tags, this is just one spam indicator # XXX instead of dozens, so relevant msg content can cancel it out. # # A bug in this code prevented Content-Transfer-Encoding from getting # picked up. Fixing that bug showed that it didn't helpe, so the corrected # code is disabled now (left column without Content-Transfer-Encoding, # right column with it); # # false positive percentages # 0.000 0.000 tied # 0.000 0.000 tied # 0.100 0.100 tied # 0.000 0.000 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.100 0.100 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.050 0.050 tied # 0.100 0.100 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.025 0.025 tied # 0.000 0.025 lost +(was 0) # 0.025 0.025 tied # 0.100 0.100 tied # # won 0 times # tied 19 times # lost 1 times # # total unique fp went from 9 to 10 # # false negative percentages # 0.364 0.400 lost +9.89% # 0.400 0.364 won -9.00% # 0.400 0.436 lost +9.00% # 0.909 0.872 won -4.07% # 0.836 0.836 tied # 0.618 0.618 tied # 0.291 0.291 tied # 1.018 0.981 won -3.63% # 0.982 0.982 tied # 0.727 0.727 tied # 0.800 0.800 tied # 1.163 1.127 won -3.10% # 0.764 0.836 lost +9.42% # 0.473 0.473 tied # 0.473 0.618 lost +30.66% # 0.727 0.763 lost +4.95% # 0.655 0.618 won -5.65% # 0.509 0.473 won -7.07% # 0.545 0.582 lost +6.79% # 0.509 0.509 tied # # won 6 times # tied 8 times # lost 6 times # # total unique fn went from 168 to 169 def crack_content_xyz(msg): x = msg.get_type() if x is not None: yield 'content-type:' + x.lower() x = msg.get_param('type') if x is not None: yield 'content-type/type:' + x.lower() for x in msg.get_charsets(None): if x is not None: yield 'charset:' + x.lower() x = msg.get('content-disposition') if x is not None: yield 'content-disposition:' + x.lower() fname = msg.get_filename() if fname is not None: for x in fname.lower().split('/'): for y in x.split('.'): yield 'filename:' + y if 0: # disabled; see comment before function x = msg.get('content-transfer-encoding') if x is not None: yield 'content-transfer-encoding:' + x.lower() class Tokenizer: def get_message(self, obj): if isinstance(obj, email.Message.Message): return obj else: # Create an email Message object. 
try: if hasattr(obj, "readline"): return email.message_from_file(obj) else: return email.message_from_string(obj) except email.Errors.MessageParseError: return None def tokenize(self, obj): msg = self.get_message(obj) if msg is None: yield 'control: MessageParseError' # XXX Fall back to the raw body text? return for tok in self.tokenize_headers(msg): yield tok for tok in self.tokenize_body(msg): yield tok def tokenize_headers(self, msg): # Special tagging of header lines. # XXX TODO Neil Schemenauer has gotten a good start on this # XXX (pvt email). The headers in my spam and ham corpora are # XXX so different (they came from different sources) that if # XXX I include them the classifier's job is trivial. Only # XXX some "safe" header lines are included here, where "safe" # XXX is specific to my sorry corpora. # Content-{Type, Disposition} and their params, and charsets. t = '' for x in msg.walk(): for w in crack_content_xyz(x): yield t + w t = '>' # Subject: # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is # especially significant in this context. Experiment showed a small # but real benefit to keeping case intact in this specific context. x = msg.get('subject', '') for w in subject_word_re.findall(x): for t in tokenize_word(w): yield 'subject:' + t # Dang -- I can't use Sender:. If I do, # 'sender:email name:python-list-admin' # becomes the most powerful indicator in the whole database. # # From: # Reply-To: for field in ('from',):# 'reply-to',): prefix = field + ':' x = msg.get(field, 'none').lower() for w in x.split(): for t in tokenize_word(w): yield prefix + t # These headers seem to work best if they're not tokenized: just # normalize case and whitespace. # X-Mailer: This is a pure and significant win for the f-n rate; f-p # rate isn't affected. # User-Agent: Skipping it, as it made no difference. Very few spams # had a User-Agent field, but lots of hams didn't either, # and the spam probability of User-Agent was very close to # 0.5 (== not a valuable discriminator) across all # training sets. for field in ('x-mailer',): prefix = field + ':' x = msg.get(field, 'none').lower() yield prefix + ' '.join(x.split()) # Organization: # Oddly enough, tokenizing this doesn't make any difference to # results. However, noting its mere absence is strong enough # to give a tiny improvement in the f-n rate, and since # recording that requires only one token across the whole # database, the cost is also tiny. if msg.get('organization', None) is None: yield "bool:noorg" # XXX Following is a great idea due to Anthony Baxter. I can't use it # XXX on my test data because the header lines are so different between # XXX my ham and spam that it makes a large improvement for bogus # XXX reasons. So it's commented out. But it's clearly a good thing # XXX to do on "normal" data, and subsumes the Organization trick above # XXX in a much more general way, yet at comparable cost. # X-UIDL: # Anthony Baxter's idea. This has spamprob 0.99! The value # is clearly irrelevant, just the presence or absence matters. # However, it's extremely rare in my spam sets, so doesn't # have much value. # # As also suggested by Anthony, we can capture all such header # oddities just by generating tags for the count of how many # times each header field appears. ##x2n = {} ##for x in msg.keys(): ## x2n[x] = x2n.get(x, 0) + 1 ##for x in x2n.items(): ## yield "header:%s:%d" % x def tokenize_body(self, msg): # Find, decode (base64, qp), and tokenize textual parts of the body. 
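        # textparts() (defined near the top of the file) returns only the
        # textual MIME parts worth scoring, with the redundant text/html
        # sibling of a multipart/alternative's text/plain already dropped.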
for part in textparts(msg): # Decode, or take it as-is if decoding fails. try: text = part.get_payload(decode=True) except: yield "control: couldn't decode" text = part.get_payload(decode=False) if text is None: yield 'control: payload is None' continue # Normalize case. text = text.lower() # Special tagging of embedded URLs. for proto, guts in url_re.findall(text): yield "proto:" + proto # Lose the trailing punctuation for casual embedding, like: # The code is at http://mystuff.org/here? Didn't resolve. # or # I found it at http://mystuff.org/there/. Thanks! assert guts while guts and guts[-1] in '.:?!/': guts = guts[:-1] for i, piece in enumerate(guts.split('/')): prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1') for chunk in urlsep_re.split(piece): yield prefix + chunk # Remove HTML/XML tags if it's a plain text message. if part.get_content_type() == "text/plain": text = html_re.sub(' ', text) # Tokenize everything in the body. for w in text.split(): n = len(w) # Make sure this range matches in tokenize_word(). if 3 <= n <= 12: yield w elif n >= 3: for t in tokenize_word(w): yield t tokenize = Tokenizer().tokenize Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** README.txt 7 Sep 2002 05:51:05 -0000 1.8 --- README.txt 7 Sep 2002 16:14:09 -0000 1.9 *************** *** 37,44 **** A concrete test driver like timtest.py (see below), but working with a pair of mailbox files rather than the specialized timtest ! setup. Note that the validity of results from mboxtest.py have ! yet to be confirmed. ! timtoken.py An implementation of tokenize() that Tim can't seem to help but keep working on . --- 37,43 ---- A concrete test driver like timtest.py (see below), but working with a pair of mailbox files rather than the specialized timtest ! setup. ! tokenizer.py An implementation of tokenize() that Tim can't seem to help but keep working on . From jhylton@users.sourceforge.net Sat Sep 7 17:15:47 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Sat, 07 Sep 2002 09:15:47 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.11,1.12 setup.py,1.2,1.3 timtest.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv17725 Modified Files: hammie.py setup.py timtest.py Log Message: Use tokenizer module. XXX Watch out, Tim! I just change timtest out from under you. Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** hammie.py 7 Sep 2002 06:18:03 -0000 1.11 --- hammie.py 7 Sep 2002 16:15:45 -0000 1.12 *************** *** 47,51 **** # Tim's tokenizer kicks far more booty than anything I would have # written. Score one for analysis ;) ! from timtoken import tokenize class DBDict: --- 47,51 ---- # Tim's tokenizer kicks far more booty than anything I would have # written. Score one for analysis ;) ! from tokenizer import tokenize class DBDict: Index: setup.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/setup.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** setup.py 7 Sep 2002 05:52:48 -0000 1.2 --- setup.py 7 Sep 2002 16:15:45 -0000 1.3 *************** *** 4,8 **** name='spambayes', scripts=['unheader.py', 'hammie.py'], ! 
py_modules=['classifier', 'timtoken'] ) --- 4,8 ---- name='spambayes', scripts=['unheader.py', 'hammie.py'], ! py_modules=['classifier', 'tokenizer'] ) Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** timtest.py 7 Sep 2002 05:11:31 -0000 1.11 --- timtest.py 7 Sep 2002 16:15:45 -0000 1.12 *************** *** 13,17 **** import Tester import classifier ! from timtoken import tokenize class Hist: --- 13,17 ---- import Tester import classifier ! from tokenizer import tokenize class Hist: *************** *** 63,67 **** print "prob(%r) = %g" % clue print ! guts = msg.guts if charlimit is not None: guts = guts[:charlimit] --- 63,67 ---- print "prob(%r) = %g" % clue print ! guts = str(msg) if charlimit is not None: guts = guts[:charlimit] *************** *** 86,89 **** --- 86,92 ---- return self.tag == other.tag + def __str__(self): + return self.guts + class MsgStream(object): def __init__(self, directory): *************** *** 153,157 **** printhist("all runs:", self.global_ham_hist, self.global_spam_hist) ! def test(self, ham, spam): c = self.classifier t = self.tester --- 156,160 ---- printhist("all runs:", self.global_ham_hist, self.global_spam_hist) ! def test(self, ham, spam, charlimit=None): c = self.classifier t = self.tester *************** *** 168,172 **** print "Low prob spam!", prob prob, clues = c.spamprob(msg, True) ! printmsg(msg, prob, clues) t.reset_test_results() --- 171,175 ---- print "Low prob spam!", prob prob, clues = c.spamprob(msg, True) ! printmsg(msg, prob, clues, charlimit) t.reset_test_results() *************** *** 185,189 **** print '*' * 78 prob, clues = c.spamprob(e, True) ! printmsg(e, prob, clues) newfneg = Set(t.false_negatives()) - self.falseneg --- 188,192 ---- print '*' * 78 prob, clues = c.spamprob(e, True) ! printmsg(e, prob, clues, charlimit) newfneg = Set(t.false_negatives()) - self.falseneg From jhylton@users.sourceforge.net Sat Sep 7 17:17:21 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Sat, 07 Sep 2002 09:17:21 -0700 Subject: [Spambayes-checkins] spambayes mboxtest.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18055 Modified Files: mboxtest.py Log Message: A bunch of unrelated updates. Add docstring. Use tokenizer module. Add MyTokenizer that knows less about how to deal with headers. Add custom __str__() to MboxMsg to surpress boring headers. Index: mboxtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** mboxtest.py 6 Sep 2002 19:26:34 -0000 1.1 --- mboxtest.py 7 Sep 2002 16:17:19 -0000 1.2 *************** *** 1,5 **** #! /usr/bin/env python ! from timtoken import tokenize from classifier import GrahamBayes from Tester import Test --- 1,26 ---- #! /usr/bin/env python + """mboxtest.py: A test driver for classifier. ! Usage: mboxtest.py [options] ! ! Options: ! -f FMT ! One of unix, mmdf, mh, or qmail. Specifies mailbox format for ! ham and spam files. Default is unix. ! ! -n NSETS ! Number of test sets to create for a single mailbox. Default is 5. ! ! -s SEED ! Seed for random number generator. Default is 101. ! ! -m MSGS ! Read no more than MSGS messages from mailbox. ! ! -l LIMIT ! Print no more than LIMIT characters of a message in test output. ! """ ! ! 
from tokenizer import Tokenizer, subject_word_re, tokenize_word, tokenize from classifier import GrahamBayes from Tester import Test *************** *** 18,21 **** --- 39,58 ---- } + class MyTokenizer(Tokenizer): + + skip = {'received': 1, + 'date': 1, + 'x-from_': 1, + } + + def tokenize_headers(self, msg): + for k, v in msg.items(): + k = k.lower() + if k in self.skip or k.startswith('x-vm'): + continue + for w in subject_word_re.findall(v): + for t in tokenize_word(w): + yield "%s:%s" % (k, t) + class MboxMsg(Msg): *************** *** 24,27 **** --- 61,86 ---- self.tag = "%s:%s %s" % (path, index, subject(self.guts)) + def __str__(self): + lines = [] + i = 0 + for line in self.guts.split("\n"): + skip = False + for skip_prefix in 'X-', 'Received:', '\t',: + if line.startswith(skip_prefix): + skip = True + if skip: + continue + i += 1 + if i > 100: + lines.append("... truncated") + break + lines.append(line) + return "\n".join(lines) + + ## tokenize = MyTokenizer().tokenize + + def __iter__(self): + return tokenize(self.guts) + class mbox(object): *************** *** 77,82 **** NSETS = 5 SEED = 101 ! LIMIT = None ! opts, args = getopt.getopt(args, "f:n:s:l:") for k, v in opts: if k == '-f': --- 136,142 ---- NSETS = 5 SEED = 101 ! MAXMSGS = None ! CHARLIMIT = 1000 ! opts, args = getopt.getopt(args, "f:n:s:l:m:") for k, v in opts: if k == '-f': *************** *** 87,91 **** SEED = int(v) if k == '-l': ! LIMIT = int(v) ham, spam = args --- 147,153 ---- SEED = int(v) if k == '-l': ! CHARLIMIT = int(v) ! if k == '-m': ! MAXMSGS = int(v) ham, spam = args *************** *** 96,102 **** nspam = len(list(mbox(spam))) ! if LIMIT: ! nham = min(nham, LIMIT) ! nspam = min(nspam, LIMIT) print "ham", ham, nham --- 158,164 ---- nspam = len(list(mbox(spam))) ! if MAXMSGS: ! nham = min(nham, MAXMSGS) ! nspam = min(nspam, MAXMSGS) print "ham", ham, nham *************** *** 115,120 **** if (iham, ispam) == (ihtest, istest): continue ! driver.test(mbox(ham, ihtest), mbox(spam, istest)) ! driver.finish() driver.alldone() --- 177,182 ---- if (iham, ispam) == (ihtest, istest): continue ! driver.test(mbox(ham, ihtest), mbox(spam, istest), CHARLIMIT) ! driver.finishtest() driver.alldone() From jhylton@users.sourceforge.net Sat Sep 7 17:39:06 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Sat, 07 Sep 2002 09:39:06 -0700 Subject: [Spambayes-checkins] spambayes rates.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv22298 Modified Files: rates.py Log Message: Change to work with mboxtest.py output. Index: rates.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rates.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** rates.py 5 Sep 2002 23:34:41 -0000 1.1 --- rates.py 7 Sep 2002 16:39:04 -0000 1.2 *************** *** 27,31 **** new false positives: ['Data/Ham/Set2/66645.txt'] """ ! pat1 = re.compile(r'\s*Training on Data/').match pat2 = re.compile(r'\s+false (positive|negative): (.*)').match pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match --- 27,31 ---- new false positives: ['Data/Ham/Set2/66645.txt'] """ ! 
pat1 = re.compile(r'\s*Training on ').match pat2 = re.compile(r'\s+false (positive|negative): (.*)').match pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match From rubiconx@users.sourceforge.net Sat Sep 7 18:12:24 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Sat, 07 Sep 2002 10:12:24 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30001 Modified Files: hammie.py Log Message: New DEFAULTDB global variable, updated usage docstring. Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** hammie.py 7 Sep 2002 16:15:45 -0000 1.12 --- hammie.py 7 Sep 2002 17:12:22 -0000 1.13 *************** *** 2,8 **** # At the moment, this requires Python 2.3 from CVS ! # A driver for the classifier module. Currently mostly a wrapper around ! # existing stuff. Neale Pickett is the person to ! # blame for this. """Usage: %(program)s [options] --- 2,7 ---- # At the moment, this requires Python 2.3 from CVS ! # A driver for the classifier module and Tim's tokenizer that you can ! # call from procmail. """Usage: %(program)s [options] *************** *** 19,23 **** -p FILE use file as the persistent store. loads data from this file if it ! exists, and saves data to this file at the end. Default: hammie.db -d use the DBM store instead of cPickle. The file is larger and --- 18,22 ---- -p FILE use file as the persistent store. loads data from this file if it ! exists, and saves data to this file at the end. Default: %(DEFAULTDB)s -d use the DBM store instead of cPickle. The file is larger and *************** *** 26,30 **** -f run as a filter: read a single message from stdin, add an ! X-Spam-Disposition header, and write it to stdout. """ --- 25,29 ---- -f run as a filter: read a single message from stdin, add an ! %(DISPHEADER)s header, and write it to stdout. """ *************** *** 45,48 **** --- 44,50 ---- DISPHEADER = "X-Hammie-Disposition" + # Default database name + DEFAULTDB = "hammie.db" + # Tim's tokenizer kicks far more booty than anything I would have # written. Score one for analysis ;) *************** *** 278,282 **** usage(2, "No options given") ! pck = "hammie.db" good = spam = unknown = None do_filter = usedb = False --- 280,284 ---- usage(2, "No options given") ! pck = DEFAULTDB good = spam = unknown = None do_filter = usedb = False From tim_one@users.sourceforge.net Sat Sep 7 19:22:02 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 07 Sep 2002 11:22:02 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.9,1.10 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16167 Modified Files: README.txt Log Message: Some rearrangement. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** README.txt 7 Sep 2002 16:14:09 -0000 1.9 --- README.txt 7 Sep 2002 18:22:00 -0000 1.10 *************** *** 31,35 **** hammie.py ! A spamassassin-like filter which uses timtoken (below) and classifier (above). Needs to be made faster, especially for writes. --- 31,35 ---- hammie.py ! A spamassassin-like filter which uses tokenizer (below) and classifier (above). 
Needs to be made faster, especially for writes. *************** *** 49,56 **** tokenize() function of your choosing. - unheader.py - A script to remove unwanted headers from an mbox file. This is mostly - useful to delete headers which incorrectly might bias the results. - GBayes.py A number of tokenizers and a partial test driver. This assumes --- 49,52 ---- *************** *** 73,84 **** Test Data Utilities =================== - rebal.py - Evens out the number of messages in "standard" test data folders (see - below). - cleanarch A script to repair mbox archives by finding "From" lines that should have been escaped, and escaping them. mboxcount.py Count the number of messages (both parseable and unparseable) in --- 69,80 ---- Test Data Utilities =================== cleanarch A script to repair mbox archives by finding "From" lines that should have been escaped, and escaping them. + unheader.py + A script to remove unwanted headers from an mbox file. This is mostly + useful to delete headers which incorrectly might bias the results. + mboxcount.py Count the number of messages (both parseable and unparseable) in *************** *** 89,92 **** --- 85,92 ---- Split an mbox into random pieces in various ways. Tim recommends using "the standard" test data set up instead (see below). + + rebal.py + Evens out the number of messages in "standard" test data folders (see + below). From tim_one@users.sourceforge.net Sat Sep 7 19:38:13 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 07 Sep 2002 11:38:13 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.1,1.2 timtoken.py,1.8,NONE Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv19837 Modified Files: tokenizer.py Removed Files: timtoken.py Log Message: Removed timtoken.py from the project. tokenizer.py is essentially a copy, but of a somewhat out-of-date version of timtoken at the time it was introduced. The differences are all in comments, and I found those and put them back into tokenizer.py. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** tokenizer.py 7 Sep 2002 16:14:09 -0000 1.1 --- tokenizer.py 7 Sep 2002 18:38:10 -0000 1.2 *************** *** 352,355 **** --- 352,375 ---- # XXX not to strip HTML from HTML-only msgs should be revisited. + ############################################################################## + # How big should "a word" be? + # + # As I write this, words less than 3 chars are ignored completely, and words + # with more than 12 are special-cased, replaced with a summary "I skipped + # about so-and-so many chars starting with such-and-such a letter" token. + # This makes sense for English if most of the info is in "regular size" + # words. + # + # A test run boosting to 13 had no effect on f-p rate, and did a little + # better or worse than 12 across runs -- overall, no significant difference. + # The database size is smaller at 12, so there's nothing in favor of 13. + # A test at 11 showed a slight but consistent bad effect on the f-n rate + # (lost 12 times, won once, tied 7 times). + # + # A test with no lower bound showed a significant increase in the f-n rate. + # Curious, but not worth digging into. Boosting the lower bound to 4 is a + # worse idea: f-p and f-n rates both suffered significantly then. I didn't + # try testing with lower bound 2. + url_re = re.compile(r""" (https? 
| ftp) # capture the protocol *************** *** 383,392 **** n = _len(word) - # XXX How big should "a word" be? - # XXX I expect 12 is fine -- a test run boosting to 13 had no effect - # XXX on f-p rate, and did a little better or worse than 12 across - # XXX runs -- overall, no significant difference. It's only "common - # XXX sense" so far driving the exclusion of lengths 1 and 2. - # Make sure this range matches in tokenize(). if 3 <= n <= 12: --- 403,406 ---- *************** *** 449,453 **** # # A bug in this code prevented Content-Transfer-Encoding from getting ! # picked up. Fixing that bug showed that it didn't helpe, so the corrected # code is disabled now (left column without Content-Transfer-Encoding, # right column with it); --- 463,467 ---- # # A bug in this code prevented Content-Transfer-Encoding from getting ! # picked up. Fixing that bug showed that it didn't help, so the corrected # code is disabled now (left column without Content-Transfer-Encoding, # right column with it); *************** *** 567,571 **** def tokenize_headers(self, msg): # Special tagging of header lines. ! # XXX TODO Neil Schemenauer has gotten a good start on this # XXX (pvt email). The headers in my spam and ham corpora are --- 581,585 ---- def tokenize_headers(self, msg): # Special tagging of header lines. ! # XXX TODO Neil Schemenauer has gotten a good start on this # XXX (pvt email). The headers in my spam and ham corpora are --- timtoken.py DELETED --- From tim_one@users.sourceforge.net Sat Sep 7 20:44:34 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 07 Sep 2002 12:44:34 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2928 Modified Files: tokenizer.py Log Message: Added Neil Schemenauer's IP tokenization of Received: headers, unfortunately disabled for now. Moved textparts() below the massive comments at the start. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** tokenizer.py 7 Sep 2002 18:38:10 -0000 1.2 --- tokenizer.py 7 Sep 2002 19:44:31 -0000 1.3 *************** *** 5,44 **** from sets import Set - # Find all the text components of the msg. There's no point decoding - # binary blobs (like images). If a multipart/alternative has both plain - # text and HTML versions of a msg, ignore the HTML part: HTML decorations - # have monster-high spam probabilities, and innocent newbies often post - # using HTML. - def textparts(msg): - text = Set() - redundant_html = Set() - for part in msg.walk(): - if part.get_content_type() == 'multipart/alternative': - # Descend this part of the tree, adding any redundant HTML text - # part to redundant_html. 
- htmlpart = textpart = None - stack = part.get_payload() - while stack: - subpart = stack.pop() - ctype = subpart.get_content_type() - if ctype == 'text/plain': - textpart = subpart - elif ctype == 'text/html': - htmlpart = subpart - elif ctype == 'multipart/related': - stack.extend(subpart.get_payload()) - - if textpart is not None: - text.add(textpart) - if htmlpart is not None: - redundant_html.add(htmlpart) - elif htmlpart is not None: - text.add(htmlpart) - - elif part.get_content_maintype() == 'text': - text.add(part) - - return text - redundant_html - ############################################################################## # To fold case or not to fold case? I didn't want to fold case, because --- 5,8 ---- *************** *** 372,375 **** --- 336,377 ---- # try testing with lower bound 2. + + + # Find all the text components of the msg. There's no point decoding + # binary blobs (like images). If a multipart/alternative has both plain + # text and HTML versions of a msg, ignore the HTML part: HTML decorations + # have monster-high spam probabilities, and innocent newbies often post + # using HTML. + def textparts(msg): + text = Set() + redundant_html = Set() + for part in msg.walk(): + if part.get_content_type() == 'multipart/alternative': + # Descend this part of the tree, adding any redundant HTML text + # part to redundant_html. + htmlpart = textpart = None + stack = part.get_payload() + while stack: + subpart = stack.pop() + ctype = subpart.get_content_type() + if ctype == 'text/plain': + textpart = subpart + elif ctype == 'text/html': + htmlpart = subpart + elif ctype == 'multipart/related': + stack.extend(subpart.get_payload()) + + if textpart is not None: + text.add(textpart) + if htmlpart is not None: + redundant_html.add(htmlpart) + elif htmlpart is not None: + text.add(htmlpart) + + elif part.get_content_maintype() == 'text': + text.add(part) + + return text - redundant_html + url_re = re.compile(r""" (https? | ftp) # capture the protocol *************** *** 393,396 **** --- 395,400 ---- """, re.VERBOSE) + ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})') + # I'm usually just splitting on whitespace, but for subject lines I want to # break things like "Python/Perl comparison?" up. OTOH, I don't want to *************** *** 640,643 **** --- 644,660 ---- if msg.get('organization', None) is None: yield "bool:noorg" + + # Received: + # Neil Schemenauer reported good results from tokenizing prefixes + # of the embedded IP addresses. + # XXX This is disabled only because it's "too good" when used on + # XXX Tim's mixed-source corpora. + if 0: + for header in msg.get_all("received", ()): + for ip in ip_re.findall(header): + parts = ip.split(".") + for n in range(1, 5): + yield 'received:' + '.'.join(parts[:n]) + # XXX Following is a great idea due to Anthony Baxter. I can't use it From gvanrossum@users.sourceforge.net Sun Sep 8 03:59:45 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Sat, 07 Sep 2002 19:59:45 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.13,1.14 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31886 Modified Files: hammie.py Log Message: Make -u only print the spams. 
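[The disabled Received: tokenization in the tokenizer checkin above breaks
each embedded IP address down into leading prefixes. A quick sketch of the
tokens it would generate, using a made-up address:

    >>> parts = "12.34.56.78".split(".")
    >>> ['received:' + '.'.join(parts[:n]) for n in range(1, 5)]
    ['received:12', 'received:12.34', 'received:12.34.56', 'received:12.34.56.78']

The idea is that a prefix like 'received:12.34' can stay a useful clue even
when the low-order octets vary across a spammer's netblock.]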
Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** hammie.py 7 Sep 2002 17:12:22 -0000 1.13 --- hammie.py 8 Sep 2002 02:59:43 -0000 1.14 *************** *** 253,263 **** prob, clues = bayes.spamprob(tokenize(msg), True) isspam = prob >= 0.9 - print "%6d %4.2f %1s" % (i, prob, isspam and "S" or "."), if isspam: spams += 1 print formatclues(clues) else: hams += 1 - print print "Total %d spam, %d ham" % (spams, hams) --- 253,262 ---- prob, clues = bayes.spamprob(tokenize(msg), True) isspam = prob >= 0.9 if isspam: spams += 1 + print "%6s %4.2f %1s" % (i, prob, isspam and "S" or "."), print formatclues(clues) else: hams += 1 print "Total %d spam, %d ham" % (spams, hams) From tim_one@users.sourceforge.net Sun Sep 8 04:17:33 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 07 Sep 2002 20:17:33 -0700 Subject: [Spambayes-checkins] spambayes classifier.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2722 Modified Files: classifier.py Log Message: spamprob(): If the caller asked for the clues ( pairs), sort them by prob. Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** classifier.py 7 Sep 2002 05:11:30 -0000 1.4 --- classifier.py 8 Sep 2002 03:17:31 -0000 1.5 *************** *** 323,326 **** --- 323,327 ---- prob = prob_product / (prob_product + inverse_prob_product) if evidence: + clues.sort(lambda a, b: cmp(a[1], b[1])) return prob, clues else: *************** *** 559,562 **** --- 560,571 ---- elif prob > MAX_SPAMPROB: prob = MAX_SPAMPROB + + + ## if prob != 0.5: + ## confbias = 0.01 / (record.hamcount + record.spamcount) + ## if prob > 0.5: + ## prob = max(0.5, prob - confbias) + ## else: + ## prob = min(0.5, prob + confbias) if record.spamprob != prob: From gvanrossum@users.sourceforge.net Sun Sep 8 04:20:20 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Sat, 07 Sep 2002 20:20:20 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.14,1.15 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv3250 Modified Files: hammie.py Log Message: No need to sort the clues any more (classifier.py does that now). Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** hammie.py 8 Sep 2002 02:59:43 -0000 1.14 --- hammie.py 8 Sep 2002 03:20:18 -0000 1.15 *************** *** 226,232 **** def formatclues(clues, sep="; "): """Format the clues into something readable.""" ! lst = [(prob, word) for word, prob in clues] ! lst.sort() ! return sep.join(["%r: %.2f" % (word, prob) for prob, word in lst]) def filter(bayes, input, output): --- 226,230 ---- def formatclues(clues, sep="; "): """Format the clues into something readable.""" ! 
return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) def filter(bayes, input, output): From tim_one@users.sourceforge.net Sun Sep 8 09:08:04 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 01:08:04 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24720 Modified Files: tokenizer.py Log Message: Add results from latest experiments with tokenization and HTML stripping. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** tokenizer.py 7 Sep 2002 19:44:31 -0000 1.3 --- tokenizer.py 8 Sep 2002 08:08:02 -0000 1.4 *************** *** 205,209 **** # # total unique fn went from 292 to 302 ! ############################################################################## --- 205,299 ---- # # total unique fn went from 292 to 302 ! # ! # Later: Here's another tokenization scheme with more promise. ! # ! # fold case, ignore punctuation, strip a trailing 's' from words (to ! # stop Guido griping about "hotel" and "hotels" getting scored as ! # distinct clues ) and save both word bigrams and word unigrams ! # ! # This was the code: ! # ! # # Tokenize everything in the body. ! # lastw = '' ! # for w in word_re.findall(text): ! # n = len(w) ! # # Make sure this range matches in tokenize_word(). ! # if 3 <= n <= 12: ! # if w[-1] == 's': ! # w = w[:-1] ! # yield w ! # if lastw: ! # yield lastw + w ! # lastw = w + ' ' ! # ! # elif n >= 3: ! # lastw = '' ! # for t in tokenize_word(w): ! # yield t ! # ! # where ! # ! # word_re = re.compile(r"[\w$\-\x80-\xff]+") ! # ! # This at least doubled the process size. It helped the f-n rate ! # significantly, but probably hurt the f-p rate (the f-p rate is too low ! # with only 4000 hams per run to be confident about changes of such small ! # *absolute* magnitude -- 0.025% is a single message in the f-p table): ! # ! # false positive percentages ! # 0.000 0.000 tied ! # 0.000 0.075 lost +(was 0) ! # 0.050 0.125 lost +150.00% ! # 0.025 0.000 won -100.00% ! # 0.075 0.025 won -66.67% ! # 0.000 0.050 lost +(was 0) ! # 0.100 0.175 lost +75.00% ! # 0.050 0.050 tied ! # 0.025 0.050 lost +100.00% ! # 0.025 0.000 won -100.00% ! # 0.050 0.125 lost +150.00% ! # 0.050 0.025 won -50.00% ! # 0.050 0.050 tied ! # 0.000 0.025 lost +(was 0) ! # 0.000 0.025 lost +(was 0) ! # 0.075 0.050 won -33.33% ! # 0.025 0.050 lost +100.00% ! # 0.000 0.000 tied ! # 0.025 0.100 lost +300.00% ! # 0.050 0.150 lost +200.00% ! # ! # won 5 times ! # tied 4 times ! # lost 11 times ! # ! # total unique fp went from 13 to 21 ! # ! # false negative percentages ! # 0.327 0.218 won -33.33% ! # 0.400 0.218 won -45.50% ! # 0.327 0.218 won -33.33% ! # 0.691 0.691 tied ! # 0.545 0.327 won -40.00% ! # 0.291 0.218 won -25.09% ! # 0.218 0.291 lost +33.49% ! # 0.654 0.473 won -27.68% ! # 0.364 0.327 won -10.16% ! # 0.291 0.182 won -37.46% ! # 0.327 0.254 won -22.32% ! # 0.691 0.509 won -26.34% ! # 0.582 0.473 won -18.73% ! # 0.291 0.255 won -12.37% ! # 0.364 0.218 won -40.11% ! # 0.436 0.327 won -25.00% ! # 0.436 0.473 lost +8.49% ! # 0.218 0.218 tied ! # 0.291 0.255 won -12.37% ! # 0.254 0.364 lost +43.31% ! # ! # won 15 times ! # tied 2 times ! # lost 3 times ! # ! 
# total unique fn went from 106 to 94 ############################################################################## *************** *** 313,318 **** # do that part. However, even after stripping tags, the rates above show that # at least 98% of spams are still correctly identified as spam. ! # XXX So, if another way is found to slash the f-n rate, the decision here ! # XXX not to strip HTML from HTML-only msgs should be revisited. ############################################################################## --- 403,471 ---- # do that part. However, even after stripping tags, the rates above show that # at least 98% of spams are still correctly identified as spam. ! # ! # So, if another way is found to slash the f-n rate, the decision here not ! # to strip HTML from HTML-only msgs should be revisited. ! # ! # Later, after the f-n rate got slashed via other means: ! # ! # false positive percentages ! # 0.000 0.000 tied ! # 0.000 0.000 tied ! # 0.050 0.075 lost +50.00% ! # 0.025 0.025 tied ! # 0.075 0.025 won -66.67% ! # 0.000 0.000 tied ! # 0.100 0.100 tied ! # 0.050 0.075 lost +50.00% ! # 0.025 0.025 tied ! # 0.025 0.000 won -100.00% ! # 0.050 0.075 lost +50.00% ! # 0.050 0.050 tied ! # 0.050 0.025 won -50.00% ! # 0.000 0.000 tied ! # 0.000 0.000 tied ! # 0.075 0.075 tied ! # 0.025 0.025 tied ! # 0.000 0.000 tied ! # 0.025 0.025 tied ! # 0.050 0.050 tied ! # ! # won 3 times ! # tied 14 times ! # lost 3 times ! # ! # total unique fp went from 13 to 11 ! # ! # false negative percentages ! # 0.327 0.400 lost +22.32% ! # 0.400 0.400 tied ! # 0.327 0.473 lost +44.65% ! # 0.691 0.654 won -5.35% ! # 0.545 0.473 won -13.21% ! # 0.291 0.364 lost +25.09% ! # 0.218 0.291 lost +33.49% ! # 0.654 0.654 tied ! # 0.364 0.473 lost +29.95% ! # 0.291 0.327 lost +12.37% ! # 0.327 0.291 won -11.01% ! # 0.691 0.654 won -5.35% ! # 0.582 0.655 lost +12.54% ! # 0.291 0.400 lost +37.46% ! # 0.364 0.436 lost +19.78% ! # 0.436 0.582 lost +33.49% ! # 0.436 0.364 won -16.51% ! # 0.218 0.291 lost +33.49% ! # 0.291 0.400 lost +37.46% ! # 0.254 0.327 lost +28.74% ! # ! # won 5 times ! # tied 2 times ! # lost 13 times ! # ! # total unique fn went from 106 to 122 ! # ! # So HTML decorations are still a significant clue when the ham is composed ! # of c.l.py traffic. Again, this should be revisited if the f-n rate is ! # slashed again. ############################################################################## From nascheme@users.sourceforge.net Sun Sep 8 13:55:36 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Sun, 08 Sep 2002 05:55:36 -0700 Subject: [Spambayes-checkins] spambayes splitndirs.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv14232 Added Files: splitndirs.py Log Message: Like splitn.py but puts each message in a file, suitable for timtest.py. I don't know what the assert is trying to do and it fails on my spam box so I left it out. --- NEW FILE: splitndirs.py --- #! /usr/bin/env python """Split an mbox into N random directories of files. Usage: %(program)s [-h] [-s seed] [-v] -n N sourcembox outdirbase Options: -h / --help Print this help message and exit -s seed Seed the random number generator with seed (an integer). By default, use system time at startup to seed. -v Verbose. Displays a period for each 100 messages parsed. May display other stuff. -n N The number of output mboxes desired. This is required. Arguments: sourcembox The mbox to split. outdirbase The base path + name prefix for each of the N output dirs. 
Output files have names of the form outdirbase + ("Set%%d/%%d" %% (i, n)) Example: %(program)s -s 123 -n5 Data/spam.mbox Data/Spam/Set produces 5 directories, named Data/Spam/Set1 through Data/Spam/Set5. Each contains a random selection of the messages in spam.mbox, and together they contain every message in spam.mbox exactly once. Each has approximately the same number of messages. spam.mbox is not altered. In addition, the seed for the random number generator is forced to 123, so that while the split is random, it's reproducible. """ import sys import os import random import mailbox import email import getopt program = sys.argv[0] def usage(code, msg=''): print >> sys.stderr, __doc__ % globals() if msg: print >> sys.stderr, msg sys.exit(code) def _factory(fp): try: return email.message_from_file(fp) except email.Errors.MessageParseError: return '' def main(): try: opts, args = getopt.getopt(sys.argv[1:], 'hn:s:v', ['help']) except getopt.error, msg: usage(1, msg) n = None verbose = False for opt, arg in opts: if opt in ('-h', '--help'): usage(0) elif opt == '-s': random.seed(int(arg)) elif opt == '-n': n = int(arg) elif opt == '-v': verbose = True if n is None or n <= 1: usage(1, "an -n value > 1 is required") if len(args) != 2: usage(1, "input mbox name and output base path are required") inputpath, outputbasepath = args infile = file(inputpath, 'rb') outdirs = [outputbasepath + ("%d" % i) for i in range(1, n+1)] for dir in outdirs: if not os.path.isdir(dir): os.makedirs(dir) mbox = mailbox.PortableUnixMailbox(infile, _factory) counter = 0 for msg in mbox: i = random.randrange(n) astext = str(msg) #assert astext.endswith('\n') counter += 1 msgfile = open('%s/%d' % (outdirs[i], counter), 'wb') msgfile.write(astext) msgfile.close() if verbose: if counter % 100 == 0: print '.', if verbose: print print counter, "messages split into", n, "directories" infile.close() if __name__ == '__main__': main() From nascheme@users.sourceforge.net Sun Sep 8 18:10:06 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Sun, 08 Sep 2002 10:10:06 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv10946 Modified Files: cmp.py Log Message: make work for NSETS != 5 Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** cmp.py 6 Sep 2002 04:25:45 -0000 1.2 --- cmp.py 8 Sep 2002 17:10:03 -0000 1.3 *************** *** 10,15 **** f1n, f2n = sys.argv[1:3] - NSETS = 5 - # Return # (list of all f-p rates, --- 10,13 ---- *************** *** 21,29 **** fns = [] fps = [] ! for block in range(NSETS): ! # Skip, e.g., ! # Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams ! f.readline() ! for inner in range(NSETS - 1): # A line with an f-p rate and an f-n rate. p, n = map(float, f.readline().split()) --- 19,27 ---- fns = [] fps = [] ! while 1: ! line = f.readline() ! if line.startswith('total'): ! break ! if not line.startswith('Training'): # A line with an f-p rate and an f-n rate. p, n = map(float, f.readline().split()) *************** *** 33,37 **** # "total false pos 8 0.04" # "total false neg 249 1.81090909091" ! fptot = int(f.readline().split()[-2]) fntot = int(f.readline().split()[-2]) return fps, fns, fptot, fntot --- 31,35 ---- # "total false pos 8 0.04" # "total false neg 249 1.81090909091" ! 
fptot = int(line.split()[-2]) fntot = int(f.readline().split()[-2]) return fps, fns, fptot, fntot From nascheme@users.sourceforge.net Sun Sep 8 18:18:44 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Sun, 08 Sep 2002 10:18:44 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12815 Modified Files: tokenizer.py Log Message: smarter received header processing. Grab the 'from' hostname and IP and ignore the rest. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** tokenizer.py 8 Sep 2002 08:08:02 -0000 1.4 --- tokenizer.py 8 Sep 2002 17:18:41 -0000 1.5 *************** *** 548,552 **** """, re.VERBOSE) ! ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})') # I'm usually just splitting on whitespace, but for subject lines I want to --- 548,553 ---- """, re.VERBOSE) ! received_host_re = re.compile(r'from (\S+)\s') ! received_ip_re = re.compile(r'\s[[(]((\d{1,3}\.?){4})[\])]') # I'm usually just splitting on whitespace, but for subject lines I want to *************** *** 708,711 **** --- 709,721 ---- yield 'content-transfer-encoding:' + x.lower() + def breakdown_host(host): + parts = host.split('.') + for i in range(1, len(parts) + 1): + yield '.'.join(parts[-i:]) + + def breakdown_ipaddr(ipaddr): + parts = ipaddr.split('.') + for i in range(1, 5): + yield '.'.join(parts[:i]) class Tokenizer: *************** *** 805,813 **** if 0: for header in msg.get_all("received", ()): ! for ip in ip_re.findall(header): ! parts = ip.split(".") ! for n in range(1, 5): ! yield 'received:' + '.'.join(parts[:n]) ! # XXX Following is a great idea due to Anthony Baxter. I can't use it --- 815,824 ---- if 0: for header in msg.get_all("received", ()): ! for pat, breakdown in [(received_host_re, breakdown_host), ! (received_ip_re, breakdown_ipaddr)]: ! m = pat.search(header) ! if m: ! for tok in breakdown(m.group(1).lower()): ! yield 'received:' + tok # XXX Following is a great idea due to Anthony Baxter. I can't use it From tim_one@users.sourceforge.net Sun Sep 8 18:41:59 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 10:41:59 -0700 Subject: [Spambayes-checkins] spambayes splitn.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv17562 Modified Files: splitn.py Log Message: Removed pointless assert; it failed for Neil. Index: splitn.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitn.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** splitn.py 5 Sep 2002 16:16:43 -0000 1.1 --- splitn.py 8 Sep 2002 17:41:56 -0000 1.2 *************** *** 94,98 **** i = random.randrange(n) astext = str(msg) - assert astext.endswith('\n') outfiles[i].write(astext) counter += 1 --- 94,97 ---- From tim_one@users.sourceforge.net Sun Sep 8 18:46:16 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 10:46:16 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.10,1.11 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18658 Modified Files: README.txt Log Message: Blurb about Neil's splitndirs.py. 
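[For concreteness, here's what Neil's breakdown helpers in the checkin above
generate -- host suffixes and IP prefixes -- fed a hypothetical header line
"from mail.example.com [12.34.56.78] by ...":

    >>> list(breakdown_host("mail.example.com"))
    ['com', 'example.com', 'mail.example.com']
    >>> list(breakdown_ipaddr("12.34.56.78"))
    ['12', '12.34', '12.34.56', '12.34.56.78']

The suffix and prefix forms are the parts likely to generalize: 'example.com'
survives a change of relay host, and '12.34' survives a change in the final
octets.]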
Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** README.txt	7 Sep 2002 18:22:00 -0000	1.10
--- README.txt	8 Sep 2002 17:46:14 -0000	1.11
***************
*** 86,92 ****
      using "the standard" test data set up instead (see below).

  rebal.py
      Evens out the number of messages in "standard" test data folders (see
!     below).
--- 86,98 ----
      using "the standard" test data set up instead (see below).

+ splitndirs.py
+     Like splitn.py (above), but splits an mbox into one message per file in
+     "the standard" directory structure (see below). This does an
+     approximate split; rebal.py (below) can be used afterwards to even out
+     the number of messages per folder.
+
  rebal.py
      Evens out the number of messages in "standard" test data folders (see
!     below). Needs generalization (e.g., Ham and 4000 are hardcoded now).

From tim_one@users.sourceforge.net Sun Sep 8 18:50:51 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 10:50:51 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.3,1.4
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19567

Modified Files:
	cmp.py
Log Message:
dump(): tiny simplification of print format.

Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** cmp.py	8 Sep 2002 17:10:03 -0000	1.3
--- cmp.py	8 Sep 2002 17:50:49 -0000	1.4
***************
*** 55,59 ****
      print
      for t in "won", "tied", "lost":
!         print "%-4s %2d %s" % (t, alltags.count(t), "times")
      print
--- 55,59 ----
      print
      for t in "won", "tied", "lost":
!         print "%-4s %2d times" % (t, alltags.count(t))
      print

From tim_one@users.sourceforge.net Sun Sep 8 19:21:26 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 11:21:26 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.4,1.5
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26263

Modified Files:
	cmp.py
Log Message:
Someone introduced a bug that resulted in half the f-p and f-n rates
getting ignored (every 2nd line of that type got skipped). Repaired it.

Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** cmp.py	8 Sep 2002 17:50:49 -0000	1.4
--- cmp.py	8 Sep 2002 18:21:24 -0000	1.5
***************
*** 25,29 ****
      if not line.startswith('Training'):
          # A line with an f-p rate and an f-n rate.
!         p, n = map(float, f.readline().split())
          fps.append(p)
          fns.append(n)
--- 25,29 ----
      if not line.startswith('Training'):
          # A line with an f-p rate and an f-n rate.
!         p, n = map(float, line.split())
          fps.append(p)
          fns.append(n)

From tim_one@users.sourceforge.net Sun Sep 8 19:39:01 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 08 Sep 2002 11:39:01 -0700
Subject: [Spambayes-checkins] spambayes cmp.py,1.5,1.6
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30785

Modified Files:
	cmp.py
Log Message:
Compute and display the %change for total unique fn and fp too.
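[tag() itself isn't shown in the diff that follows. A minimal sketch
consistent with the "won -25.00%" / "lost +(was 0)" / "tied" annotations seen
throughout these reports -- a hypothetical reconstruction, not the actual
cmp.py code:

    def tag(p1, p2):
        # Hypothetical reconstruction; mirrors the report format only.
        if p1 == p2:
            return "tied"
        if p1 == 0:
            return "lost  +(was 0)"
        pct = 100.0 * (p2 - p1) / p1
        return "%s %+.2f%%" % (p2 < p1 and "won" or "lost", pct)
]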
Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** cmp.py 8 Sep 2002 18:21:24 -0000 1.5 --- cmp.py 8 Sep 2002 18:38:59 -0000 1.6 *************** *** 66,73 **** print "false positive percentages" dump(fp1, fp2) ! print "total unique fp went from", fptot1, "to", fptot2 print print "false negative percentages" dump(fn1, fn2) ! print "total unique fn went from", fntot1, "to", fntot2 --- 66,73 ---- print "false positive percentages" dump(fp1, fp2) ! print "total unique fp went from", fptot1, "to", fptot2, tag(fptot1, fptot2) print print "false negative percentages" dump(fn1, fn2) ! print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2) From tim_one@users.sourceforge.net Sun Sep 8 19:54:12 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 11:54:12 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv32497 Modified Files: tokenizer.py Log Message: tokenize(): Stop distinguishing Content-XYZ thingies in the headers from instances in lower-level MIME sections. In all, doing so appears to be just another way of warping the tokenizer to c.l.py's extreme hatred of HTML. For example, '>content-type:text/plain' (lower-level instance) has a spamprob of 0.85 in my data, but 'content-type:text/plain' (top-level instance) has spamprob less than 0.25. A few examples Guido posted suggest this distinction does more harm on his data than it does good on mine. On mine, getting rid of the distinction makes a tiny difference in the f-n rates; note that an f-n boost from 0.327% to 0.364% represents a single msg in my ~2750-msg spam sets: false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.050 0.050 tied 0.025 0.025 tied 0.075 0.075 tied 0.000 0.000 tied 0.100 0.075 won -25.00% 0.050 0.075 lost +50.00% 0.025 0.025 tied 0.025 0.025 tied 0.050 0.050 tied 0.050 0.050 tied 0.050 0.050 tied 0.000 0.000 tied 0.000 0.000 tied 0.075 0.075 tied 0.025 0.025 tied 0.000 0.000 tied 0.025 0.025 tied 0.050 0.050 tied won 1 times tied 18 times lost 1 times total unique fp went from 13 to 12 won -7.69% false negative percentages 0.327 0.327 tied 0.400 0.400 tied 0.327 0.364 lost +11.31% 0.691 0.691 tied 0.545 0.545 tied 0.291 0.291 tied 0.218 0.291 lost +33.49% 0.654 0.618 won -5.50% 0.364 0.436 lost +19.78% 0.291 0.327 lost +12.37% 0.327 0.364 lost +11.31% 0.691 0.691 tied 0.582 0.618 lost +6.19% 0.291 0.291 tied 0.364 0.291 won -20.05% 0.436 0.436 tied 0.436 0.473 lost +8.49% 0.218 0.218 tied 0.291 0.291 tied 0.254 0.254 tied won 2 times tied 11 times lost 7 times total unique fn went from 106 to 110 lost +3.77% Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** tokenizer.py 8 Sep 2002 17:18:41 -0000 1.5 --- tokenizer.py 8 Sep 2002 18:54:09 -0000 1.6 *************** *** 757,765 **** # Content-{Type, Disposition} and their params, and charsets. - t = '' for x in msg.walk(): for w in crack_content_xyz(x): ! yield t + w ! t = '>' # Subject: --- 757,763 ---- # Content-{Type, Disposition} and their params, and charsets. for x in msg.walk(): for w in crack_content_xyz(x): ! 
yield w # Subject: From tim_one@users.sourceforge.net Sun Sep 8 22:08:18 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 14:08:18 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.12,1.13 tokenizer.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv4417 Modified Files: timtest.py tokenizer.py Log Message: tokenize_word(): Stopped splitting the y in x@y on '.'. Improved the f-n rate. The big loser for f-p was a message consisting entirely of "Thanks guys", posted from an x@y address where y had a 0.99 spamprob, but where y split in pieces had two significantly lower spamprobs. Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** timtest.py 7 Sep 2002 16:15:45 -0000 1.12 --- timtest.py 8 Sep 2002 21:08:16 -0000 1.13 *************** *** 107,111 **** random.seed(hash(directory)) random.shuffle(all) ! for fname in all[-500:]: yield Msg(directory, fname) --- 107,111 ---- random.seed(hash(directory)) random.shuffle(all) ! for fname in all[-1500:-1000:]: yield Msg(directory, fname) Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** tokenizer.py 8 Sep 2002 18:54:09 -0000 1.6 --- tokenizer.py 8 Sep 2002 21:08:16 -0000 1.7 *************** *** 569,577 **** # Don't want to skip embedded email addresses. if n < 40 and '.' in word and word.count('@') == 1: p1, p2 = word.split('@') yield 'email name:' + p1 ! for piece in p2.split('.'): ! yield 'email addr:' + piece # If there are any high-bit chars, --- 569,578 ---- # Don't want to skip embedded email addresses. + # An earlier scheme also split up the y in x@y on '.'. Not splitting + # improved the f-n rate; the f-p rate didn't care either way. if n < 40 and '.' in word and word.count('@') == 1: p1, p2 = word.split('@') yield 'email name:' + p1 ! 
yield 'email addr:' + p2 # If there are any high-bit chars, From tim_one@users.sourceforge.net Sun Sep 8 22:29:07 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 14:29:07 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.7,1.8 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv10235 Modified Files: tokenizer.py Log Message: Fixed grammar in a comment, just because I forgot to post the new rates after the last checkin (to simplify parsing of email addresses): false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.050 0.100 lost +100.00% 0.025 0.025 tied 0.075 0.050 won -33.33% 0.000 0.000 tied 0.075 0.075 tied 0.075 0.050 won -33.33% 0.025 0.025 tied 0.025 0.025 tied 0.050 0.050 tied 0.050 0.050 tied 0.050 0.050 tied 0.000 0.000 tied 0.000 0.000 tied 0.075 0.075 tied 0.025 0.025 tied 0.000 0.000 tied 0.025 0.025 tied 0.050 0.100 lost +100.00% won 2 times tied 16 times lost 2 times total unique fp went from 12 to 14 lost +16.67% false negative percentages 0.327 0.291 won -11.01% 0.400 0.364 won -9.00% 0.364 0.254 won -30.22% 0.691 0.582 won -15.77% 0.545 0.545 tied 0.291 0.218 won -25.09% 0.291 0.218 won -25.09% 0.618 0.654 lost +5.83% 0.436 0.364 won -16.51% 0.327 0.255 won -22.02% 0.364 0.400 lost +9.89% 0.691 0.654 won -5.35% 0.618 0.618 tied 0.291 0.291 tied 0.291 0.291 tied 0.436 0.436 tied 0.473 0.436 won -7.82% 0.218 0.218 tied 0.291 0.255 won -12.37% 0.254 0.182 won -28.35% won 12 times tied 6 times lost 2 times total unique fn went from 110 to 101 won -8.18% Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** tokenizer.py 8 Sep 2002 21:08:16 -0000 1.7 --- tokenizer.py 8 Sep 2002 21:29:05 -0000 1.8 *************** *** 604,609 **** # all the charsets # ! # This has huge benefit for the f-n rate, and virtually none on the f-p rate, ! # although it does reduce the variance of the f-p rate across different # training sets (really marginal msgs, like a brief HTML msg saying just # "unsubscribe me", are almost always tagged as spam now; before they were --- 604,609 ---- # all the charsets # ! # This has huge benefit for the f-n rate, and virtually no effect on the f-p ! # rate, although it does reduce the variance of the f-p rate across different # training sets (really marginal msgs, like a brief HTML msg saying just # "unsubscribe me", are almost always tagged as spam now; before they were From tim_one@users.sourceforge.net Mon Sep 9 00:48:52 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 16:48:52 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.13,1.14 tokenizer.py,1.8,1.9 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv10431 Modified Files: timtest.py tokenizer.py Log Message: Tried to treat src= params specially. It made no difference, so left the code but commented it out. Refactored code to parse "file names" as part of this, and left that change in. Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** timtest.py 8 Sep 2002 21:08:16 -0000 1.13 --- timtest.py 8 Sep 2002 23:48:50 -0000 1.14 *************** *** 141,147 **** self.trained_spam_hist = Hist(self.nbuckets) ! #f = file('w.pik', 'wb') ! 
#pickle.dump(self.classifier, f, 1) ! #f.close() #import sys #sys.exit(0) --- 141,147 ---- self.trained_spam_hist = Hist(self.nbuckets) ! f = file('w.pik', 'wb') ! pickle.dump(self.classifier, f, 1) ! f.close() #import sys #sys.exit(0) Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** tokenizer.py 8 Sep 2002 21:29:05 -0000 1.8 --- tokenizer.py 8 Sep 2002 23:48:50 -0000 1.9 *************** *** 558,561 **** --- 558,587 ---- subject_word_re = re.compile(r"[\w\x80-\xff$.%]+") + # Anthony Baxter reported goodness from cracking src params. + # Finding a src= thingie is complicated if we insist it appear in an + # img or iframe tag, so this approximates reality with a fast and + # non-stack-blowing simple regexp. + src_re = re.compile(r""" + \s + src=['"] + (?!https?:) # we suck out http thingies via a different gimmick + ([^'"]{1,128}) # capture the guts, but don't go wild + ['"] + """, re.VERBOSE) + + fname_sep_re = re.compile(r'[/\\:]') + + def crack_filename(fname): + yield "fname:" + fname + components = fname_sep_re.split(fname) + morethan1 = len(components) > 1 + for component in components: + if morethan1: + yield "fname comp:" + component + pieces = urlsep_re.split(component) + if len(pieces) > 1: + for piece in pieces: + yield "fname piece:" + piece + def tokenize_word(word, _len=len): n = _len(word) *************** *** 701,707 **** fname = msg.get_filename() if fname is not None: ! for x in fname.lower().split('/'): ! for y in x.split('.'): ! yield 'filename:' + y if 0: # disabled; see comment before function --- 727,732 ---- fname = msg.get_filename() if fname is not None: ! for x in crack_filename(fname): ! yield 'filename:' + x if 0: # disabled; see comment before function *************** *** 874,877 **** --- 899,913 ---- for chunk in urlsep_re.split(piece): yield prefix + chunk + + # Anthony Baxter reported goodness from tokenizing src= params. + # XXX This made no difference in my tests: both error rates + # XXX across 20 runs were identical before and after. I suspect + # XXX this is because Anthony got most good out of the http + # XXX thingies in , but we + # XXX picked those up in the last step (in src params and + # XXX everywhere else). So this code is commented out. + ## for fname in src_re.findall(text): + ## for x in crack_filename(fname): + ## yield "src:" + x # Remove HTML/XML tags if it's a plain text message. From tim_one@users.sourceforge.net Mon Sep 9 00:49:53 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 16:49:53 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.14,1.15 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv10775 Modified Files: timtest.py Log Message: Oops -- checked in a private change by mistake. Backing it out. Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** timtest.py 8 Sep 2002 23:48:50 -0000 1.14 --- timtest.py 8 Sep 2002 23:49:51 -0000 1.15 *************** *** 141,147 **** self.trained_spam_hist = Hist(self.nbuckets) ! f = file('w.pik', 'wb') ! pickle.dump(self.classifier, f, 1) ! f.close() #import sys #sys.exit(0) --- 141,147 ---- self.trained_spam_hist = Hist(self.nbuckets) ! #f = file('w.pik', 'wb') ! 
#pickle.dump(self.classifier, f, 1) ! #f.close() #import sys #sys.exit(0) From gvanrossum@users.sourceforge.net Mon Sep 9 00:53:25 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Sun, 08 Sep 2002 16:53:25 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.11,1.12 GBayes.py,1.1,NONE Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11378 Modified Files: README.txt Removed Files: GBayes.py Log Message: Get rid of GBayes.py. It was old and the relevant pieces are now in hammie.py. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** README.txt 8 Sep 2002 17:46:14 -0000 1.11 --- README.txt 8 Sep 2002 23:53:23 -0000 1.12 *************** *** 49,57 **** tokenize() function of your choosing. - GBayes.py - A number of tokenizers and a partial test driver. This assumes - an mbox format. Could stand massive refactoring. I don't think - it's been kept up to date. - Test Utilities --- 49,52 ---- --- GBayes.py DELETED --- From tim_one@users.sourceforge.net Mon Sep 9 05:56:14 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 08 Sep 2002 21:56:14 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.9,1.10 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv29522 Modified Files: tokenizer.py Log Message: Pure win, from enabling Anthony's "count the mere # of various header lines, case-sensitively" on a small subset of header lines. This avoids all the header lines the union of Greg and Barry told me *might* be artifacts of Mailman and/or BruceG's (the spam collector's) email setup. It's an open question how much this may merely be discriminating newsgroup traffic from non-newsgroup mail, but I also left out what I thought were obvious newsgroupy headers (like References:). The presence of X-Complaints-To happens to be a very strong discriminator in my data, and accounts for redeeming 6 of the 14 previous false positives. 
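In isolation the trick is tiny. A minimal sketch (safe_headers is abridged here; the full set and the checked-in version are in the tokenizer.py diff below):

    from sets import Set

    safe_headers = Set("subject to x-complaints-to".split())  # abridged

    def count_safe_header_lines(msg):
        # msg is an email.Message.Message.  keys() preserves the original
        # case, so "SUBJECT" and "Subject" yield different tokens.
        x2n = {}
        for x in msg.keys():
            if x.lower() in safe_headers:
                x2n[x] = x2n.get(x, 0) + 1
        for x in x2n.items():
            yield "header:%s:%d" % x   # e.g. "header:X-Complaints-To:1"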
false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.100 0.050 won -50.00% 0.025 0.000 won -100.00% 0.050 0.025 won -50.00% 0.000 0.000 tied 0.075 0.075 tied 0.050 0.025 won -50.00% 0.025 0.025 tied 0.025 0.000 won -100.00% 0.050 0.050 tied 0.050 0.000 won -100.00% 0.050 0.025 won -50.00% 0.000 0.000 tied 0.000 0.000 tied 0.075 0.050 won -33.33% 0.025 0.025 tied 0.000 0.000 tied 0.025 0.025 tied 0.100 0.050 won -50.00% won 9 times tied 11 times lost 0 times total unique fp went from 14 to 8 won -42.86% false negative percentages 0.291 0.255 won -12.37% 0.364 0.364 tied 0.254 0.254 tied 0.582 0.509 won -12.54% 0.545 0.436 won -20.00% 0.218 0.218 tied 0.218 0.182 won -16.51% 0.654 0.582 won -11.01% 0.364 0.327 won -10.16% 0.255 0.255 tied 0.400 0.254 won -36.50% 0.654 0.582 won -11.01% 0.618 0.545 won -11.81% 0.291 0.255 won -12.37% 0.291 0.291 tied 0.436 0.400 won -8.26% 0.436 0.291 won -33.26% 0.218 0.218 tied 0.255 0.218 won -14.51% 0.182 0.145 won -20.33% won 14 times tied 6 times lost 0 times total unique fn went from 101 to 89 won -11.88% Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** tokenizer.py 8 Sep 2002 23:48:50 -0000 1.9 --- tokenizer.py 9 Sep 2002 04:56:12 -0000 1.10 *************** *** 745,748 **** --- 745,770 ---- yield '.'.join(parts[:i]) + # We're merely going to count the number of these, and case-sensitively. + safe_headers = Set(""" + abuse-reports-to + date + errors-to + from + importance + in-reply-to + message-id + mime-version + organization + received + reply-to + return-path + subject + to + user-agent + x-abuse-info + x-complaints-to + x-face + """.split()) + class Tokenizer: *************** *** 823,835 **** yield prefix + ' '.join(x.split()) - # Organization: - # Oddly enough, tokenizing this doesn't make any difference to - # results. However, noting its mere absence is strong enough - # to give a tiny improvement in the f-n rate, and since - # recording that requires only one token across the whole - # database, the cost is also tiny. - if msg.get('organization', None) is None: - yield "bool:noorg" - # Received: # Neil Schemenauer reported good results from tokenizing prefixes --- 845,848 ---- *************** *** 867,870 **** --- 880,891 ---- ##for x in x2n.items(): ## yield "header:%s:%d" % x + + # Do a "safe" approximation to that for now. + x2n = {} + for x in msg.keys(): + if x.lower() in safe_headers: + x2n[x] = x2n.get(x, 0) + 1 + for x in x2n.items(): + yield "header:%s:%d" % x def tokenize_body(self, msg): From tim_one@users.sourceforge.net Mon Sep 9 17:19:41 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 09 Sep 2002 09:19:41 -0700 Subject: [Spambayes-checkins] spambayes Options.py,NONE,1.1 README.txt,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20464 Modified Files: README.txt Added Files: Options.py Log Message: Options.options is intended to be shared global state, for customizing what the classifier and tokenizer do in a controlled and reportable way (note that options.display() produces a nice string spelling out the options in effect). Nothing uses this yet. --- NEW FILE: Options.py --- from sets import Set # Descriptions of options. # Empty lines, and lines starting with a blank, are ignored. 
# A line starting with a non-blank character is of the form: # option_name "default" default_value # option_name must not contain whitespace # default_value must be eval'able. option_descriptions = """ retain_pure_html_tags default False By default, HTML tags are stripped from pure text/html messages. Set retain_pure_html_tags True to retain HTML tags in this case. """ class OptionsClass(dict): def __init__(self): self.optnames = Set() for line in option_descriptions.split('\n'): if not line or line.startswith(' '): continue i = line.index(' ') name = line[:i] self.optnames.add(name) i = line.index(' default ', i) self.setopt(name, eval(line[i+9:], {})) def _checkname(self, name): if name not in self.optnames: raise KeyError("there's no option named %r" % name) def setopt(self, name, value): self._checkname(name) self[name] = value def display(self): """Return a string showing current option values.""" result = ['Option values:\n'] width = max([len(name) for name in self.keys()]) items = self.items() items.sort() for name, value in items: result.append(' %-*s: %r\n' % (width, name, value)) return ''.join(result) options = OptionsClass() Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** README.txt 8 Sep 2002 23:53:23 -0000 1.12 --- README.txt 9 Sep 2002 16:19:39 -0000 1.13 *************** *** 22,25 **** --- 22,32 ---- Primary Files ============= + Options.py + A start at a flexible way to control what the tokenizer and + classifier do. Different people are finding different ways in + which their test data is biased, and so fiddle the code to + worm around that. It's become almost impossible to know + exactly what someone did when they report results. + classifier.py An implementation of a Graham-like classifier. From tim_one@users.sourceforge.net Mon Sep 9 17:39:31 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 09 Sep 2002 09:39:31 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.15,1.16 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv27227 Modified Files: timtest.py Log Message: There's now a required int argument (-n) giving the number of ham/spam sets in "the standard" test directory setup. Also attempts to import bayescustomize. If that exists, it can be used to fiddle the settings in Options.options. Regardless of whether bayescustomize exists, the settings in Options.options are now displayed at the start of the run. Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** timtest.py 8 Sep 2002 23:49:51 -0000 1.15 --- timtest.py 9 Sep 2002 16:39:27 -0000 1.16 *************** *** 1,10 **** #! /usr/bin/env python ! NSETS = 5 ! SPAMDIRS = ["Data/Spam/Set%d" % i for i in range(1, NSETS+1)] ! HAMDIRS = ["Data/Ham/Set%d" % i for i in range(1, NSETS+1)] ! SPAMHAMDIRS = zip(SPAMDIRS, HAMDIRS) import os from sets import Set import cPickle as pickle --- 1,23 ---- #! /usr/bin/env python + # At the moment, this requires Python 2.3 from CVS (heapq, Set, enumerate). ! # A test driver using "the standard" test directory structure. See also ! # rates.py and cmp.py for summarizing results. ! ! """Usage: %(program)s [options] ! ! Where: ! -h ! Show usage and exit. ! -n int ! 
Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...). ! This is required. ! ! In addition, an attempt is made to import bayescustomize. If that exists, ! it can be used to change the settings in Options.options. ! """ import os + import sys from sets import Set import cPickle as pickle *************** *** 15,18 **** --- 28,39 ---- from tokenizer import tokenize + def usage(code, msg=''): + """Print usage message and sys.exit(code).""" + if msg: + print >> sys.stderr, msg + print >> sys.stderr + print >> sys.stderr, __doc__ % globals() + sys.exit(code) + class Hist: def __init__(self, nbuckets=20): *************** *** 217,226 **** self.trained_spam_hist += local_spam_hist ! def drive(): ! d = Driver() ! for spamdir, hamdir in SPAMHAMDIRS: d.train(MsgStream(hamdir), MsgStream(spamdir)) ! for sd2, hd2 in SPAMHAMDIRS: if (sd2, hd2) == (spamdir, hamdir): continue --- 238,254 ---- self.trained_spam_hist += local_spam_hist ! def drive(nsets): ! import Options ! print Options.options.display() ! ! spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)] ! hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets+1)] ! spamhamdirs = zip(spamdirs, hamdirs) ! ! d = Driver() ! for spamdir, hamdir in spamhamdirs: d.train(MsgStream(hamdir), MsgStream(spamdir)) ! for sd2, hd2 in spamhamdirs: if (sd2, hd2) == (spamdir, hamdir): continue *************** *** 230,232 **** if __name__ == "__main__": ! drive() --- 258,282 ---- if __name__ == "__main__": ! import getopt ! ! try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:') ! except getopt.error, msg: ! usage(1, msg) ! ! nsets = None ! for opt, arg in opts: ! if opt == '-h': ! usage(0) ! elif opt == '-n': ! nsets = int(arg) ! ! if args: ! usage(1, "Positional arguments not supported") ! ! try: ! import bayescustomize ! except ImportError: ! pass ! ! drive(nsets) From tim_one@users.sourceforge.net Mon Sep 9 17:31:54 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 09 Sep 2002 09:31:54 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.10,1.11 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25291 Modified Files: tokenizer.py Log Message: Whether the tokenizer strips HTML tags from pure HTML msgs is now controlled by the the setting of Options.options['retain_pure_html_tags']. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** tokenizer.py 9 Sep 2002 04:56:12 -0000 1.10 --- tokenizer.py 9 Sep 2002 16:31:50 -0000 1.11 *************** *** 5,8 **** --- 5,10 ---- from sets import Set + from Options import options + ############################################################################## # To fold case or not to fold case? I didn't want to fold case, because *************** *** 890,893 **** --- 892,907 ---- def tokenize_body(self, msg): + """Generate a stream of tokens from an email Message. + + If a multipart/alternative section has both text/plain and text/html + sections, the text/html section is ignored. This may not be a good + idea (e.g., the sections may have different content). + + HTML tags are always stripped from text/plain sections. + + Options.options['retain_pure_html_tags'] controls whether HTML tags are + also stripped from text/html sections. + """ + # Find, decode (base64, qp), and tokenize textual parts of the body. 
for part in textparts(msg): *************** *** 932,937 **** ## yield "src:" + x ! # Remove HTML/XML tags if it's a plain text message. ! if part.get_content_type() == "text/plain": text = html_re.sub(' ', text) --- 946,952 ---- ## yield "src:" + x ! # Remove HTML/XML tags. ! if (part.get_content_type() == "text/plain" or ! not options['retain_pure_html_tags']): text = html_re.sub(' ', text) From tim_one@users.sourceforge.net Mon Sep 9 19:49:21 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 09 Sep 2002 11:49:21 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.1,1.2 tokenizer.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv28560 Modified Files: Options.py tokenizer.py Log Message: Moved the safe_headers set into the options. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** Options.py 9 Sep 2002 16:19:38 -0000 1.1 --- Options.py 9 Sep 2002 18:49:18 -0000 1.2 *************** *** 10,15 **** option_descriptions = """ retain_pure_html_tags default False ! By default, HTML tags are stripped from pure text/html messages. ! Set retain_pure_html_tags True to retain HTML tags in this case. """ --- 10,24 ---- option_descriptions = """ retain_pure_html_tags default False ! By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags ! stripped from pure text/html messages. Set to True to retain HTML tags ! in this case. ! ! safe_headers default Set("abuse-reports-to date errors-to from importance in-reply-to message-id mime-version organization received reply-to return-path subject to user-agent x-abuse-info x-complaints-to x-face".split()) ! tokenizer.Tokenizer.tokenize_headers() generates tokens just counting ! the number of instances of the headers in this set, in a case-sensitive ! way. Depending on data collection, some headers aren't safe to count. ! For example, if ham is collected from a mailing list but spam from your ! regular inbox traffic, the presence of a header like List-Info will be a ! very strong ham clue, but a bogus one. """ *************** *** 17,20 **** --- 26,30 ---- def __init__(self): self.optnames = Set() + evaldict = {'Set': Set} for line in option_descriptions.split('\n'): if not line or line.startswith(' '): *************** *** 24,28 **** self.optnames.add(name) i = line.index(' default ', i) ! self.setopt(name, eval(line[i+9:], {})) def _checkname(self, name): --- 34,38 ---- self.optnames.add(name) i = line.index(' default ', i) ! self.setopt(name, eval(line[i+9:], evaldict)) def _checkname(self, name): Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** tokenizer.py 9 Sep 2002 16:31:50 -0000 1.11 --- tokenizer.py 9 Sep 2002 18:49:19 -0000 1.12 *************** *** 747,772 **** yield '.'.join(parts[:i]) - # We're merely going to count the number of these, and case-sensitively. 
- safe_headers = Set(""" - abuse-reports-to - date - errors-to - from - importance - in-reply-to - message-id - mime-version - organization - received - reply-to - return-path - subject - to - user-agent - x-abuse-info - x-complaints-to - x-face - """.split()) - class Tokenizer: --- 747,750 ---- *************** *** 884,887 **** --- 862,866 ---- # Do a "safe" approximation to that for now. + safe_headers = options['safe_headers'] x2n = {} for x in msg.keys(): From montanaro@users.sourceforge.net Mon Sep 9 20:23:26 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Mon, 09 Sep 2002 12:23:26 -0700 Subject: [Spambayes-checkins] spambayes loosecksum.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2588 Added Files: loosecksum.py Log Message: calculate a "loose" checksum for an email message --- NEW FILE: loosecksum.py --- #!/usr/local/bin/python """ Compute a 'loose' checksum on the msg (file on cmdline or via stdin). Attempts are made to eliminate content which tends to obscure the 'sameness' of messages. This is aimed particularly at spam, which tends to contains lots of small differences across messages to try and thwart spam filters, in hopes that at least one copy reaches its desitination. Before calculating the checksum, this script does the following: * delete the message header * delete HTML tags which generally contain URLs * delete anything which looks like an email address or URL * finally, discard everything other than ascii letters and digits (note that this will almost certainly be ineffectual for spam written in eastern languages such as Korean) An MD5 checksum is then computed for the resulting text and written to stdout. """ import getopt import sys import email.Parser import md5 import re import time import binascii def zaptags(data, *tags): """delete all tags (and /tags) from input data given as arguments""" for pat in tags: pat = pat.split(":") sub = "" if len(pat) >= 2: sub = pat[-1] pat = ":".join(pat[:-1]) else: pat = pat[0] sub = "" if '\\' in sub: sub = _zap_esc_map(sub) try: data = re.sub(r'(?i)]*)?>'%pat, sub, data) except TypeError: print (pat, sub, data) raise return data def clean(data): """Clean the obviously variable stuff from a chunk of data. The first (and perhaps only) use of this is to try and eliminate bits of data that keep multiple spam email messages from looking the same. """ # Get rid of any HTML tags that hold URLs - tend to have varying content # I suppose i could just get rid of all HTML tags data = zaptags(data, 'a', 'img', 'base', 'frame') # delete anything that looks like an email address data = re.sub(r"(?i)[-a-z0-9_.+]+@[-a-z0-9_.]+\.([a-z]+)", "", data) # delete anything that looks like a url (catch bare urls) data = re.sub(r"(?i)(ftp|http|gopher)://[-a-z0-9_/?&%@=+:;#!~|.,$*]+", "", data) # throw away everything other than alpha & digits return re.sub(r"[^A-Za-z0-9]+", "", data) def flatten(obj): # I do not know how to use the email package very well - all I want here # is the body of obj expressed as a string - there is probably a better # way to accomplish this which I haven't discovered. 
# three types are possible: string, Message (hasattr(get_payload)), list if isinstance(obj, str): return obj if hasattr(obj, "get_payload"): return flatten(obj.get_payload()) if isinstance(obj, list): return "\n".join([flatten(b) for b in obj]) raise TypeError, ("unrecognized body type: %s" % type(obj)) def generate_checksum(f): body = flatten(email.Parser.Parser().parse(f)) return binascii.b2a_hex(md5.new(clean(body)).digest()) def main(args): opts, args = getopt.getopt(args, "") for opt, arg in opts: pass if not args: inf = sys.stdin else: inf = file(args[0]) print generate_checksum(inf) if __name__ == "__main__": main(sys.argv[1:]) From montanaro@users.sourceforge.net Mon Sep 9 20:24:54 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Mon, 09 Sep 2002 12:24:54 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.13,1.14 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2873 Modified Files: README.txt Log Message: add blurb about loosecksum.py Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** README.txt 9 Sep 2002 16:19:39 -0000 1.13 --- README.txt 9 Sep 2002 19:24:52 -0000 1.14 *************** *** 79,82 **** --- 79,86 ---- useful to delete headers which incorrectly might bias the results. + loosecksum.py + A script to calculate a "loose" checksum for a message. See the text of + the script for an operational definition of "loose". + mboxcount.py Count the number of messages (both parseable and unparseable) in From nascheme@users.sourceforge.net Mon Sep 9 22:21:56 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Mon, 09 Sep 2002 14:21:56 -0700 Subject: [Spambayes-checkins] spambayes cdb.py,NONE,1.1 neilfilter.py,NONE,1.1 neiltrain.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8534 Added Files: cdb.py neilfilter.py neiltrain.py Log Message: Add a pure Python implementation of CDB and two scripts that use it. It seems pretty zippy for both reading and creating.
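Usage is dict-like. A quick round trip, patterned on the module's own test() function (the file name is arbitrary):

    import cdb

    fp = open('test.cdb', 'wb')
    cdb.cdb_make(fp, [('one', 'Hello'), ('two', 'Goodbye')])
    fp.close()

    db = cdb.Cdb(open('test.cdb', 'rb'))
    print db['one']            # Hello
    print db.get('notthere')   # None -- dict-like miss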
--- NEW FILE: cdb.py --- """ Dan Bernstein's CDB implemented in Python see http://cr.yp.to/cdb.html """ import os import struct import mmap import sys def uint32_unpack(buf): return struct.unpack('>= 8 u %= self.hslots u <<= 3 self.kpos = self.hpos + u while self.loop < self.hslots: buf = self.read(8, self.kpos) pos = uint32_unpack(buf[4:]) if not pos: raise KeyError self.loop += 1 self.kpos += 8 if self.kpos == self.hpos + (self.hslots << 3): self.kpos = self.hpos u = uint32_unpack(buf[:4]) if u == self.khash: buf = self.read(8, pos) u = uint32_unpack(buf[:4]) if u == len(key): if self.match(key, pos + 8): dlen = uint32_unpack(buf[4:]) dpos = pos + 8 + len(key) return self.read(dlen, dpos) raise KeyError def __getitem__(self, key): self.findstart() return self.findnext(key) def get(self, key, default=None): self.findstart() try: return self.findnext(key) except KeyError: return default def cdb_make(outfile, items): pos = 2048 tables = {} # { h & 255 : [(h, p)] } # write keys and data outfile.seek(pos) for key, value in items: outfile.write(uint32_pack(len(key)) + uint32_pack(len(value))) h = cdb_hash(key) outfile.write(key) outfile.write(value) tables.setdefault(h & 255, []).append((h, pos)) pos += 8 + len(key) + len(value) final = '' # write hash tables for i in range(256): entries = tables.get(i, []) nslots = 2*len(entries) final += uint32_pack(pos) + uint32_pack(nslots) null = (0, 0) table = [null] * nslots for h, p in entries: n = (h >> 8) % nslots while table[n] is not null: n = (n + 1) % nslots table[n] = (h, p) for h, p in table: outfile.write(uint32_pack(h) + uint32_pack(p)) pos += 8 # write header (pointers to tables and their lengths) outfile.flush() outfile.seek(0) outfile.write(final) def test(): #db = Cdb(open("t")) #print db['one'] #print db['two'] #print db['foo'] #print db['us'] #print db.get('ec') #print db.get('notthere') db = open('test.cdb', 'wb') cdb_make(db, [('one', 'Hello'), ('two', 'Goodbye'), ('foo', 'Bar'), ('us', 'United States'), ]) db.close() db = Cdb(open("test.cdb", 'rb')) print db['one'] print db['two'] print db['foo'] print db['us'] print db.get('ec') print db.get('notthere') if __name__ == '__main__': test() --- NEW FILE: neilfilter.py --- #! /usr/bin/env python """Usage: %(program)s wordprobs.cdb """ import sys import os import email from heapq import heapreplace from sets import Set from classifier import MIN_SPAMPROB, MAX_SPAMPROB, UNKNOWN_SPAMPROB, \ MAX_DISCRIMINATORS import cdb program = sys.argv[0] # For usage(); referenced by docstring above from tokenizer import tokenize def spamprob(wordprobs, wordstream, evidence=False): """Return best-guess probability that wordstream is spam. wordprobs is a CDB of word probabilities wordstream is an iterable object producing words. The return value is a float in [0.0, 1.0]. If optional arg evidence is True, the return value is a pair probability, evidence where evidence is a list of (word, probability) pairs. """ # A priority queue to remember the MAX_DISCRIMINATORS best # probabilities, where "best" means largest distance from 0.5. # The tuples are (distance, prob, word). nbest = [(-1.0, None, None)] * MAX_DISCRIMINATORS smallest_best = -1.0 mins = [] # all words w/ prob MIN_SPAMPROB maxs = [] # all words w/ prob MAX_SPAMPROB # Counting a unique word multiple times hurts, although counting one # at most two times had some benefit whan UNKNOWN_SPAMPROB was 0.2. # When that got boosted to 0.5, counting more than once became # counterproductive. 
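    # Each unique token lands in one of three places: words at the extreme
    # probabilities are set aside in mins/maxs so competing extremes can
    # cancel each other out below; everything else competes for a slot in
    # the nbest heap.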
for word in Set(wordstream): prob = float(wordprobs.get(word, UNKNOWN_SPAMPROB)) distance = abs(prob - 0.5) if prob == MIN_SPAMPROB: mins.append((distance, prob, word)) elif prob == MAX_SPAMPROB: maxs.append((distance, prob, word)) elif distance > smallest_best: # Subtle: we didn't use ">" instead of ">=" just to save # calls to heapreplace(). The real intent is that if # there are many equally strong indicators throughout the # message, we want to favor the ones that appear earliest: # it's expected that spam headers will often have smoking # guns, and, even when not, spam has to grab your attention # early (& note that when spammers generate large blocks of # random gibberish to throw off exact-match filters, it's # always at the end of the msg -- if they put it at the # start, *nobody* would read the msg). heapreplace(nbest, (distance, prob, word)) smallest_best = nbest[0][0] # Compute the probability. Note: This is what Graham's code did, # but it's dubious for reasons explained in great detail on Python- # Dev: it's missing P(spam) and P(not-spam) adjustments that # straightforward Bayesian analysis says should be here. It's # unclear how much it matters, though, as the omissions here seem # to tend in part to cancel out distortions introduced earlier by # HAMBIAS. Experiments will decide the issue. clues = [] # First cancel out competing extreme clues (see comment block at # MAX_DISCRIMINATORS declaration -- this is a twist on Graham). if mins or maxs: if len(mins) < len(maxs): shorter, longer = mins, maxs else: shorter, longer = maxs, mins tokeep = min(len(longer) - len(shorter), MAX_DISCRIMINATORS) # They're all good clues, but we're only going to feed the tokeep # initial clues from the longer list into the probability # computation. for dist, prob, word in shorter + longer[tokeep:]: if evidence: clues.append((word, prob)) for x in longer[:tokeep]: heapreplace(nbest, x) prob_product = inverse_prob_product = 1.0 for distance, prob, word in nbest: if prob is None: # it's one of the dummies nbest started with continue if evidence: clues.append((word, prob)) prob_product *= prob inverse_prob_product *= 1.0 - prob prob = prob_product / (prob_product + inverse_prob_product) if evidence: clues.sort(lambda a, b: cmp(a[1], b[1])) return prob, clues else: return prob def formatclues(clues, sep="; "): """Format the clues into something readable.""" return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) def is_spam(wordprobs, input): """Filter (judge) a message""" msg = email.message_from_file(input) prob, clues = spamprob(wordprobs, tokenize(msg), True) #print "%.2f;" % prob, formatclues(clues) if prob < 0.9: return False else: return True def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def main(): if len(sys.argv) != 2: usage(2) wordprobs = cdb.Cdb(open(sys.argv[1], 'rb')) if is_spam(wordprobs, sys.stdin): sys.exit(1) else: sys.exit(0) if __name__ == "__main__": main() --- NEW FILE: neiltrain.py --- #! 
/usr/bin/env python """Usage: %(program)s spam.mbox ham.mbox wordprobs.cdb """ import sys import os import mailbox import email import classifier import cdb program = sys.argv[0] # For usage(); referenced by docstring above from tokenizer import tokenize def getmbox(msgs): """Return an iterable mbox object""" def _factory(fp): try: return email.message_from_file(fp) except email.Errors.MessageParseError: return '' if msgs.startswith("+"): import mhlib mh = mhlib.MH() mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]), _factory) elif os.path.isdir(msgs): # XXX Bogus: use an MHMailbox if the pathname contains /Mail/, # else a DirOfTxtFileMailbox. if msgs.find("/Mail/") >= 0: mbox = mailbox.MHMailbox(msgs, _factory) else: mbox = DirOfTxtFileMailbox(msgs, _factory) else: fp = open(msgs) mbox = mailbox.PortableUnixMailbox(fp, _factory) return mbox def train(bayes, msgs, is_spam): """Train bayes with all messages from a mailbox.""" mbox = getmbox(msgs) for msg in mbox: bayes.learn(tokenize(msg), is_spam, False) def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def main(): """Main program; parse options and go.""" if len(sys.argv) != 4: usage(2) spam_name = sys.argv[1] ham_name = sys.argv[2] db_name = sys.argv[3] bayes = classifier.GrahamBayes() print 'Training with spam...' train(bayes, spam_name, True) print 'Training with ham...' train(bayes, ham_name, False) print 'Updating probabilities...' bayes.update_probabilities() items = [] for word, winfo in bayes.wordinfo.iteritems(): #print `word`, str(winfo.spamprob) items.append((word, str(winfo.spamprob))) print 'Writing DB...' db = open(db_name, "wb") cdb.cdb_make(db, items) db.close() print 'done' if __name__ == "__main__": main() From tim_one@users.sourceforge.net Tue Sep 10 01:06:39 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 09 Sep 2002 17:06:39 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.3,1.4 bayes.ini,1.1,1.2 timtest.py,1.17,1.18 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30316 Modified Files: Options.py bayes.ini timtest.py Log Message: Added a bunch of options to control the test driver. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** Options.py 9 Sep 2002 20:37:14 -0000 1.3 --- Options.py 10 Sep 2002 00:06:36 -0000 1.4 *************** *** 9,18 **** from sets import Set ! __all__ = ['buildoptions', 'options'] all_options = { ! 'Tokenizer': {'retain_pure_html_tags': ('getboolean', lambda i: bool(i)), 'safe_headers': ('get', lambda s: Set(s.split())), }, } --- 9,32 ---- from sets import Set ! __all__ = ['options'] ! ! int_cracker = ('getint', None) ! float_cracker = ('getfloat', None) ! boolean_cracker = ('getboolean', bool) all_options = { ! 
'Tokenizer': {'retain_pure_html_tags': boolean_cracker, 'safe_headers': ('get', lambda s: Set(s.split())), }, + 'TestDriver': {'nbuckets': int_cracker, + 'show_ham_lo': float_cracker, + 'show_ham_hi': float_cracker, + 'show_spam_lo': float_cracker, + 'show_spam_hi': float_cracker, + 'show_false_positives': boolean_cracker, + 'show_false_negatives': boolean_cracker, + 'show_histograms': boolean_cracker, + 'show_best_discriminators': boolean_cracker, + } } *************** *** 39,44 **** continue fetcher, converter = goodopts[option] ! rawvalue = getattr(c, fetcher)(section, option) ! value = converter(rawvalue) setattr(options, option, value) --- 53,59 ---- continue fetcher, converter = goodopts[option] ! value = getattr(c, fetcher)(section, option) ! if converter is not None: ! value = converter(value) setattr(options, option, value) Index: bayes.ini =================================================================== RCS file: /cvsroot/spambayes/spambayes/bayes.ini,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** bayes.ini 9 Sep 2002 20:37:14 -0000 1.1 --- bayes.ini 10 Sep 2002 00:06:37 -0000 1.2 *************** *** 1,6 **** [Tokenizer] # By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags ! # stripped from pure text/html messages. Set to True to retain HTML tags ! # in this case. retain_pure_html_tags: False --- 1,6 ---- [Tokenizer] # By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags ! # from pure text/html messages. Set to True to retain HTML tags in ! # this case. retain_pure_html_tags: False *************** *** 29,30 **** --- 29,49 ---- x-complaints-to x-face + + [TestDriver] + # These control various displays in class Drive (timtest.py). + + # Number of buckets in histograms. + nbuckets: 40 + show_histograms: True + + # Display spam when + # show_spam_lo <= spamprob <= show_spam_hi + # and likewise for ham. The defaults here don't show anything. + show_spam_lo: 1.0 + show_spam_hi: 0.0 + show_ham_lo: 1.0 + show_ham_hi: 0.0 + + show_false_positives: True + show_false_negatives: False + show_best_discriminators: True Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** timtest.py 9 Sep 2002 20:37:14 -0000 1.17 --- timtest.py 10 Sep 2002 00:06:37 -0000 1.18 *************** *** 5,9 **** # rates.py and cmp.py for summarizing results. ! """Usage: %(program)s [options] Where: --- 5,9 ---- # rates.py and cmp.py for summarizing results. ! """Usage: %(program)s [-h] -n nsets Where: *************** *** 27,31 **** import classifier from tokenizer import tokenize ! import Options def usage(code, msg=''): --- 27,33 ---- import classifier from tokenizer import tokenize ! from Options import options ! ! program = sys.argv[0] def usage(code, msg=''): *************** *** 145,154 **** class Driver: ! def __init__(self, nbuckets=40): ! self.nbuckets = nbuckets self.falsepos = Set() self.falseneg = Set() ! self.global_ham_hist = Hist(self.nbuckets) ! self.global_spam_hist = Hist(self.nbuckets) def train(self, ham, spam): --- 147,155 ---- class Driver: ! def __init__(self): self.falsepos = Set() self.falseneg = Set() ! self.global_ham_hist = Hist(options.nbuckets) ! self.global_spam_hist = Hist(options.nbuckets) def train(self, ham, spam): *************** *** 160,165 **** print t.nham, "hams &", t.nspam, "spams" ! self.trained_ham_hist = Hist(self.nbuckets) ! 
self.trained_spam_hist = Hist(self.nbuckets) #f = file('w.pik', 'wb') --- 161,166 ---- print t.nham, "hams &", t.nspam, "spams" ! self.trained_ham_hist = Hist(options.nbuckets) ! self.trained_spam_hist = Hist(options.nbuckets) #f = file('w.pik', 'wb') *************** *** 170,195 **** def finishtest(self): ! printhist("all in this training set:", ! self.trained_ham_hist, self.trained_spam_hist) self.global_ham_hist += self.trained_ham_hist self.global_spam_hist += self.trained_spam_hist def alldone(self): ! printhist("all runs:", self.global_ham_hist, self.global_spam_hist) def test(self, ham, spam, charlimit=None): c = self.classifier t = self.tester ! local_ham_hist = Hist(self.nbuckets) ! local_spam_hist = Hist(self.nbuckets) ! def new_ham(msg, prob): local_ham_hist.add(prob) ! def new_spam(msg, prob): local_spam_hist.add(prob) ! if prob < 0.1: print ! print "Low prob spam!", prob prob, clues = c.spamprob(msg, True) printmsg(msg, prob, clues, charlimit) --- 171,205 ---- def finishtest(self): ! if options.show_histograms: ! printhist("all in this training set:", ! self.trained_ham_hist, self.trained_spam_hist) self.global_ham_hist += self.trained_ham_hist self.global_spam_hist += self.trained_spam_hist def alldone(self): ! if options.show_histograms: ! printhist("all runs:", self.global_ham_hist, self.global_spam_hist) def test(self, ham, spam, charlimit=None): c = self.classifier t = self.tester ! local_ham_hist = Hist(options.nbuckets) ! local_spam_hist = Hist(options.nbuckets) ! def new_ham(msg, prob, lo=options.show_ham_lo, ! hi=options.show_ham_hi): local_ham_hist.add(prob) + if lo <= prob <= hi: + print + print "Ham with prob =", prob + prob, clues = c.spamprob(msg, True) + printmsg(msg, prob, clues, charlimit) ! def new_spam(msg, prob, lo=options.show_spam_lo, ! hi=options.show_spam_hi): local_spam_hist.add(prob) ! if lo <= prob <= hi: print ! print "Spam with prob =", prob prob, clues = c.spamprob(msg, True) printmsg(msg, prob, clues, charlimit) *************** *** 207,210 **** --- 217,222 ---- self.falsepos |= newfpos print " new false positives:", [e.tag for e in newfpos] + if not options.show_false_positives: + newfpos = () for e in newfpos: print '*' * 78 *************** *** 215,244 **** self.falseneg |= newfneg print " new false negatives:", [e.tag for e in newfneg] ! for e in []:#newfneg: print '*' * 78 prob, clues = c.spamprob(e, True) printmsg(e, prob, clues, 1000) ! print ! print " best discriminators:" ! stats = [(-1, None) for i in range(30)] ! smallest_killcount = -1 ! for w, r in c.wordinfo.iteritems(): ! if r.killcount > smallest_killcount: ! heapreplace(stats, (r.killcount, w)) ! smallest_killcount = stats[0][0] ! stats.sort() ! for count, w in stats: ! if count < 0: ! continue ! r = c.wordinfo[w] ! print " %r %d %g" % (w, r.killcount, r.spamprob) ! printhist("this pair:", local_ham_hist, local_spam_hist) self.trained_ham_hist += local_ham_hist self.trained_spam_hist += local_spam_hist def drive(nsets): ! print Options.options.display() spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)] --- 227,260 ---- self.falseneg |= newfneg print " new false negatives:", [e.tag for e in newfneg] ! if not options.show_false_negatives: ! newfneg = () ! for e in newfneg: print '*' * 78 prob, clues = c.spamprob(e, True) printmsg(e, prob, clues, 1000) ! if options.show_best_discriminators: ! print ! print " best discriminators:" ! stats = [(-1, None) for i in range(30)] ! smallest_killcount = -1 ! for w, r in c.wordinfo.iteritems(): ! if r.killcount > smallest_killcount: ! 
heapreplace(stats, (r.killcount, w)) ! smallest_killcount = stats[0][0] ! stats.sort() ! for count, w in stats: ! if count < 0: ! continue ! r = c.wordinfo[w] ! print " %r %d %g" % (w, r.killcount, r.spamprob) ! if options.show_histograms: ! printhist("this pair:", local_ham_hist, local_spam_hist) self.trained_ham_hist += local_ham_hist self.trained_spam_hist += local_spam_hist def drive(nsets): ! print options.display() spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)] *************** *** 273,277 **** if args: usage(1, "Positional arguments not supported") ! Options.options.mergefiles(['bayescustomize.ini']) drive(nsets) --- 289,295 ---- if args: usage(1, "Positional arguments not supported") + if nsets is None: + usage(1, "-n is required") ! options.mergefiles(['bayescustomize.ini']) drive(nsets) From tim_one@users.sourceforge.net Tue Sep 10 02:53:14 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 09 Sep 2002 18:53:14 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.4,1.5 timtest.py,1.18,1.19 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9100 Modified Files: Options.py timtest.py Log Message: Screwed my head on straight: Options should take care of merging in bayescustomize.ini rather than making every client muck with it. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** Options.py 10 Sep 2002 00:06:36 -0000 1.4 --- Options.py 10 Sep 2002 01:53:12 -0000 1.5 *************** *** 65,67 **** options = OptionsClass() ! options.mergefiles(['bayes.ini']) --- 65,67 ---- options = OptionsClass() ! options.mergefiles(['bayes.ini', 'bayescustomize.ini']) Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** timtest.py 10 Sep 2002 00:06:37 -0000 1.18 --- timtest.py 10 Sep 2002 01:53:12 -0000 1.19 *************** *** 292,295 **** usage(1, "-n is required") - options.mergefiles(['bayescustomize.ini']) drive(nsets) --- 292,294 ---- From tim_one@users.sourceforge.net Tue Sep 10 17:02:45 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 10 Sep 2002 09:02:45 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.5,1.6 bayes.ini,1.2,1.3 tokenizer.py,1.13,1.14 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30619 Modified Files: Options.py bayes.ini tokenizer.py Log Message: Added option Tokenizer/count_all_header_lines. Defaults to False. You can override by creating a bayescustomize.ini. When True, the safe_headers option is ignored and Anthony's code to count *all* header lines is used instead. This is almost certainly a Good Thing to do if your ham and spam come from the same source, and almost certainly a Bad Thing to do if they're from different sources (too many clues about the source are likely to appear in the header-line counts). 
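For example, if your ham and spam really do come from the same source, flipping the switch takes a two-line bayescustomize.ini (section and option names as in the Options.py diff below):

    [Tokenizer]
    count_all_header_lines: True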
Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** Options.py 10 Sep 2002 01:53:12 -0000 1.5 --- Options.py 10 Sep 2002 16:02:40 -0000 1.6 *************** *** 18,21 **** --- 18,22 ---- 'Tokenizer': {'retain_pure_html_tags': boolean_cracker, 'safe_headers': ('get', lambda s: Set(s.split())), + 'count_all_header_lines': boolean_cracker, }, 'TestDriver': {'nbuckets': int_cracker, *************** *** 28,32 **** 'show_histograms': boolean_cracker, 'show_best_discriminators': boolean_cracker, ! } } --- 29,33 ---- 'show_histograms': boolean_cracker, 'show_best_discriminators': boolean_cracker, ! }, } Index: bayes.ini =================================================================== RCS file: /cvsroot/spambayes/spambayes/bayes.ini,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** bayes.ini 10 Sep 2002 00:06:37 -0000 1.2 --- bayes.ini 10 Sep 2002 16:02:41 -0000 1.3 *************** *** 5,14 **** retain_pure_html_tags: False ! # tokenizer.Tokenizer.tokenize_headers() generates tokens just counting ! # the number of instances of the headers in this set, in a case-sensitive ! # way. Depending on data collection, some headers aren't safe to count. # For example, if ham is collected from a mailing list but spam from your # regular inbox traffic, the presence of a header like List-Info will be a ! # very strong ham clue, but a bogus one. safe_headers: abuse-reports-to date --- 5,22 ---- retain_pure_html_tags: False ! # Generate tokens just counting the number of instances of each kind of ! # header line, in a case-sensitive way. ! # ! # Depending on data collection, some headers aren't safe to count. # For example, if ham is collected from a mailing list but spam from your # regular inbox traffic, the presence of a header like List-Info will be a ! # very strong ham clue, but a bogus one. In that case, set ! # count_all_header_lines to False, and adjust safe_headers instead. ! ! count_all_header_lines: False ! ! # Like count_all_header_lines, but restricted to headers in this list. ! # safe_headers is ignored when count_all_header_lines is true. ! safe_headers: abuse-reports-to date Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** tokenizer.py 9 Sep 2002 20:37:14 -0000 1.13 --- tokenizer.py 10 Sep 2002 16:02:41 -0000 1.14 *************** *** 839,870 **** yield 'received:' + tok ! # XXX Following is a great idea due to Anthony Baxter. I can't use it ! # XXX on my test data because the header lines are so different between ! # XXX my ham and spam that it makes a large improvement for bogus ! # XXX reasons. So it's commented out. But it's clearly a good thing ! # XXX to do on "normal" data, and subsumes the Organization trick above ! # XXX in a much more general way, yet at comparable cost. ! ! # X-UIDL: ! # Anthony Baxter's idea. This has spamprob 0.99! The value ! # is clearly irrelevant, just the presence or absence matters. ! # However, it's extremely rare in my spam sets, so doesn't ! # have much value. ! # ! # As also suggested by Anthony, we can capture all such header ! # oddities just by generating tags for the count of how many ! # times each header field appears. ! ##x2n = {} ! ##for x in msg.keys(): ! ## x2n[x] = x2n.get(x, 0) + 1 ! 
##for x in x2n.items(): ! ## yield "header:%s:%d" % x ! ! # Do a "safe" approximation to that for now. ! safe_headers = options.safe_headers x2n = {} ! for x in msg.keys(): ! if x.lower() in safe_headers: x2n[x] = x2n.get(x, 0) + 1 for x in x2n.items(): yield "header:%s:%d" % x --- 839,859 ---- yield 'received:' + tok ! # As suggested by Anthony Baxter, merely counting the number of ! # header lines, and in a case-sensitive way, has really value. ! # For example, all-caps SUBJECT is a strong spam clue, while ! # X-Complaints-To a strong ham clue. x2n = {} ! if options.count_all_header_lines: ! for x in msg.keys(): x2n[x] = x2n.get(x, 0) + 1 + else: + # Do a "safe" approximation to that. When spam and ham are + # collected from different sources, the count of some header + # lines can be a too strong a discriminator for accidental + # reasons. + safe_headers = options.safe_headers + for x in msg.keys(): + if x.lower() in safe_headers: + x2n[x] = x2n.get(x, 0) + 1 for x in x2n.items(): yield "header:%s:%d" % x From tim_one@users.sourceforge.net Tue Sep 10 19:03:39 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 10 Sep 2002 11:03:39 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.6,1.7 bayes.ini,1.3,NONE Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15128 Modified Files: Options.py Removed Files: bayes.ini Log Message: Removed bayes.ini from the project and embedded its contents in Options.py. This way search-path issues can't stop the correct defaults from getting set, and people are forced to use the intended bayescustomize.ini for customization instead of fiddling bayes.ini. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** Options.py 10 Sep 2002 16:02:40 -0000 1.6 --- Options.py 10 Sep 2002 18:03:27 -0000 1.7 *************** *** 11,14 **** --- 11,74 ---- __all__ = ['options'] + defaults = """ + [Tokenizer] + # By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags + # from pure text/html messages. Set to True to retain HTML tags in + # this case. + retain_pure_html_tags: False + + # Generate tokens just counting the number of instances of each kind of + # header line, in a case-sensitive way. + # + # Depending on data collection, some headers aren't safe to count. + # For example, if ham is collected from a mailing list but spam from your + # regular inbox traffic, the presence of a header like List-Info will be a + # very strong ham clue, but a bogus one. In that case, set + # count_all_header_lines to False, and adjust safe_headers instead. + + count_all_header_lines: False + + # Like count_all_header_lines, but restricted to headers in this list. + # safe_headers is ignored when count_all_header_lines is true. + + safe_headers: abuse-reports-to + date + errors-to + from + importance + in-reply-to + message-id + mime-version + organization + received + reply-to + return-path + subject + to + user-agent + x-abuse-info + x-complaints-to + x-face + + [TestDriver] + # These control various displays in class Drive (timtest.py). + + # Number of buckets in histograms. + nbuckets: 40 + show_histograms: True + + # Display spam when + # show_spam_lo <= spamprob <= show_spam_hi + # and likewise for ham. The defaults here don't show anything. 
+ show_spam_lo: 1.0 + show_spam_hi: 0.0 + show_ham_lo: 1.0 + show_ham_hi: 0.0 + + show_false_positives: True + show_false_negatives: False + show_best_discriminators: True + """ + int_cracker = ('getint', None) float_cracker = ('getfloat', None) *************** *** 40,49 **** def mergefiles(self, fnamelist): ! c = self._config ! c.read(fnamelist) for section in c.sections(): if section not in all_options: _warn("config file has unknown section %r" % section) continue goodopts = all_options[section] --- 100,117 ---- def mergefiles(self, fnamelist): ! self._config.read(fnamelist) ! self._update() ! ! def mergefilelike(self, filelike): ! self._config.readfp(filelike) ! self._update() + def _update(self): + nerrors = 0 + c = self._config for section in c.sections(): if section not in all_options: _warn("config file has unknown section %r" % section) + nerrors += 1 continue goodopts = all_options[section] *************** *** 52,55 **** --- 120,124 ---- _warn("config file has unknown option %r in " "section %r" % (option, section)) + nerrors += 1 continue fetcher, converter = goodopts[option] *************** *** 58,61 **** --- 127,132 ---- value = converter(value) setattr(options, option, value) + if nerrors: + raise ValueError("errors while parsing .ini file") def display(self): *************** *** 66,68 **** options = OptionsClass() ! options.mergefiles(['bayes.ini', 'bayescustomize.ini']) --- 137,144 ---- options = OptionsClass() ! ! d = StringIO.StringIO(defaults) ! options.mergefilelike(d) ! del d ! ! options.mergefiles(['bayescustomize.ini']) --- bayes.ini DELETED --- From tim_one@users.sourceforge.net Tue Sep 10 19:16:42 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 10 Sep 2002 11:16:42 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.7,1.8 tokenizer.py,1.14,1.15 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18467 Modified Files: Options.py tokenizer.py Log Message: tokenize_headers(): Updated some comments. Added new Tokenizer/mine_received_headers bool option to enable Neil Schemenauer's special processing of Received: headers. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** Options.py 10 Sep 2002 18:03:27 -0000 1.7 --- Options.py 10 Sep 2002 18:15:48 -0000 1.8 *************** *** 51,54 **** --- 51,59 ---- x-face + # A lot of clues can be gotten from IP addresses and names in Received: + # headers. Again this can give spectacular results for bogus reasons + # if your test corpora are from different sources. Else set this to true. + mine_received_headers: False + [TestDriver] # These control various displays in class Drive (timtest.py). *************** *** 79,82 **** --- 84,88 ---- 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, + 'mine_received_headers': boolean_cracker, }, 'TestDriver': {'nbuckets': int_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** tokenizer.py 10 Sep 2002 16:02:41 -0000 1.14 --- tokenizer.py 10 Sep 2002 18:15:49 -0000 1.15 *************** *** 783,786 **** --- 783,788 ---- # XXX some "safe" header lines are included here, where "safe" # XXX is specific to my sorry corpora. 
+ # XXX Jeremy Hylton also reported good results from the general + # XXX header-mining in mboxtest.MyTokenizer.tokenize_headers. # Content-{Type, Disposition} and their params, and charsets. *************** *** 815,823 **** # X-Mailer: This is a pure and significant win for the f-n rate; f-p # rate isn't affected. - # User-Agent: Skipping it, as it made no difference. Very few spams - # had a User-Agent field, but lots of hams didn't either, - # and the spam probability of User-Agent was very close to - # 0.5 (== not a valuable discriminator) across all - # training sets. for field in ('x-mailer',): prefix = field + ':' --- 817,820 ---- *************** *** 826,834 **** # Received: ! # Neil Schemenauer reported good results from tokenizing prefixes ! # of the embedded IP addresses. ! # XXX This is disabled only because it's "too good" when used on ! # XXX Tim's mixed-source corpora. ! if 0: for header in msg.get_all("received", ()): for pat, breakdown in [(received_host_re, breakdown_host), --- 823,828 ---- # Received: ! # Neil Schemenauer reports good results from this. ! if options.mine_received_headers: for header in msg.get_all("received", ()): for pat, breakdown in [(received_host_re, breakdown_host), *************** *** 840,844 **** # As suggested by Anthony Baxter, merely counting the number of ! # header lines, and in a case-sensitive way, has really value. # For example, all-caps SUBJECT is a strong spam clue, while # X-Complaints-To a strong ham clue. --- 834,838 ---- # As suggested by Anthony Baxter, merely counting the number of ! # header lines, and in a case-sensitive way, has real value. # For example, all-caps SUBJECT is a strong spam clue, while # X-Complaints-To a strong ham clue. From tim_one@users.sourceforge.net Wed Sep 11 01:22:59 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 10 Sep 2002 17:22:59 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.8,1.9 timtest.py,1.19,1.20 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv26264 Modified Files: Options.py timtest.py Log Message: Added options [TestDriver] save_trained_pickles: False pickle_basename: class Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** Options.py 10 Sep 2002 18:15:48 -0000 1.8 --- Options.py 11 Sep 2002 00:22:56 -0000 1.9 *************** *** 57,61 **** [TestDriver] ! # These control various displays in class Drive (timtest.py). # Number of buckets in histograms. --- 57,61 ---- [TestDriver] ! # These control various displays in class Driver (timtest.py). # Number of buckets in histograms. *************** *** 74,77 **** --- 74,88 ---- show_false_negatives: False show_best_discriminators: True + + # If save_trained_pickles is true, Driver.train() saves a binary pickle + # of the classifier after training. The file basename is given by + # pickle_basename, the extension is .pik, and increasing integers are + # appended to pickle_basename. By default (if save_trained_pickles is + # true), the filenames are class1.pik, class2.pik, ... If a file of that + # name already exists, it's overwritten. pickle_basename is ignored when + # save_trained_pickles is false. 
+ + save_trained_pickles: False + pickle_basename: class """ *************** *** 79,82 **** --- 90,94 ---- float_cracker = ('getfloat', None) boolean_cracker = ('getboolean', bool) + string_cracker = ('get', None) all_options = { *************** *** 95,98 **** --- 107,112 ---- 'show_histograms': boolean_cracker, 'show_best_discriminators': boolean_cracker, + 'save_trained_pickles': boolean_cracker, + 'pickle_basename': string_cracker, }, } Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.19 retrieving revision 1.20 diff -C2 -d -r1.19 -r1.20 *** timtest.py 10 Sep 2002 01:53:12 -0000 1.19 --- timtest.py 11 Sep 2002 00:22:56 -0000 1.20 *************** *** 152,155 **** --- 152,156 ---- self.global_ham_hist = Hist(options.nbuckets) self.global_spam_hist = Hist(options.nbuckets) + self.ntimes_train_called = 0 def train(self, ham, spam): *************** *** 164,172 **** self.trained_spam_hist = Hist(options.nbuckets) ! #f = file('w.pik', 'wb') ! #pickle.dump(self.classifier, f, 1) ! #f.close() ! #import sys ! #sys.exit(0) def finishtest(self): --- 165,176 ---- self.trained_spam_hist = Hist(options.nbuckets) ! self.ntimes_train_called += 1 ! if options.save_trained_pickles: ! fname = "%s%d.pik" % (options.pickle_basename, ! self.ntimes_train_called) ! print " saving pickle to", fname ! fp = file(fname, 'wb') ! pickle.dump(self.classifier, fp, 1) ! fp.close() def finishtest(self): From rubiconx@users.sourceforge.net Wed Sep 11 07:21:25 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Tue, 10 Sep 2002 23:21:25 -0700 Subject: [Spambayes-checkins] spambayes cdb.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9505 Modified Files: cdb.py Log Message: Added some more dict-like methods to the Cdb class, and a cdb_dump function that generates output identical to djb's cdbdump program. Index: cdb.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cdb.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** cdb.py 9 Sep 2002 21:21:54 -0000 1.1 --- cdb.py 11 Sep 2002 06:21:22 -0000 1.2 *************** *** 1,2 **** --- 1,3 ---- + #! /usr/bin/env python """ Dan Bernstein's CDB implemented in Python *************** *** 28,34 **** --- 29,37 ---- def __init__(self, fp): + self.fp = fp fd = fp.fileno() self.size = os.fstat(fd).st_size self.map = mmap.mmap(fd, self.size, access=mmap.ACCESS_READ) + self.eod = uint32_unpack(self.map[:4]) self.findstart() self.loop = 0 # number of hash slots searched under this key *************** *** 44,47 **** --- 47,92 ---- self.map.close() + def __iter__(self, fn=None): + len = 2048 + ret = [] + while len < self.eod: + klen, vlen = struct.unpack("%s" % (len(key), len(value), key, value) + print def cdb_make(outfile, items): From rubiconx@users.sourceforge.net Wed Sep 11 07:58:06 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Tue, 10 Sep 2002 23:58:06 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.15,1.16 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31437 Modified Files: tokenizer.py Log Message: textparts() now makes a copy of payloads. This keeps the tokenizer from fouling up the message object's payload(s). 
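The aliasing bug this fixes is worth spelling out. For a multipart message, the stock email package's get_payload() returns the message's own list of subparts, not a copy, so popping from it destructively empties the message itself. A minimal standalone reproduction (not spambayes code; the two-part message is made up for illustration):

    import email

    msg = email.message_from_string(
        'Content-Type: multipart/alternative; boundary="b"\n'
        '\n'
        '--b\n'
        'Content-Type: text/plain\n'
        '\n'
        'plain\n'
        '--b\n'
        'Content-Type: text/html\n'
        '\n'
        '<p>html</p>\n'
        '--b--\n')

    stack = msg.get_payload()       # the live subpart list, not a copy
    while stack:
        stack.pop()                 # silently guts the message itself
    assert msg.get_payload() == []  # both subparts are gone

Hence the one-character fix in the diff below: iterate over a copy via get_payload()[:] and leave the message alone.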
Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** tokenizer.py 10 Sep 2002 18:15:49 -0000 1.15 --- tokenizer.py 11 Sep 2002 06:58:03 -0000 1.16 *************** *** 506,510 **** # part to redundant_html. htmlpart = textpart = None ! stack = part.get_payload() while stack: subpart = stack.pop() --- 506,510 ---- # part to redundant_html. htmlpart = textpart = None ! stack = part.get_payload()[:] while stack: subpart = stack.pop() From tim_one@users.sourceforge.net Thu Sep 12 01:16:09 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Wed, 11 Sep 2002 17:16:09 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.16,1.17 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2055 Modified Files: tokenizer.py Log Message: Added code to strip uuencoded sections. As reported on the mailing list, this has no effect on my results, except that one spam in now judged as ham by all the other training sets. It shrinks the database size by a few percent, so that makes it a tiny win. If Anthony Baxter doesn't report better results on his data, I'll be sorely tempted to throw this out again. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** tokenizer.py 11 Sep 2002 06:58:03 -0000 1.16 --- tokenizer.py 12 Sep 2002 00:16:07 -0000 1.17 *************** *** 747,750 **** --- 747,787 ---- yield '.'.join(parts[:i]) + uuencode_begin_re = re.compile(r""" + ^begin \s+ + (\S+) \s+ # capture mode + (\S+) \s* # capture filename + $ + """, re.VERBOSE | re.MULTILINE) + + uuencode_end_re = re.compile(r"^end\s*\n", re.MULTILINE) + + # Strip out uuencoded sections and produce tokens. The return value + # is (new_text, sequence_of_tokens), where new_text no longer contains + # uuencoded stuff. Note that we're not bothering to decode it! Maybe + # we should. + def crack_uuencode(text): + new_text = [] + tokens = [] + i = 0 + while True: + # Invariant: Through text[:i], all non-uuencoded text is in + # new_text, and tokens contains summary clues for all uuencoded + # portions. text[i:] hasn't been looked at yet. + m = uuencode_begin_re.search(text, i) + if not m: + new_text.append(text[i:]) + break + start, end = m.span() + new_text.append(text[i : start]) + mode, fname = m.groups() + tokens.append('uuencode mode:%s' % mode) + tokens.extend(['uuencode:%s' % x for x in crack_filename(fname)]) + m = uuencode_end_re.search(text, end) + if not m: + break + i = m.end() + + return ''.join(new_text), tokens + class Tokenizer: *************** *** 881,884 **** --- 918,926 ---- # Normalize case. text = text.lower() + + # Get rid of uuencoded sections. + text, tokens = crack_uuencode(text) + for t in tokens: + yield t # Special tagging of embedded URLs. 
From tim_one@users.sourceforge.net Thu Sep 12 03:46:17 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Wed, 11 Sep 2002 19:46:17 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.9,1.10 mboxtest.py,1.2,1.3timtest.py,1.20,1.21 tokenizer.py,1.17,1.18 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5892 Modified Files: Options.py mboxtest.py timtest.py tokenizer.py Log Message: Added option TestDriver/show_charlimit to put a bound on the length of displayed msgs. Default is 5000. The similar cmdline option to mboxtest has gone away. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** Options.py 11 Sep 2002 00:22:56 -0000 1.9 --- Options.py 12 Sep 2002 02:46:15 -0000 1.10 *************** *** 75,78 **** --- 75,82 ---- show_best_discriminators: True + # The maximum # of characters to display for a msg displayed due to the + # show_xyz options above. + show_charlimit: 3000 + # If save_trained_pickles is true, Driver.train() saves a binary pickle # of the classifier after training. The file basename is given by *************** *** 109,112 **** --- 113,117 ---- 'save_trained_pickles': boolean_cracker, 'pickle_basename': string_cracker, + 'show_charlimit': int_cracker, }, } Index: mboxtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** mboxtest.py 7 Sep 2002 16:17:19 -0000 1.2 --- mboxtest.py 12 Sep 2002 02:46:15 -0000 1.3 *************** *** 17,23 **** -m MSGS Read no more than MSGS messages from mailbox. - - -l LIMIT - Print no more than LIMIT characters of a message in test output. """ --- 17,20 ---- *************** *** 137,142 **** SEED = 101 MAXMSGS = None ! CHARLIMIT = 1000 ! opts, args = getopt.getopt(args, "f:n:s:l:m:") for k, v in opts: if k == '-f': --- 134,138 ---- SEED = 101 MAXMSGS = None ! opts, args = getopt.getopt(args, "f:n:s:m:") for k, v in opts: if k == '-f': *************** *** 146,151 **** if k == '-s': SEED = int(v) - if k == '-l': - CHARLIMIT = int(v) if k == '-m': MAXMSGS = int(v) --- 142,145 ---- *************** *** 177,181 **** if (iham, ispam) == (ihtest, istest): continue ! driver.test(mbox(ham, ihtest), mbox(spam, istest), CHARLIMIT) driver.finishtest() driver.alldone() --- 171,175 ---- if (iham, ispam) == (ihtest, istest): continue ! driver.test(mbox(ham, ihtest), mbox(spam, istest)) driver.finishtest() driver.alldone() Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** timtest.py 11 Sep 2002 00:22:56 -0000 1.20 --- timtest.py 12 Sep 2002 02:46:15 -0000 1.21 *************** *** 81,85 **** spam.display() ! def printmsg(msg, prob, clues, charlimit=None): print msg.tag print "prob =", prob --- 81,85 ---- spam.display() ! def printmsg(msg, prob, clues): print msg.tag print "prob =", prob *************** *** 88,93 **** print guts = str(msg) ! if charlimit is not None: ! guts = guts[:charlimit] print guts --- 88,93 ---- print guts = str(msg) ! if options.show_charlimit > 0: ! guts = guts[:options.show_charlimit] print guts *************** *** 185,189 **** printhist("all runs:", self.global_ham_hist, self.global_spam_hist) ! 
def test(self, ham, spam, charlimit=None): c = self.classifier t = self.tester --- 185,189 ---- printhist("all runs:", self.global_ham_hist, self.global_spam_hist) ! def test(self, ham, spam): c = self.classifier t = self.tester *************** *** 198,202 **** print "Ham with prob =", prob prob, clues = c.spamprob(msg, True) ! printmsg(msg, prob, clues, charlimit) def new_spam(msg, prob, lo=options.show_spam_lo, --- 198,202 ---- print "Ham with prob =", prob prob, clues = c.spamprob(msg, True) ! printmsg(msg, prob, clues) def new_spam(msg, prob, lo=options.show_spam_lo, *************** *** 207,211 **** print "Spam with prob =", prob prob, clues = c.spamprob(msg, True) ! printmsg(msg, prob, clues, charlimit) t.reset_test_results() --- 207,211 ---- print "Spam with prob =", prob prob, clues = c.spamprob(msg, True) ! printmsg(msg, prob, clues) t.reset_test_results() *************** *** 226,230 **** print '*' * 78 prob, clues = c.spamprob(e, True) ! printmsg(e, prob, clues, charlimit) newfneg = Set(t.false_negatives()) - self.falseneg --- 226,230 ---- print '*' * 78 prob, clues = c.spamprob(e, True) ! printmsg(e, prob, clues) newfneg = Set(t.false_negatives()) - self.falseneg Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** tokenizer.py 12 Sep 2002 00:16:07 -0000 1.17 --- tokenizer.py 12 Sep 2002 02:46:15 -0000 1.18 *************** *** 613,617 **** for i in xrange(n-4): yield "5g:" + word[i : i+5] ! else: # It's a long string of "normal" chars. Ignore it. --- 613,634 ---- for i in xrange(n-4): yield "5g:" + word[i : i+5] ! """ ! # If there are any high-bit chars, tokenize it as byte 3-grams. ! # XXX This really won't work for high-bit languages -- the scoring ! # XXX scheme throws almost everything away, and one bad phrase can ! # XXX generate enough bad 3-grams to dominate the final score. ! # XXX This also increases the database size substantially. ! elif has_highbit_char(word): ! counthi = 0 ! ch1 = ch2 = '' ! for ch in word: ! if ord(ch) >= 128: ! counthi += 1 ! yield "3g:%s" % (ch1 + ch2 + ch) ! ch1 = ch2 ! ch2 = ch ! ratio = round(counthi * 20.0 / len(word)) * 5 ! yield "8bit%%:%d" % ratio ! """ else: # It's a long string of "normal" chars. Ignore it. From tim_one@users.sourceforge.net Thu Sep 12 03:58:04 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Wed, 11 Sep 2002 19:58:04 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.21,1.22 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8497 Modified Files: timtest.py Log Message: Missed a call to printmsg that was still passing a charlimit (show_charlimit is an option now). Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** timtest.py 12 Sep 2002 02:46:15 -0000 1.21 --- timtest.py 12 Sep 2002 02:58:02 -0000 1.22 *************** *** 236,240 **** print '*' * 78 prob, clues = c.spamprob(e, True) ! printmsg(e, prob, clues, 1000) if options.show_best_discriminators: --- 236,240 ---- print '*' * 78 prob, clues = c.spamprob(e, True) ! 
printmsg(e, prob, clues) if options.show_best_discriminators: From tim_one@users.sourceforge.net Thu Sep 12 05:19:41 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Wed, 11 Sep 2002 21:19:41 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.18,1.19 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv26685 Modified Files: tokenizer.py Log Message: Two things: 1) Gave up on 5-gram'ming of long words w/ high-bit chars. This approach didn't make sense for high-bit languages regardless, and the results here show it wasn't doing any good that couldn't be gotten cheaper. There may even be a slight f-n rate improvement now. This also chops about 2MB off the database size on my runs. 2) Removed http:// etc thingies; they're already getting parsed specially. Leaving them in the body of the text was likely to lead to redundant "skip:< nn" and "skip:h nn" tokens, giving an artificial boost (whether towards ham or spam doesn't matter) to msgs simply containing URLs. I still need to fix now-out-of-date comments. false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.050 0.050 tied 0.000 0.000 tied 0.025 0.025 tied 0.000 0.000 tied 0.075 0.075 tied 0.025 0.025 tied 0.025 0.025 tied 0.000 0.000 tied 0.050 0.050 tied 0.000 0.000 tied 0.025 0.025 tied 0.000 0.000 tied 0.000 0.000 tied 0.050 0.050 tied 0.025 0.025 tied 0.000 0.000 tied 0.025 0.025 tied 0.050 0.050 tied won 0 times tied 20 times lost 0 times total unique fp went from 8 to 8 tied false negative percentages 0.255 0.218 won -14.51% 0.364 0.364 tied 0.291 0.291 tied 0.509 0.509 tied 0.436 0.400 won -8.26% 0.218 0.218 tied 0.218 0.218 tied 0.582 0.582 tied 0.327 0.291 won -11.01% 0.255 0.255 tied 0.291 0.291 tied 0.582 0.582 tied 0.545 0.545 tied 0.255 0.255 tied 0.291 0.255 won -12.37% 0.400 0.400 tied 0.291 0.291 tied 0.218 0.218 tied 0.218 0.182 won -16.51% 0.182 0.145 won -20.33% won 6 times tied 14 times lost 0 times total unique fn went from 90 to 86 won -4.44% Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** tokenizer.py 12 Sep 2002 02:46:15 -0000 1.18 --- tokenizer.py 12 Sep 2002 04:19:38 -0000 1.19 *************** *** 588,592 **** def tokenize_word(word, _len=len): n = _len(word) - # Make sure this range matches in tokenize(). if 3 <= n <= 12: --- 588,591 ---- *************** *** 604,638 **** yield 'email addr:' + p2 - # If there are any high-bit chars, - # tokenize it as byte 5-grams. - # XXX This really won't work for high-bit languages -- the scoring - # XXX scheme throws almost everything away, and one bad phrase can - # XXX generate enough bad 5-grams to dominate the final score. - # XXX This also increases the database size substantially. - elif has_highbit_char(word): - for i in xrange(n-4): - yield "5g:" + word[i : i+5] - """ - # If there are any high-bit chars, tokenize it as byte 3-grams. - # XXX This really won't work for high-bit languages -- the scoring - # XXX scheme throws almost everything away, and one bad phrase can - # XXX generate enough bad 3-grams to dominate the final score. - # XXX This also increases the database size substantially. 
- elif has_highbit_char(word): - counthi = 0 - ch1 = ch2 = '' - for ch in word: - if ord(ch) >= 128: - counthi += 1 - yield "3g:%s" % (ch1 + ch2 + ch) - ch1 = ch2 - ch2 = ch - ratio = round(counthi * 20.0 / len(word)) * 5 - yield "8bit%%:%d" % ratio - """ else: - # It's a long string of "normal" chars. Ignore it. - # For example, it may be an embedded URL (which we already - # tagged), or a uuencoded line. # There's value in generating a token indicating roughly how # many chars were skipped. This has real benefit for the f-n --- 603,607 ---- *************** *** 641,644 **** --- 610,619 ---- # XXX this info has greater benefit. yield "skip:%c %d" % (word[0], n // 10 * 10) + if has_highbit_char(word): + hicount = 0 + for i in map(ord, word): + if i >= 128: + hicount += 1 + yield "8bit%%:%d" % round(hicount * 100.0 / len(word)) # Generate tokens for: *************** *** 801,804 **** --- 776,814 ---- return ''.join(new_text), tokens + def crack_urls(text): + new_text = [] + clues = [] + pushclue = clues.append + i = 0 + while True: + # Invariant: Through text[:i], all non-URL text is in new_text, and + # clues contains clues for all URLs. text[i:] hasn't been looked at + # yet. + m = url_re.search(text, i) + if not m: + new_text.append(text[i:]) + break + proto, guts = m.groups() + start, end = m.span() + new_text.append(text[i : start]) + new_text.append(' ') + + pushclue("proto:" + proto) + # Lose the trailing punctuation for casual embedding, like: + # The code is at http://mystuff.org/here? Didn't resolve. + # or + # I found it at http://mystuff.org/there/. Thanks! + assert guts + while guts and guts[-1] in '.:?!/': + guts = guts[:-1] + for i, piece in enumerate(guts.split('/')): + prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1') + for chunk in urlsep_re.split(piece): + pushclue(prefix + chunk) + + i = end + + return ''.join(new_text), clues + class Tokenizer: *************** *** 942,958 **** # Special tagging of embedded URLs. ! for proto, guts in url_re.findall(text): ! yield "proto:" + proto ! # Lose the trailing punctuation for casual embedding, like: ! # The code is at http://mystuff.org/here? Didn't resolve. ! # or ! # I found it at http://mystuff.org/there/. Thanks! ! assert guts ! while guts and guts[-1] in '.:?!/': ! guts = guts[:-1] ! for i, piece in enumerate(guts.split('/')): ! prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1') ! for chunk in urlsep_re.split(piece): ! yield prefix + chunk # Anthony Baxter reported goodness from tokenizing src= params. --- 952,958 ---- # Special tagging of embedded URLs. ! text, tokens = crack_urls(text) ! for t in tokens: ! yield t # Anthony Baxter reported goodness from tokenizing src= params. From gvanrossum@users.sourceforge.net Thu Sep 12 06:10:04 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Wed, 11 Sep 2002 22:10:04 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.15,1.16 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv6989 Modified Files: hammie.py Log Message: Use the _mh_msgno feature I just added to Python 2.3's mailbox.MHMailbox class, if available, to report the correct message number for spams in -u mode. 
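For what it's worth, the hasattr() test in the diff below collapses to a one-line getattr() with a default:

    msgno = getattr(msg, '_mh_msgno', i)

Either way, messages that don't carry _mh_msgno (everything but MHMailbox under a new-enough Python) fall back to the loop counter.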
Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** hammie.py 8 Sep 2002 03:20:18 -0000 1.15 --- hammie.py 12 Sep 2002 05:10:02 -0000 1.16 *************** *** 251,257 **** prob, clues = bayes.spamprob(tokenize(msg), True) isspam = prob >= 0.9 if isspam: spams += 1 ! print "%6s %4.2f %1s" % (i, prob, isspam and "S" or "."), print formatclues(clues) else: --- 251,261 ---- prob, clues = bayes.spamprob(tokenize(msg), True) isspam = prob >= 0.9 + if hasattr(msg, '_mh_msgno'): + msgno = msg._mh_msgno + else: + msgno = i if isspam: spams += 1 ! print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."), print formatclues(clues) else: From anthony@interlink.com.au Thu Sep 12 08:13:20 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Thu, 12 Sep 2002 17:13:20 +1000 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.16,1.17 In-Reply-To: Message-ID: <200209120713.g8C7DLj24609@localhost.localdomain> >>> "Tim Peters" wrote > Modified Files: > tokenizer.py > Log Message: > Added code to strip uuencoded sections. As reported on the mailing list, > this has no effect on my results, except that one spam in now judged as > ham by all the other training sets. It shrinks the database size by a > few percent, so that makes it a tiny win. If Anthony Baxter doesn't > report better results on his data, I'll be sorely tempted to throw this > out again. I'd say nuke it: anthony_tok1.16s -> anthony_tok1.17s false positive percentages 0.778 0.778 tied 0.834 0.778 won -6.71% 0.890 0.890 tied 0.667 0.611 won -8.40% 1.112 1.112 tied 0.834 0.834 tied 0.723 0.723 tied 0.667 0.611 won -8.40% 1.167 1.167 tied 1.001 1.001 tied 0.779 0.779 tied 0.667 0.611 won -8.40% 0.778 0.778 tied 0.778 0.778 tied 0.556 0.556 tied 0.778 0.723 won -7.07% 0.611 0.611 tied 0.778 0.778 tied 0.723 0.723 tied 0.667 0.667 tied won 5 times tied 15 times lost 0 times total unique fp went from 143 to 141 won -1.40% false negative percentages 0.646 0.646 tied 0.904 0.904 tied 0.517 0.581 lost +12.38% 1.229 1.229 tied 0.840 0.840 tied 1.033 1.033 tied 0.711 0.775 lost +9.00% 1.164 1.164 tied 0.646 0.646 tied 0.711 0.711 tied 0.646 0.711 lost +10.06% 0.517 0.517 tied 0.776 0.776 tied 0.646 0.646 tied 0.904 0.904 tied 1.035 1.035 tied 0.582 0.582 tied 0.581 0.581 tied 0.775 0.775 tied 0.646 0.646 tied won 0 times tied 17 times lost 3 times From rubiconx@users.sourceforge.net Thu Sep 12 08:24:55 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Thu, 12 Sep 2002 00:24:55 -0700 Subject: [Spambayes-checkins] spambayes cdbhammie.py,NONE,1.1 cdbwrap.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5165 Added Files: cdbhammie.py cdbwrap.py Log Message: A version of hammie to use CDB. Something may be wrong with it--the databases it creates are *gargantuan*. But it works. --- NEW FILE: cdbhammie.py --- #! /usr/bin/env python # At the moment, this requires Python 2.3 from CVS # A driver for the classifier module and Tim's tokenizer that you can # call from procmail. This one uses Neil's cdb module. Will it be # faster than Berkeley DB hashes? """Usage: %(program)s [options] Where: -h show usage and exit -g PATH mbox or directory of known good messages (non-spam) to train on. -s PATH mbox or directory of known spam messages to train on. -u PATH mbox of unknown messages. A ham/spam decision is reported for each. 
-p FILE use file as the persistent store. loads data from this file if it exists, and saves data to this file at the end. Default: %(DEFAULTDB)s -f run as a filter: read a single message from stdin, add an %(DISPHEADER)s header, and write it to stdout. """ import sys import os import getopt import mailbox import glob import email import classifier import errno import cdb import cPickle as pickle program = sys.argv[0] # For usage(); referenced by docstring above # Name of the header to add in filter mode DISPHEADER = "X-Hammie-Disposition" # Default database name DEFAULTDB = "hammie.db" # Tim's tokenizer kicks far more booty than anything I would have # written. Score one for analysis ;) from tokenizer import tokenize from cdbwrap import CDBShelf class CDBDict(CDBShelf): """Constant Database Dictionary This wraps a cdb to make it look even more like a dictionary. Call it with the name of your database file. Optionally, you can specify a list of keys to skip when iterating. This only affects iterators; things like .keys() still list everything. For instance: >>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo')) >>> d['skipme'] = 'booga' >>> d['countme'] = 'wakka' >>> print d.keys() ['skipme', 'countme'] >>> for k in d.iterkeys(): ... print k countme """ def __init__(self, dbname, iterskip=()): CDBShelf.__init__(self, dbname) self.iterskip = iterskip def __iter__(self, fn=lambda k,v: (k,v)): for key in self.dict.iterkeys(): val = self.get(key) if key not in self.iterskip: yield fn(key, val) def __setitem__(self, key, value): v = pickle.dumps(value, 1) self.dict[key] = v def iteritems(self): return self.__iter__() def iterkeys(self): return self.__iter__(lambda k,v: k) def itervalues(self): return self.__iter__(lambda k,v: v) def items(self): ret = [] for i in self.iteritems(): ret.append(i) return ret def keys(self): ret = [] for i in self.iterkeys(): ret.append(i) return ret def values(self): ret = [] for i in self.itervalues(): ret.append(i) return ret def __contains__(self, name): return self.has_key(name) class PersistentGrahamBayes(classifier.GrahamBayes): """A persistent GrahamBayes classifier This is just like classifier.GrahamBayes, except that the dictionary is a database. You take less disk this way, I think, and you can pretend it's persistent. It's much slower training, but much faster checking, and takes less memory all around. On destruction, an instantiation of this class will write it's state to a special key. When you instantiate a new one, it will attempt to read these values out of that key again, so you can pick up where you left off. """ # XXX: Would it be even faster to remember (in a list) which keys # had been modified, and only recalculate those keys? No sense in # going over the entire word database if only 100 words are # affected. # XXX: Another idea: cache stuff in memory. But by then maybe we # should just use ZODB. 
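    # XXX: A sketch of the dirty-word idea above (hypothetical, not wired
    # in): have learn() remember which words it touched, then sweep only
    # those instead of the whole database, roughly:
    #
    #     self._dirty = Set()              # words changed since last update
    #     ...learn() adds each touched word to self._dirty...
    #     def update_probabilities(self):
    #         for word in self._dirty:
    #             recompute just wordinfo[word]
    #         self._dirty.clear()
    #
    # Whether the bookkeeping pays off depends on how many of the
    # database's words a typical training batch actually touches.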
def __init__(self, dbname): classifier.GrahamBayes.__init__(self) self.statekey = "saved state" self.wordinfo = CDBDict(dbname, (self.statekey,)) self.restore_state() def __del__(self): #super.__del__(self) self.save_state() def save_state(self): self.wordinfo[self.statekey] = (self.nham, self.nspam) def restore_state(self): if self.wordinfo.has_key(self.statekey): self.nham, self.nspam = self.wordinfo[self.statekey] class DirOfTxtFileMailbox: """Mailbox directory consisting of .txt files.""" def __init__(self, dirname, factory): self.names = glob.glob(os.path.join(dirname, "*.txt")) self.factory = factory def __iter__(self): for name in self.names: try: f = open(name) except IOError: continue yield self.factory(f) f.close() def getmbox(msgs): """Return an iterable mbox object given a file/directory/folder name.""" def _factory(fp): try: return email.message_from_file(fp) except email.Errors.MessageParseError: return '' if msgs.startswith("+"): import mhlib mh = mhlib.MH() mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]), _factory) elif os.path.isdir(msgs): # XXX Bogus: use an MHMailbox if the pathname contains /Mail/, # else a DirOfTxtFileMailbox. if msgs.find("/Mail/") >= 0: mbox = mailbox.MHMailbox(msgs, _factory) else: mbox = DirOfTxtFileMailbox(msgs, _factory) else: fp = open(msgs) mbox = mailbox.PortableUnixMailbox(fp, _factory) return mbox def train(bayes, msgs, is_spam): """Train bayes with all messages from a mailbox.""" mbox = getmbox(msgs) i = 0 for msg in mbox: i += 1 # XXX: Is the \r a Unixism? I seem to recall it working in DOS # back in the day. Maybe it's a line-printer-ism ;) sys.stdout.write("\r%6d" % i) sys.stdout.flush() bayes.learn(tokenize(msg), is_spam, False) print def formatclues(clues, sep="; "): """Format the clues into something readable.""" return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) def filter(bayes, input, output): """Filter (judge) a message""" msg = email.message_from_file(input) prob, clues = bayes.spamprob(tokenize(msg), True) if prob < 0.9: disp = "No" else: disp = "Yes" disp += "; %.2f" % prob disp += "; " + formatclues(clues) msg.add_header(DISPHEADER, disp) output.write(msg.as_string(unixfrom=(msg.get_unixfrom() is not None))) def score(bayes, msgs): """Score (judge) all messages from a mailbox.""" # XXX The reporting needs work! 
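    # (As written, the loop below prints one line per judged-spam
    # message -- message number, probability, an "S" flag, and its
    # clues -- counts hams silently, and finishes with a single
    # "Total %d spam, %d ham" line.)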
mbox = getmbox(msgs) i = 0 spams = hams = 0 for msg in mbox: i += 1 prob, clues = bayes.spamprob(tokenize(msg), True) isspam = prob >= 0.9 if isspam: spams += 1 print "%6s %4.2f %1s" % (i, prob, isspam and "S" or "."), print formatclues(clues) else: hams += 1 print "Total %d spam, %d ham" % (spams, hams) def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def main(): """Main program; parse options and go.""" try: opts, args = getopt.getopt(sys.argv[1:], 'hdfg:s:p:u:') except getopt.error, msg: usage(2, msg) if not opts: usage(2, "No options given") pck = DEFAULTDB good = spam = unknown = None do_filter = usedb = False for opt, arg in opts: if opt == '-h': usage(0) elif opt == '-g': good = arg elif opt == '-s': spam = arg elif opt == '-p': pck = arg elif opt == "-d": usedb = True elif opt == "-f": do_filter = True elif opt == '-u': unknown = arg if args: usage(2, "Positional arguments not allowed") save = False if usedb: bayes = PersistentGrahamBayes(pck) else: bayes = None try: fp = open(pck, 'rb') except IOError, e: if e.errno <> errno.ENOENT: raise else: bayes = pickle.load(fp) fp.close() if bayes is None: bayes = classifier.GrahamBayes() if good: print "Training ham:" train(bayes, good, False) save = True if spam: print "Training spam:" train(bayes, spam, True) save = True if save: bayes.update_probabilities() if not usedb and pck: fp = open(pck, 'wb') pickle.dump(bayes, fp, 1) fp.close() if do_filter: filter(bayes, sys.stdin, sys.stdout) if unknown: score(bayes, unknown) if __name__ == "__main__": main() --- NEW FILE: cdbwrap.py --- #! /usr/bin/env python import cdb import tempfile import struct import time import os import shelve from sets import Set class DELITEM: # Special class to signify a deleted item pass class CDBDict: def __init__(self, filename): self.filename = filename try: self.fp = open(filename, "rb") self.db = cdb.Cdb(self.fp) except: self.fp = None self.db = {} self.cache = {} self.newkeys = [] def __delitem__(self, key): self[key] = DELITEM def __getitem__(self, key): val = self.cache.get(key) if val is DELITEM: raise KeyError, key if not val and self.db: val = self.db[key] return val def __setitem__(self, key, val): self.cache[key] = val if not self.db.get(key): self.newkeys.append(key) def __del__(self): if self.cache: import cdb if 1: newf = "%s.txt" % self.filename fp = open(newf, "wb") for key,value in self.iteritems(): fp.write("+%d,%d:%s->%s\n" % (len(key), len(value), key, value)) fp.write("\n") fp.close() else: # XXX: security risk, but how to do this without the symlink # problem? 
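            # (One standard dodge, sketched under the assumption that the
            # target directory is writable: tempfile.mkstemp() -- Python
            # 2.3+ -- opens the file O_EXCL under an unpredictable name,
            # which defeats a pre-planted symlink, and os.rename() then
            # swaps it into place:
            #
            #     fd, newf = tempfile.mkstemp(
            #         dir=os.path.dirname(self.filename) or '.')
            #     fp = os.fdopen(fd, "wb")
            #     cdb.cdb_make(fp, self.iteritems())
            #     fp.close()
            #     os.rename(newf, self.filename)
            #
            # tempfile is already imported at the top of this module.)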
newf = "%s-%f" % (self.filename, time.time()) fp = open(newf, "wb") cdb.cdb_make(fp, self.iteritems()) fp.close() os.rename(newf, self.filename) def __iter__(self, fn=lambda k,v: (k,v)): for key in self.newkeys: val = self.cache[key] if val is DELITEM: continue else: yield fn(key, val) for key,val in self.db.iteritems(): nval = self.cache.get(key) if nval: if nval is DELITEM: continue else: yield fn(key, nval) else: yield fn(key, val) def __contains__(self, key): return self.has_key(key) def iteritems(self): return self.__iter__() def iterkeys(self): return self.__iter__(lambda k,v: k) def itervalues(self): return self.__iter__(lambda k,v: v) def items(self): ret = [] for i in self.iteritems(): ret.append(i) return ret def keys(self): ret = [] for i in self.iterkeys(): ret.append(i) return ret def values(self): ret = [] for i in self.itervalues(): ret.append(i) return ret def get(self, key, default=None): try: val = self[key] except KeyError: val = default return val def has_key(self, key): return self.get(key) and True class CDBShelf(shelve.Shelf): """Shelf implementation using a Constant Database. This is initialized with the filename for the CDB database. See the shelf module's __doc__ string for an overview of the interface. """ def __init__(self, filename, flag='c'): db = CDBDict(filename) shelve.Shelf.__init__(self, db) def test_shelf(): s = CDBShelf("shelf.cdb") print "foo ->", s.get("foo") s["foo"] = s.get("foo", 1.0) + .1 print "foo ->", s.get("foo") def test_dict(): db = CDBDict("services.cdb") one = db.get("1") if one: print 'db["1"] == %s; deleting' % one del db["1"] else: print 'db["1"] not set; setting' db["1"] = "One" print "New value is", db.get("1") if __name__ == "__main__": test_shelf() test_dict() From rubiconx@users.sourceforge.net Thu Sep 12 08:28:38 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Thu, 12 Sep 2002 00:28:38 -0700 Subject: [Spambayes-checkins] spambayes cdbhammie.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7172 Modified Files: cdbhammie.py Log Message: You don't need to specify -d to cdbhammie anymore. That is, it now works as advertised :) Index: cdbhammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cdbhammie.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** cdbhammie.py 12 Sep 2002 07:24:53 -0000 1.1 --- cdbhammie.py 12 Sep 2002 07:28:36 -0000 1.2 *************** *** 278,283 **** elif opt == '-p': pck = arg - elif opt == "-d": - usedb = True elif opt == "-f": do_filter = True --- 278,281 ---- *************** *** 289,305 **** save = False ! if usedb: ! bayes = PersistentGrahamBayes(pck) ! else: ! bayes = None ! try: ! fp = open(pck, 'rb') ! except IOError, e: ! if e.errno <> errno.ENOENT: raise ! else: ! bayes = pickle.load(fp) ! fp.close() ! if bayes is None: ! bayes = classifier.GrahamBayes() if good: --- 287,291 ---- save = False ! 
bayes = PersistentGrahamBayes(pck) if good: *************** *** 314,321 **** if save: bayes.update_probabilities() - if not usedb and pck: - fp = open(pck, 'wb') - pickle.dump(bayes, fp, 1) - fp.close() if do_filter: --- 300,303 ---- From tim.one@comcast.net Thu Sep 12 15:47:47 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 12 Sep 2002 10:47:47 -0400 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.16,1.17 In-Reply-To: <200209120713.g8C7DLj24609@localhost.localdomain> Message-ID: [Tim] >> Modified Files: >> tokenizer.py >> Log Message: >> Added code to strip uuencoded sections. As reported on the mailing list, >> this has no effect on my results, except that one spam in now judged as >> ham by all the other training sets. It shrinks the database size by a >> few percent, so that makes it a tiny win. If Anthony Baxter doesn't >> report better results on his data, I'll be sorely tempted to throw this >> out again. [Anthony Baxter] > I'd say nuke it: > > false positive percentages > 0.778 0.778 tied > 0.834 0.778 won -6.71% > 0.890 0.890 tied > 0.667 0.611 won -8.40% > 1.112 1.112 tied > 0.834 0.834 tied > 0.723 0.723 tied > 0.667 0.611 won -8.40% > 1.167 1.167 tied > 1.001 1.001 tied > 0.779 0.779 tied > 0.667 0.611 won -8.40% > 0.778 0.778 tied > 0.778 0.778 tied > 0.556 0.556 tied > 0.778 0.723 won -7.07% > 0.611 0.611 tied > 0.778 0.778 tied > 0.723 0.723 tied > 0.667 0.667 tied > > won 5 times > tied 15 times > lost 0 times > > total unique fp went from 143 to 141 won -1.40% > > false negative percentages > 0.646 0.646 tied > 0.904 0.904 tied > 0.517 0.581 lost +12.38% > 1.229 1.229 tied > 0.840 0.840 tied > 1.033 1.033 tied > 0.711 0.775 lost +9.00% > 1.164 1.164 tied > 0.646 0.646 tied > 0.711 0.711 tied > 0.646 0.711 lost +10.06% > 0.517 0.517 tied > 0.776 0.776 tied > 0.646 0.646 tied > 0.904 0.904 tied > 1.035 1.035 tied > 0.582 0.582 tied > 0.581 0.581 tied > 0.775 0.775 tied > 0.646 0.646 tied > > won 0 times > tied 17 times > lost 3 times So there's one spam in your Set4 that gets through when scored by Sets 1, 2 or 3 now, but two hams that are no longer called spam by any training set. That's a small win, so I'm inclined to leave it in after all (it's a cheap transformation, and keeps a bunch of useless "skip" tokens out of the database). From montanaro@users.sourceforge.net Thu Sep 12 20:33:56 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Thu, 12 Sep 2002 12:33:56 -0700 Subject: [Spambayes-checkins] spambayes rebal.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv1160 Modified Files: rebal.py Log Message: nearly complete rewrite which attempts to achieve the following: * allows specification of reservoir directory and prefix of set directories * will automatically fill any set directories which match the -s pattern * will migrate files in either direction - in theory, no files should be deleted * should be a bit more efficient so varying the numbers of trained ham and spam shouldn't be a big problem With no args it should work like the original Index: rebal.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rebal.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** rebal.py 5 Sep 2002 16:16:43 -0000 1.1 --- rebal.py 12 Sep 2002 19:33:54 -0000 1.2 *************** *** 1,58 **** ! import os ! import sys ! 
import random - ''' - dead = """ - Data/Ham/Set2/22467.txt - Data/Ham/Set5/31389.txt - Data/Ham/Set1/19642.txt """ ! for f in dead.split(): ! os.unlink(f) ! sys.exit(0) ! ''' NPERDIR = 4000 RESDIR = 'Data/Ham/reservoir' ! res = os.listdir(RESDIR) ! stuff = [] ! for i in range(1, 6): ! dir = 'Data/Ham/Set%d' % i ! fs = os.listdir(dir) ! stuff.append((dir, fs)) ! while stuff: ! dir, fs = stuff.pop() ! if len(fs) == NPERDIR: ! continue ! if len(fs) > NPERDIR: ! f = random.choice(fs) ! fs.remove(f) ! print "deleting", f, "from", dir ! os.unlink(dir + "/" + f) ! elif len(fs) < NPERDIR: ! print "need a new one for", dir ! f = random.choice(res) ! print "How about", f ! res.remove(f) ! fp = file(RESDIR + "/" + f, 'rb') ! guts = fp.read() ! fp.close() ! os.unlink(RESDIR + "/" + f) ! print guts ! ok = raw_input('good enough? ') ! if ok.startswith('y'): ! fp = file(dir + "/" + f, 'wb') ! fp.write(guts) ! fp.close() ! fs.append(f) ! stuff.append((dir, fs)) --- 1,166 ---- ! #!/usr/bin/env python """ + rebal.py - rebalance a ham or spam directory, moving files to or from + a reservoir directory as necessary. ! usage: rebal.py [ options ] ! options: ! -r res - specify an alternate reservoir [%(RESDIR)s] ! -s set - specify an alternate Set pfx [%(SETPFX)s] ! -n num - specify number of files per dir [%(NPERDIR)s] ! -v - tell user what's happening [%(VERBOSE)s] ! -q - be quiet about what's happening [not %(VERBOSE)s] ! -c - confirm file moves into Set directory [%(CONFIRM)s] ! -Q - be quiet and don't confirm moves ! The script will work with a variable number of Set directories, but they ! must already exist. ! ! Example: ! ! rebal.py -r reservoir -s Set -n 300 ! ! This will move random files between the directory 'reservoir' and the ! various subdirectories prefixed with 'Set', making sure no more than 300 ! files are left in the 'Set' directories when finished. ! ! Example: ! ! Suppose you want to shuffle your Set files around, winding up with 300 files ! in each one, you can execute: ! ! rebal.py -n 0 ! rebal.py -n 300 ! ! The first run will move all files from the various Data/Ham/Set directories ! to the Data/Ham/reservoir directory. The second run will randomly parcel ! out 300 files to each of the Data/Ham/Set directories. ! """ ! ! import os ! import sys ! import random ! import glob ! import getopt + # defaults NPERDIR = 4000 RESDIR = 'Data/Ham/reservoir' ! SETPFX = 'Data/Ham/Set' ! VERBOSE = True ! CONFIRM = True ! def usage(): ! print >> sys.stderr, """\ ! usage: rebal.py [ options ] ! options: ! -r res - specify an alternate reservoir [%(RESDIR)s] ! -s set - specify an alternate Set pfx [%(SETPFX)s] ! -n num - specify number of files per dir [%(NPERDIR)s] ! -v - tell user what's happening [%(VERBOSE)s] ! -q - be quiet about what's happening [not %(VERBOSE)s] ! -c - confirm file moves into Set directory [%(CONFIRM)s] ! -Q - be quiet and don't confirm moves ! """ % globals() ! ! def migrate(f, dir, verbose): ! """rename f into dir, making sure to avoid name clashes.""" ! base = os.path.split(f)[-1] ! if os.path.exists(os.path.join(dir,base)): ! # this path can get slow if we have a lot of name collisions ! # but we should rarely encounter that case (so he says smugly) ! reslist = [int(n) for n in os.listdir(dir)] ! reslist.sort() ! out = os.path.join(dir, "%d"%(reslist[-1]+1)) ! else: ! out = os.path.join(dir, base) ! if verbose: ! print "moving", f, "to", out ! os.rename(f, out) ! ! def main(args): ! nperdir = NPERDIR ! resdir = RESDIR ! setpfx = SETPFX ! verbose = VERBOSE ! confirm = CONFIRM ! 
! try: ! opts, args = getopt.getopt(args, "r:s:n:vqcQh") ! except getopt.GetoptError: ! usage() ! return 1 ! for opt, arg in opts: ! if opt == "-n": ! nperdir = int(arg) ! elif opt == "-r": ! resdir = arg ! elif opt == "-s": ! setpfx = arg ! elif opt == "-v": ! verbose = True ! elif opt == "-c": ! confirm = True ! elif opt == "-q": ! verbose = False ! elif opt == "-Q": ! verbose = confirm = False ! elif opt == "-h": ! usage() ! return 0 ! res = os.listdir(resdir) ! dirs = glob.glob(setpfx+"*") ! if dirs == []: ! print >> sys.stderr, "no directories beginning with", setpfx, "exist." ! return 1 ! stuff = [] ! n = len(res) ! for dir in dirs: ! fs = os.listdir(dir) ! n += len(fs) ! stuff.append((dir, fs)) ! if nperdir * len(dirs) > n: ! print >> sys.stderr, "not enough files to go around - use lower -n." ! return 1 ! # if necessary, migrate random files to the reservoir ! for (dir, fs) in stuff: ! if nperdir >= len(fs): ! continue ! ! random.shuffle(fs) ! movethese = fs[nperdir:] ! del fs[nperdir:] ! for f in movethese: ! migrate(os.path.join(dir,f), resdir, verbose) ! res.extend(movethese) ! ! # randomize reservoir once so we can just bite chunks from the front ! random.shuffle(res) ! ! # grow Set* directories from the reservoir ! for (dir, fs) in stuff: ! if nperdir == len(fs): ! continue ! ! movethese = res[:nperdir-len(fs)] ! res = res[nperdir-len(fs):] ! for f in movethese: ! if confirm: ! print file(os.path.join(resdir,f)).read() ! ok = raw_input('good enough? ').lower() ! if not ok.startswith('y'): ! continue ! migrate(os.path.join(resdir,f), dir, verbose) ! fs.extend(movethese) ! ! return 0 ! ! if __name__ == "__main__": ! sys.exit(main(sys.argv[1:])) From montanaro@users.sourceforge.net Thu Sep 12 20:35:16 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Thu, 12 Sep 2002 12:35:16 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.6,1.7 rates.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv1522 Modified Files: cmp.py rates.py Log Message: add #! lines Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** cmp.py 8 Sep 2002 18:38:59 -0000 1.6 --- cmp.py 12 Sep 2002 19:35:14 -0000 1.7 *************** *** 1,2 **** --- 1,4 ---- + #!/usr/bin/env python + """ cmp.py sbase1 sbase2 Index: rates.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rates.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** rates.py 7 Sep 2002 16:39:04 -0000 1.2 --- rates.py 12 Sep 2002 19:35:14 -0000 1.3 *************** *** 1,2 **** --- 1,4 ---- + #!/usr/bin/env python + """ rates.py basename From tim_one@users.sourceforge.net Fri Sep 13 00:59:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 12 Sep 2002 16:59:08 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.19,1.20 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18626 Modified Files: tokenizer.py Log Message: crack_urls(): Simpler tagging of embedded http etc thingies. 
Test results show that the fine distinctions being drawn were a waste of code: false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.050 0.025 won -50.00% 0.000 0.000 tied 0.025 0.025 tied 0.000 0.000 tied 0.075 0.075 tied 0.025 0.025 tied 0.025 0.025 tied 0.000 0.000 tied 0.050 0.025 won -50.00% 0.000 0.000 tied 0.025 0.025 tied 0.000 0.000 tied 0.000 0.000 tied 0.050 0.050 tied 0.025 0.025 tied 0.000 0.000 tied 0.025 0.025 tied 0.050 0.025 won -50.00% won 3 times tied 17 times lost 0 times total unique fp went from 8 to 8 tied false negative percentages 0.218 0.218 tied 0.364 0.364 tied 0.291 0.327 lost +12.37% 0.509 0.545 lost +7.07% 0.400 0.400 tied 0.218 0.218 tied 0.218 0.218 tied 0.582 0.545 won -6.36% 0.291 0.291 tied 0.255 0.255 tied 0.291 0.291 tied 0.582 0.582 tied 0.545 0.545 tied 0.255 0.255 tied 0.255 0.255 tied 0.400 0.400 tied 0.291 0.291 tied 0.218 0.218 tied 0.182 0.182 tied 0.145 0.182 lost +25.52% won 1 times tied 16 times lost 3 times total unique fn went from 86 to 87 lost +1.16% Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.19 retrieving revision 1.20 diff -C2 -d -r1.19 -r1.20 *** tokenizer.py 12 Sep 2002 04:19:38 -0000 1.19 --- tokenizer.py 12 Sep 2002 23:59:06 -0000 1.20 *************** *** 802,809 **** while guts and guts[-1] in '.:?!/': guts = guts[:-1] ! for i, piece in enumerate(guts.split('/')): ! prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1') for chunk in urlsep_re.split(piece): ! pushclue(prefix + chunk) i = end --- 802,808 ---- while guts and guts[-1] in '.:?!/': guts = guts[:-1] ! for piece in guts.split('/'): for chunk in urlsep_re.split(piece): ! pushclue("url:" + chunk) i = end From tim_one@users.sourceforge.net Fri Sep 13 01:14:21 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 12 Sep 2002 17:14:21 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.10,1.11 classifier.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv21941 Modified Files: Options.py classifier.py Log Message: Added new options section [Classifier], allowing to change HAMBIAS, SPAMBIAS, MIN_SPAMPROB, MAX_SPAMPROB, UNKNOWN_SPAMPROB and MAX_DISCRIMINATORS. Play with them at your own risk . Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** Options.py 12 Sep 2002 02:46:15 -0000 1.10 --- Options.py 13 Sep 2002 00:14:18 -0000 1.11 *************** *** 89,92 **** --- 89,103 ---- save_trained_pickles: False pickle_basename: class + + [Classifier] + # Fiddling these can have extreme effects. See classifier.py for comments. 
+ hambias: 2.0 + spambias: 1.0 + + min_spamprob: 0.01 + max_spamprob: 0.99 + unknown_spamprob: 0.5 + + max_discriminators: 16 """ *************** *** 115,118 **** --- 126,136 ---- 'show_charlimit': int_cracker, }, + 'Classifier': {'hambias': float_cracker, + 'spambias': float_cracker, + 'min_spamprob': float_cracker, + 'max_spamprob': float_cracker, + 'unknown_spamprob': float_cracker, + 'max_discriminators': int_cracker, + }, } Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** classifier.py 8 Sep 2002 03:17:31 -0000 1.5 --- classifier.py 13 Sep 2002 00:14:18 -0000 1.6 *************** *** 10,13 **** --- 10,15 ---- from sets import Set + from Options import options + # The count of each word in ham is artificially boosted by a factor of # HAMBIAS, and similarly for SPAMBIAS. Graham uses 2.0 and 1.0. Final *************** *** 26,31 **** # total unique false negatives goes up by a factor of 2.1 (337 -> 702) ! HAMBIAS = 2.0 ! SPAMBIAS = 1.0 # "And then there is the question of what probability to assign to words --- 28,33 ---- # total unique false negatives goes up by a factor of 2.1 (337 -> 702) ! HAMBIAS = options.hambias # 2.0 ! SPAMBIAS = options.spambias # 1.0 # "And then there is the question of what probability to assign to words *************** *** 35,40 **** # of training data is good enough to justify probabilities of 0 or 1. It # may justify probabilities outside this range, though. ! MIN_SPAMPROB = 0.01 ! MAX_SPAMPROB = 0.99 # The spam probability assigned to words never seen before. Graham used --- 37,42 ---- # of training data is good enough to justify probabilities of 0 or 1. It # may justify probabilities outside this range, though. ! MIN_SPAMPROB = options.min_spamprob # 0.01 ! MAX_SPAMPROB = options.max_spamprob # 0.99 # The spam probability assigned to words never seen before. Graham used *************** *** 50,54 **** # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious # on the face of it. ! UNKNOWN_SPAMPROB = 0.5 # "I only consider words that occur more than five times in total". --- 52,56 ---- # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious # on the face of it. ! UNKNOWN_SPAMPROB = options.unknown_spamprob # 0.5 # "I only consider words that occur more than five times in total". *************** *** 172,176 **** # was a pure win, lowering the false negative rate consistently, and it even # managed to tickle a couple rare false positives into "not spam" terrority. ! MAX_DISCRIMINATORS = 16 PICKLE_VERSION = 1 --- 174,178 ---- # was a pure win, lowering the false negative rate consistently, and it even # managed to tickle a couple rare false positives into "not spam" terrority. ! MAX_DISCRIMINATORS = options.max_discriminators # 16 PICKLE_VERSION = 1 From tim_one@users.sourceforge.net Fri Sep 13 01:27:58 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 12 Sep 2002 17:27:58 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.11,1.12 timtest.py,1.22,1.23 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25548 Modified Files: Options.py timtest.py Log Message: Incompatible change: show_best_discriminators has changed from a bool option to an int option, now giving the number of best discriminators to show. Set to 0 if you don't want to see any. 
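Incidentally, the heapreplace() loop this option feeds (see the timtest.py diff below) is the hand-rolled version of a pattern later Pythons spell directly. A sketch, assuming wordinfo maps words to records with a killcount attribute as in classifier.py; heapq.nlargest needs Python 2.4+, so the checkin can't use it yet:

    import heapq

    def best_discriminators(wordinfo, n):
        # Highest-killcount words first, ties broken by the word itself.
        return heapq.nlargest(n, ((r.killcount, w)
                                  for w, r in wordinfo.items()))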
Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** Options.py 13 Sep 2002 00:14:18 -0000 1.11 --- Options.py 13 Sep 2002 00:27:55 -0000 1.12 *************** *** 73,77 **** show_false_positives: True show_false_negatives: False ! show_best_discriminators: True # The maximum # of characters to display for a msg displayed due to the --- 73,84 ---- show_false_positives: True show_false_negatives: False ! ! # Near the end of Driver.test(), you can get a listing of the "best ! # discriminators" in the words from the training sets. These are the ! # words whose WordInfo.killcount values are highest, meaning they most ! # often were among the most extreme clues spamprob() found. The number ! # of best discriminators to show is given by show_best_discriminators; ! # set this <= 0 to suppress showing any of the best discriminators. ! show_best_discriminators: 30 # The maximum # of characters to display for a msg displayed due to the *************** *** 121,125 **** 'show_false_negatives': boolean_cracker, 'show_histograms': boolean_cracker, ! 'show_best_discriminators': boolean_cracker, 'save_trained_pickles': boolean_cracker, 'pickle_basename': string_cracker, --- 128,132 ---- 'show_false_negatives': boolean_cracker, 'show_histograms': boolean_cracker, ! 'show_best_discriminators': int_cracker, 'save_trained_pickles': boolean_cracker, 'pickle_basename': string_cracker, Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** timtest.py 12 Sep 2002 02:58:02 -0000 1.22 --- timtest.py 13 Sep 2002 00:27:55 -0000 1.23 *************** *** 238,245 **** printmsg(e, prob, clues) ! if options.show_best_discriminators: print print " best discriminators:" ! stats = [(-1, None) for i in range(30)] smallest_killcount = -1 for w, r in c.wordinfo.iteritems(): --- 238,245 ---- printmsg(e, prob, clues) ! if options.show_best_discriminators > 0: print print " best discriminators:" ! stats = [(-1, None)] * options.show_best_discriminators smallest_killcount = -1 for w, r in c.wordinfo.iteritems(): From tim_one@users.sourceforge.net Fri Sep 13 03:40:52 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 12 Sep 2002 19:40:52 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.20,1.21 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25025 Modified Files: tokenizer.py Log Message: Added comment about Reply-To (can't tell whether it's worth tokenizing; my error rates are too low now). Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** tokenizer.py 12 Sep 2002 23:59:06 -0000 1.20 --- tokenizer.py 13 Sep 2002 02:40:50 -0000 1.21 *************** *** 867,873 **** # becomes the most powerful indicator in the whole database. # ! # From: ! # Reply-To: ! for field in ('from',):# 'reply-to',): prefix = field + ':' x = msg.get(field, 'none').lower() --- 867,875 ---- # becomes the most powerful indicator in the whole database. # ! # From: # this helps both rates ! # Reply-To: # my error rates are too low now to tell about this ! # # one (smalls wins & losses across runs, overall ! 
# # not significant), so leaving it out ! for field in ('from',): prefix = field + ':' x = msg.get(field, 'none').lower() From tim_one@users.sourceforge.net Fri Sep 13 17:27:01 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 09:27:01 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,NONE,1.1 Options.py,1.12,1.13 README.txt,1.14,1.15 Tester.py,1.1,1.2 mboxtest.py,1.3,1.4 timtest.py,1.23,1.24 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv6489 Modified Files: Options.py README.txt Tester.py mboxtest.py timtest.py Added Files: TestDriver.py Log Message: Moved most of the reusable stuff out of timtest.py into the new TestDriver.py. Added new methods to various things for upcoming support of efficient N-fold cross validation. timtest.py still works exactly the way it did before, and I *hope* mboxtest.py does too but I'm not set up to test that one. --- NEW FILE: TestDriver.py --- # Loop: # # Set up a new base classifier for testing. # train(ham, spam) # # Run tests against (possibly variants of) this classifier. # Loop: # Optional: # # Forget training for some subset of ham and spam. This # # works against the base classifier trained at the start. # forget(ham, spam) # # Predict against other data. # Loop: # test(ham, spam) # # Display stats against all runs on this classifier variant. # finishtest() # # Display stats against all runs. # alldone() from sets import Set import cPickle as pickle from heapq import heapreplace from Options import options import Tester import classifier class Hist: """Simple histograms of float values in [0.0, 1.0].""" def __init__(self, nbuckets=20): self.buckets = [0] * nbuckets self.nbuckets = nbuckets def add(self, x): n = self.nbuckets i = int(n * x) if i >= n: i = n-1 self.buckets[i] += 1 def __iadd__(self, other): if self.nbuckets != other.nbuckets: raise ValueError('bucket size mismatch') for i in range(self.nbuckets): self.buckets[i] += other.buckets[i] return self def display(self, WIDTH=60): biggest = max(self.buckets) hunit, r = divmod(biggest, WIDTH) if r: hunit += 1 print "* =", hunit, "items" ndigits = len(str(biggest)) format = "%6.2f %" + str(ndigits) + "d" for i, n in enumerate(self.buckets): print format % (100.0 * i / self.nbuckets, n), print '*' * ((n + hunit - 1) // hunit) def printhist(tag, ham, spam): print print "Ham distribution for", tag ham.display() print print "Spam distribution for", tag spam.display() def printmsg(msg, prob, clues): print msg.tag print "prob =", prob for clue in clues: print "prob(%r) = %g" % clue print guts = str(msg) if options.show_charlimit > 0: guts = guts[:options.show_charlimit] print guts class Driver: def __init__(self): self.falsepos = Set() self.falseneg = Set() self.global_ham_hist = Hist(options.nbuckets) self.global_spam_hist = Hist(options.nbuckets) self.ntimes_train_called = 0 def train(self, ham, spam): self.classifier = classifier.GrahamBayes() t = self.tester = Tester.Test(self.classifier) print "Training on", ham, "&", spam, "...", t.train(ham, spam) print t.nham, "hams &", t.nspam, "spams" self.orig_nham = t.nham self.orig_nspam = t.nspam self.trained_ham_hist = Hist(options.nbuckets) self.trained_spam_hist = Hist(options.nbuckets) self.ntimes_train_called += 1 if options.save_trained_pickles: fname = "%s%d.pik" % (options.pickle_basename, self.ntimes_train_called) print " saving pickle to", fname fp = file(fname, 'wb') pickle.dump(self.classifier, fp, 1) fp.close() def forget(self, ham, spam): c = self.classifier t = 
self.tester nham, nspam = self.orig_nham, self.orig_nspam t.set_classifier(c.copy(), nham, nspam) print "Forgetting", ham, "&", spam, "...", t.untrain(ham, spam) print nham - t.nham, "hams &", nspam - t.nspam, "spams" self.trained_ham_hist = Hist(options.nbuckets) self.trained_spam_hist = Hist(options.nbuckets) def finishtest(self): if options.show_histograms: printhist("all in this training set:", self.trained_ham_hist, self.trained_spam_hist) self.global_ham_hist += self.trained_ham_hist self.global_spam_hist += self.trained_spam_hist def alldone(self): if options.show_histograms: printhist("all runs:", self.global_ham_hist, self.global_spam_hist) def test(self, ham, spam): c = self.classifier t = self.tester local_ham_hist = Hist(options.nbuckets) local_spam_hist = Hist(options.nbuckets) def new_ham(msg, prob, lo=options.show_ham_lo, hi=options.show_ham_hi): local_ham_hist.add(prob) if lo <= prob <= hi: print print "Ham with prob =", prob prob, clues = c.spamprob(msg, True) printmsg(msg, prob, clues) def new_spam(msg, prob, lo=options.show_spam_lo, hi=options.show_spam_hi): local_spam_hist.add(prob) if lo <= prob <= hi: print print "Spam with prob =", prob prob, clues = c.spamprob(msg, True) printmsg(msg, prob, clues) t.reset_test_results() print " testing against", ham, "&", spam, "...", t.predict(spam, True, new_spam) t.predict(ham, False, new_ham) print t.nham_tested, "hams &", t.nspam_tested, "spams" print " false positive:", t.false_positive_rate() print " false negative:", t.false_negative_rate() newfpos = Set(t.false_positives()) - self.falsepos self.falsepos |= newfpos print " new false positives:", [e.tag for e in newfpos] if not options.show_false_positives: newfpos = () for e in newfpos: print '*' * 78 prob, clues = c.spamprob(e, True) printmsg(e, prob, clues) newfneg = Set(t.false_negatives()) - self.falseneg self.falseneg |= newfneg print " new false negatives:", [e.tag for e in newfneg] if not options.show_false_negatives: newfneg = () for e in newfneg: print '*' * 78 prob, clues = c.spamprob(e, True) printmsg(e, prob, clues) if options.show_best_discriminators > 0: print print " best discriminators:" stats = [(-1, None)] * options.show_best_discriminators smallest_killcount = -1 for w, r in c.wordinfo.iteritems(): if r.killcount > smallest_killcount: heapreplace(stats, (r.killcount, w)) smallest_killcount = stats[0][0] stats.sort() for count, w in stats: if count < 0: continue r = c.wordinfo[w] print " %r %d %g" % (w, r.killcount, r.spamprob) if options.show_histograms: printhist("this pair:", local_ham_hist, local_spam_hist) self.trained_ham_hist += local_ham_hist self.trained_spam_hist += local_spam_hist Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** Options.py 13 Sep 2002 00:27:55 -0000 1.12 --- Options.py 13 Sep 2002 16:26:58 -0000 1.13 *************** *** 57,61 **** [TestDriver] ! # These control various displays in class Driver (timtest.py). # Number of buckets in histograms. --- 57,61 ---- [TestDriver] ! # These control various displays in class TestDriver.Driver. # Number of buckets in histograms. 
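[Aside: the "best discriminators" display in Driver.test() above uses an idiom worth isolating: keep a fixed-size heap seeded with dummy entries, and heapreplace() only when a candidate beats the current minimum, so a single pass over the whole wordinfo database stays cheap. A minimal standalone sketch of that idiom -- best_n and scored_items are illustrative names, not spambayes code:

    from heapq import heapreplace

    def best_n(scored_items, n):
        # Seed with n dummies; any real score >= 0 beats them.
        heap = [(-1, None)] * n
        smallest = -1
        for score, item in scored_items:
            # Touch the heap only when the candidate beats the minimum.
            if score > smallest:
                heapreplace(heap, (score, item))
                smallest = heap[0][0]
        heap.sort()
        return [pair for pair in heap if pair[0] >= 0]

    print best_n([(3, 'offers'), (1, 'python'), (7, 'free'), (2, 'wink')], 2)
    # prints [(3, 'offers'), (7, 'free')]

Each item costs at most one O(log n) heap operation, and items that can't make the cut cost only a comparison.]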
Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** README.txt 9 Sep 2002 19:24:52 -0000 1.14 --- README.txt 13 Sep 2002 16:26:58 -0000 1.15 *************** *** 20,35 **** ! Primary Files ! ============= Options.py ! A start at a flexible way to control what the tokenizer and ! classifier do. Different people are finding different ways in ! which their test data is biased, and so fiddle the code to ! worm around that. It's become almost impossible to know ! exactly what someone did when they report results. classifier.py An implementation of a Graham-like classifier. Tester.py A test-driver class that feeds streams of msgs to a classifier --- 20,44 ---- ! Primary Core Files ! ================== Options.py ! Uses ConfigParser to allow fiddling various aspects of the classifier, ! tokenizer, and test drivers. Create a file named bayescustomize.ini to ! alter the defaults; all options and their default values can be found ! in the string "defaults" near the top of Options.py, which is really ! an .ini file embedded in the module. Modules wishing to control ! aspects of their operation merely do ! ! from Options import options ! ! near the start, and consult attributes of options. classifier.py An implementation of a Graham-like classifier. + tokenizer.py + An implementation of tokenize() that Tim can't seem to help but keep + working on . + Tester.py A test-driver class that feeds streams of msgs to a classifier *************** *** 37,58 **** of false positives and false negatives. hammie.py ! A spamassassin-like filter which uses tokenizer (below) and ! classifier (above). Needs to be made faster, especially for writes. - mboxtest.py - A concrete test driver like timtest.py (see below), but working - with a pair of mailbox files rather than the specialized timtest - setup. ! tokenizer.py ! An implementation of tokenize() that Tim can't seem to help but keep ! working on . timtest.py ! A concrete test driver that uses Tester and classifier (above). This ! assumes "a standard" test data setup (see below). Could stand massive ! refactoring. You need to fiddle a line near the top to import a ! tokenize() function of your choosing. --- 46,75 ---- of false positives and false negatives. + TestDriver.py + A higher layer of test helpers, building on Tester above. It's + quite usable as-is for building simple test drivers, and more + complicated ones up to NxN test grids. It's in the process of being + extended to allow easy building of N-way cross validation drivers + (the trick to that is doing so efficiently). See also rates.py + and cmp.py below. + + + Apps + ==== hammie.py ! A spamassassin-like filter which uses tokenizer and classifier (above). ! Needs to be made faster, especially for writes. ! Concrete Test Drivers ! ===================== ! mboxtest.py ! A concrete test driver like timtest.py, but working with a pair of ! mailbox files rather than the specialized timtest setup. timtest.py ! A concrete test driver like mboxtest.py, but working with "a ! standard" test data setup (see below) rather than the specialized ! mboxtest setup. *************** *** 105,108 **** --- 122,131 ---- Standard Test Data Setup ======================== + [Caution: I'm going to switch this to support N-way cross validation, + instead of an NxN test grid. 
The only effect on the directory structure + here is that you'll want more directories with fewer msgs in each + (splitting the data at random into 10 pairs should work very well). + ] + Barry gave me mboxes, but the spam corpus I got off the web had one spam per file, and it only took two days of extreme pain to realize that one msg Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** Tester.py 5 Sep 2002 16:16:43 -0000 1.1 --- Tester.py 13 Sep 2002 16:26:58 -0000 1.2 *************** *** 2,10 **** # Pass a classifier instance (an instance of GrahamBayes). # Loop: ! # Optional: ! # Train it, via train(). ! # reset_test_results() # Loop: ! # invoke predict() with (probably new) examples # Optional: # suck out the results, via instance vrbls and --- 2,17 ---- # Pass a classifier instance (an instance of GrahamBayes). # Loop: ! # # Train the classifier with new ham and spam. ! # train(ham, spam) # this implies reset_test_results # Loop: ! # Optional: ! # # Possibly fiddle the classifier. ! # set_classifier() ! # # Forget messages the classifier was trained on. ! # untrain(ham, spam) # this implies reset_test_results ! # Optional: ! # reset_test_results() ! # # Predict against (presumably new) examples. ! # predict(ham, spam) # Optional: # suck out the results, via instance vrbls and *************** *** 13,20 **** def __init__(self, classifier): self.classifier = classifier # The number of ham and spam instances in the training data. ! self.nham = self.nspam = 0 ! self.reset_test_results() def reset_test_results(self): --- 20,33 ---- def __init__(self, classifier): + self.set_classifier(classifier, 0, 0) + self.reset_test_results() + + # Tell the tester which classifier to use, and how many ham and spam it's + # been trained on. + def set_classifier(self, classifier, nham, nspam): self.classifier = classifier # The number of ham and spam instances in the training data. ! self.nham = nham ! self.nspam = nspam def reset_test_results(self): *************** *** 33,38 **** # Train the classifier on streams of ham and spam. Updates probabilities ! # before returning. def train(self, hamstream=None, spamstream=None): learn = self.classifier.learn if hamstream is not None: --- 46,52 ---- # Train the classifier on streams of ham and spam. Updates probabilities ! # before returning, and resets test results. def train(self, hamstream=None, spamstream=None): + self.reset_test_results() learn = self.classifier.learn if hamstream is not None: *************** *** 46,49 **** --- 60,78 ---- self.classifier.update_probabilities() + # Untrain the classifier on streams of ham and spam. Updates + # probabilities before returning, and resets test results. + def untrain(self, hamstream=None, spamstream=None): + self.reset_test_results() + unlearn = self.classifier.unlearn + if hamstream is not None: + for example in hamstream: + unlearn(example, False, False) + self.nham -= 1 + if spamstream is not None: + for example in spamstream: + unlearn(example, True, False) + self.nspam -= 1 + self.classifier.update_probabilities() + # Run prediction on each sample in stream. You're swearing that stream # is entirely composed of spam (is_spam True), or of ham (is_spam False). *************** *** 113,117 **** >>> t = Test(GrahamBayes()) >>> t.train([good1, good2], [bad1]) - >>> t.reset_test_results() >>> t.predict([_Example('goodham', ['a', 'b']), ...
_Example('badham', ['d']) --- 142,145 ---- Index: mboxtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** mboxtest.py 12 Sep 2002 02:46:15 -0000 1.3 --- mboxtest.py 13 Sep 2002 16:26:58 -0000 1.4 *************** *** 8,12 **** One of unix, mmdf, mh, or qmail. Specifies mailbox format for ham and spam files. Default is unix. ! -n NSETS Number of test sets to create for a single mailbox. Default is 5. --- 8,12 ---- One of unix, mmdf, mh, or qmail. Specifies mailbox format for ham and spam files. Default is unix. ! -n NSETS Number of test sets to create for a single mailbox. Default is 5. *************** *** 19,27 **** """ - from tokenizer import Tokenizer, subject_word_re, tokenize_word, tokenize - from classifier import GrahamBayes - from Tester import Test - from timtest import Driver, Msg - import getopt import mailbox --- 19,22 ---- *************** *** 30,33 **** --- 25,32 ---- import sys + from tokenizer import Tokenizer, subject_word_re, tokenize_word, tokenize + from TestDriver import Driver + from timtest import Msg + mbox_fmts = {"unix": mailbox.PortableUnixMailbox, "mmdf": mailbox.MmdfMailbox, *************** *** 129,133 **** def main(args): global FMT ! FMT = "unix" NSETS = 5 --- 128,132 ---- def main(args): global FMT ! FMT = "unix" NSETS = 5 *************** *** 163,167 **** for ispam in randindices(nspam, NSETS): testsets.append((sort(iham), sort(ispam))) ! driver = Driver() --- 162,166 ---- for ispam in randindices(nspam, NSETS): testsets.append((sort(iham), sort(ispam))) ! driver = Driver() *************** *** 177,179 **** if __name__ == "__main__": sys.exit(main(sys.argv[1:])) - --- 176,177 ---- Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** timtest.py 13 Sep 2002 00:27:55 -0000 1.23 --- timtest.py 13 Sep 2002 16:26:58 -0000 1.24 *************** *** 20,31 **** import os import sys - from sets import Set - import cPickle as pickle - from heapq import heapreplace - import Tester - import classifier - from tokenizer import tokenize from Options import options program = sys.argv[0] --- 20,27 ---- import os import sys from Options import options + from tokenizer import tokenize + from TestDriver import Driver program = sys.argv[0] *************** *** 39,95 **** sys.exit(code) - class Hist: - def __init__(self, nbuckets=20): - self.buckets = [0] * nbuckets - self.nbuckets = nbuckets - - def add(self, x): - n = self.nbuckets - i = int(n * x) - if i >= n: - i = n-1 - self.buckets[i] += 1 - - def __iadd__(self, other): - if self.nbuckets != other.nbuckets: - raise ValueError('bucket size mismatch') - for i in range(self.nbuckets): - self.buckets[i] += other.buckets[i] - return self - - def display(self, WIDTH=60): - biggest = max(self.buckets) - hunit, r = divmod(biggest, WIDTH) - if r: - hunit += 1 - print "* =", hunit, "items" - - ndigits = len(str(biggest)) - format = "%6.2f %" + str(ndigits) + "d" - - for i, n in enumerate(self.buckets): - print format % (100.0 * i / self.nbuckets, n), - print '*' * ((n + hunit - 1) // hunit) - - def printhist(tag, ham, spam): - print - print "Ham distribution for", tag - ham.display() - - print - print "Spam distribution for", tag - spam.display() - - def printmsg(msg, prob, clues): - print msg.tag - print "prob =", prob - for 
clue in clues: - print "prob(%r) = %g" % clue - print - guts = str(msg) - if options.show_charlimit > 0: - guts = guts[:options.show_charlimit] - print guts - class Msg(object): def __init__(self, dir, name): --- 35,38 ---- *************** *** 125,129 **** yield Msg(directory, fname) ! def xproduce(self): import random directory = self.directory --- 68,72 ---- yield Msg(directory, fname) ! def produce(self): import random directory = self.directory *************** *** 136,261 **** def __iter__(self): return self.produce() - - - # Loop: - # train() # on ham and spam - # Loop: - # test() # on presumably new ham and spam - # finishtest() # display stats against all runs on training set - # alldone() # display stats against all runs - - class Driver: - - def __init__(self): - self.falsepos = Set() - self.falseneg = Set() - self.global_ham_hist = Hist(options.nbuckets) - self.global_spam_hist = Hist(options.nbuckets) - self.ntimes_train_called = 0 - - def train(self, ham, spam): - self.classifier = classifier.GrahamBayes() - t = self.tester = Tester.Test(self.classifier) - - print "Training on", ham, "&", spam, "...", - t.train(ham, spam) - print t.nham, "hams &", t.nspam, "spams" - - self.trained_ham_hist = Hist(options.nbuckets) - self.trained_spam_hist = Hist(options.nbuckets) - - self.ntimes_train_called += 1 - if options.save_trained_pickles: - fname = "%s%d.pik" % (options.pickle_basename, - self.ntimes_train_called) - print " saving pickle to", fname - fp = file(fname, 'wb') - pickle.dump(self.classifier, fp, 1) - fp.close() - - def finishtest(self): - if options.show_histograms: - printhist("all in this training set:", - self.trained_ham_hist, self.trained_spam_hist) - self.global_ham_hist += self.trained_ham_hist - self.global_spam_hist += self.trained_spam_hist - - def alldone(self): - if options.show_histograms: - printhist("all runs:", self.global_ham_hist, self.global_spam_hist) - - def test(self, ham, spam): - c = self.classifier - t = self.tester - local_ham_hist = Hist(options.nbuckets) - local_spam_hist = Hist(options.nbuckets) - - def new_ham(msg, prob, lo=options.show_ham_lo, - hi=options.show_ham_hi): - local_ham_hist.add(prob) - if lo <= prob <= hi: - print - print "Ham with prob =", prob - prob, clues = c.spamprob(msg, True) - printmsg(msg, prob, clues) - - def new_spam(msg, prob, lo=options.show_spam_lo, - hi=options.show_spam_hi): - local_spam_hist.add(prob) - if lo <= prob <= hi: - print - print "Spam with prob =", prob - prob, clues = c.spamprob(msg, True) - printmsg(msg, prob, clues) - - t.reset_test_results() - print " testing against", ham, "&", spam, "...", - t.predict(spam, True, new_spam) - t.predict(ham, False, new_ham) - print t.nham_tested, "hams &", t.nspam_tested, "spams" - - print " false positive:", t.false_positive_rate() - print " false negative:", t.false_negative_rate() - - newfpos = Set(t.false_positives()) - self.falsepos - self.falsepos |= newfpos - print " new false positives:", [e.tag for e in newfpos] - if not options.show_false_positives: - newfpos = () - for e in newfpos: - print '*' * 78 - prob, clues = c.spamprob(e, True) - printmsg(e, prob, clues) - - newfneg = Set(t.false_negatives()) - self.falseneg - self.falseneg |= newfneg - print " new false negatives:", [e.tag for e in newfneg] - if not options.show_false_negatives: - newfneg = () - for e in newfneg: - print '*' * 78 - prob, clues = c.spamprob(e, True) - printmsg(e, prob, clues) - - if options.show_best_discriminators > 0: - print - print " best discriminators:" - stats = [(-1, None)] 
* options.show_best_discriminators - smallest_killcount = -1 - for w, r in c.wordinfo.iteritems(): - if r.killcount > smallest_killcount: - heapreplace(stats, (r.killcount, w)) - smallest_killcount = stats[0][0] - stats.sort() - for count, w in stats: - if count < 0: - continue - r = c.wordinfo[w] - print " %r %d %g" % (w, r.killcount, r.spamprob) - - if options.show_histograms: - printhist("this pair:", local_ham_hist, local_spam_hist) - self.trained_ham_hist += local_ham_hist - self.trained_spam_hist += local_spam_hist def drive(nsets): --- 79,82 ---- From tim_one@users.sourceforge.net Fri Sep 13 17:55:20 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 09:55:20 -0700 Subject: [Spambayes-checkins] spambayes classifier.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16545 Modified Files: classifier.py Log Message: Removed GrahamBayes.DEBUG. It slows things down and I've never had a use for it (the options support printing lots of stuff from the test drivers, and that's always been plenty to resolve suspected bugs). Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** classifier.py 13 Sep 2002 00:14:18 -0000 1.6 --- classifier.py 13 Sep 2002 16:55:17 -0000 1.7 *************** *** 219,224 **** ) - DEBUG = False - def __init__(self): self.wordinfo = {} --- 219,222 ---- *************** *** 436,442 **** """ - if self.DEBUG: - print "spamprob(%r)" % wordstream - # A priority queue to remember the MAX_DISCRIMINATORS best # probabilities, where "best" means largest distance from 0.5. --- 434,437 ---- *************** *** 495,500 **** if evidence: clues.append((word, prob)) - if self.DEBUG: - print 'nbest P(%r) = %g' % (word, prob) prob_product *= prob / sp inverse_prob_product *= (1.0 - prob) / hp --- 490,493 ---- *************** *** 577,585 **** self.wordinfo[word] = record - if self.DEBUG: - print 'New probabilities:' - for w, r in self.wordinfo.iteritems(): - print "P(%r) = %g" % (w, r.spamprob) - def clearjunk(self, oldesttime): """Forget useless wordinfo records. This can shrink the database size. --- 570,573 ---- *************** *** 593,604 **** tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime] for w in tonuke: - if self.DEBUG: - print "clearjunk removing word %r: %r" % (w, r) del wordinfo[w] def _add_msg(self, wordstream, is_spam): - if self.DEBUG: - print "_add_msg(%r, %r)" % (wordstream, is_spam) - if is_spam: self.nspam += 1 --- 581,587 ---- *************** *** 620,631 **** wordinfo[word] = record - if self.DEBUG: - print "new count for %r = %d" % (word, - is_spam and record.spamcount or record.hamcount) - def _remove_msg(self, wordstream, is_spam): - if self.DEBUG: - print "_remove_msg(%r, %r)" % (wordstream, is_spam) - if is_spam: if self.nspam <= 0: --- 603,607 ---- From tim_one@users.sourceforge.net Fri Sep 13 18:49:06 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 10:49:06 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.1,1.2 Tester.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30444 Modified Files: TestDriver.py Tester.py Log Message: A little closer to N-fold cross validation. Removed the Tester nham and nspam attributes. 
If used properly, they should have exactly the same values as the classifier's attributes of the same names. Duplicating the info just created more chances to screw up. Changed when classifier pickles are saved, from immediately after training to Driver.finishtest(). This way meaningful killcounts are pickled. Since WordInfo.spamprob is almost never 0.5 anymore, it would be nice to have another gimmick for pruning junk from the database that doesn't rely on months going by to see which records remain unused. It *may* work well to prune away WordInfo records that never survived into spamprob()'s nbest list during testing. That's speculation and needs to be verified via testing; I don't expect to get to that in the near future, though; note that testing this would require splitting the data in a different way, since, by construction, a word with killcount=0 had no effect whatsoever on any outcome during predictions. A very quick check suggested that about half the words in a database do have killcount 0; I'm surprised it's not a lot more than that, so maybe I did something wrong; or maybe that's really how things are. Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** TestDriver.py 13 Sep 2002 16:26:58 -0000 1.1 --- TestDriver.py 13 Sep 2002 17:49:02 -0000 1.2 *************** *** 12,15 **** --- 12,17 ---- # test(ham, spam) # # Display stats against all runs on this classifier variant. + # # This also saves the trained classifier, if desired (option + # # save_trained_pickles). # finishtest() # # Display stats against all runs. *************** *** 86,123 **** self.global_ham_hist = Hist(options.nbuckets) self.global_spam_hist = Hist(options.nbuckets) ! self.ntimes_train_called = 0 def train(self, ham, spam): ! self.classifier = classifier.GrahamBayes() ! t = self.tester = Tester.Test(self.classifier) print "Training on", ham, "&", spam, "...", t.train(ham, spam) ! print t.nham, "hams &", t.nspam, "spams" ! self.orig_nham = t.nham ! self.orig_nspam = t.nspam self.trained_ham_hist = Hist(options.nbuckets) self.trained_spam_hist = Hist(options.nbuckets) - self.ntimes_train_called += 1 - if options.save_trained_pickles: - fname = "%s%d.pik" % (options.pickle_basename, - self.ntimes_train_called) - print " saving pickle to", fname - fp = file(fname, 'wb') - pickle.dump(self.classifier, fp, 1) - fp.close() - def forget(self, ham, spam): ! c = self.classifier ! t = self.tester ! nham, nspam = self.orig_nham, self.orig_nspam ! t.set_classifier(c.copy(), nham, nspam) print "Forgetting", ham, "&", spam, "...", ! t.untrain(ham, spam) ! print nham - t.nham, "hams &", nspam - t.nspam, "spams" self.trained_ham_hist = Hist(options.nbuckets) self.trained_spam_hist = Hist(options.nbuckets) --- 88,118 ---- self.global_ham_hist = Hist(options.nbuckets) self.global_spam_hist = Hist(options.nbuckets) ! self.ntimes_finishtest_called = 0 def train(self, ham, spam): ! c = self.classifier = classifier.GrahamBayes() ! t = self.tester = Tester.Test(c) print "Training on", ham, "&", spam, "...", t.train(ham, spam) ! print c.nham, "hams &", c.nspam, "spams" self.trained_ham_hist = Hist(options.nbuckets) self.trained_spam_hist = Hist(options.nbuckets) def forget(self, ham, spam): ! import copy print "Forgetting", ham, "&", spam, "...", ! c = self.classifier ! nham, nspam = c.nham, c.nspam ! c = copy.deepcopy(c) ! t.set_classifier(c) ! !
self.tester.untrain(ham, spam) ! print nham - c.nham, "hams &", nspam - c.nspam, "spams" + self.global_ham_hist += self.trained_ham_hist + self.global_spam_hist += self.trained_spam_hist self.trained_ham_hist = Hist(options.nbuckets) self.trained_spam_hist = Hist(options.nbuckets) *************** *** 129,132 **** --- 124,136 ---- self.global_ham_hist += self.trained_ham_hist self.global_spam_hist += self.trained_spam_hist + + self.ntimes_finishtest_called += 1 + if options.save_trained_pickles: + fname = "%s%d.pik" % (options.pickle_basename, + self.ntimes_finishtest_called) + print " saving pickle to", fname + fp = file(fname, 'wb') + pickle.dump(self.classifier, fp, 1) + fp.close() def alldone(self): Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** Tester.py 13 Sep 2002 16:26:58 -0000 1.2 --- Tester.py 13 Sep 2002 17:49:02 -0000 1.3 *************** *** 20,33 **** def __init__(self, classifier): ! self.set_classifier(classifier, 0, 0) self.reset_test_results() ! # Tell the tester which classifier to use, and how many ham and spam it's ! # been trained on. ! def set_classifier(self, classifier, nham, nspam): self.classifier = classifier - # The number of ham and spam instances in the training data. - self.nham = nham - self.nspam = nspam def reset_test_results(self): --- 20,29 ---- def __init__(self, classifier): ! self.set_classifier(classifier) self.reset_test_results() ! # Tell the tester which classifier to use. ! def set_classifier(self, classifier): self.classifier = classifier def reset_test_results(self): *************** *** 53,61 **** for example in hamstream: learn(example, False, False) - self.nham += 1 if spamstream is not None: for example in spamstream: learn(example, True, False) - self.nspam += 1 self.classifier.update_probabilities() --- 49,55 ---- *************** *** 68,76 **** for example in hamstream: unlearn(example, False, False) - self.nham -= 1 if spamstream is not None: for example in spamstream: unlearn(example, True, False) - self.nspam -= 1 self.classifier.update_probabilities() --- 62,68 ---- From tim_one@users.sourceforge.net Fri Sep 13 19:48:44 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 11:48:44 -0700 Subject: [Spambayes-checkins] spambayes timtest.py,1.24,1.25 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv27680 Modified Files: timtest.py Log Message: Checked in a temp change by mistake. Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** timtest.py 13 Sep 2002 16:26:58 -0000 1.24 --- timtest.py 13 Sep 2002 18:48:42 -0000 1.25 *************** *** 68,72 **** yield Msg(directory, fname) ! def produce(self): import random directory = self.directory --- 68,72 ---- yield Msg(directory, fname) ! 
def xproduce(self): import random directory = self.directory From tim_one@users.sourceforge.net Fri Sep 13 20:33:06 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 12:33:06 -0700 Subject: [Spambayes-checkins] spambayes timcv.py,NONE,1.1 README.txt,1.15,1.16 TestDriver.py,1.2,1.3 classifier.py,1.7,1.8 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8875 Modified Files: README.txt TestDriver.py classifier.py Added Files: timcv.py Log Message: timcv may or may not be a working N-fold cross validating test driver. If it's not, it's getting close . This turned up a few bugs in other places, primarily that GrahamBayes._remove_msg() didn't delete a word record if the spam and ham counts both fell to 0. It's a subtle invariant of the scheme that at least one of those counts is non-zero. --- NEW FILE: timcv.py --- #! /usr/bin/env python # At the moment, this requires Python 2.3 from CVS (heapq, Set, enumerate). # A driver for N-fold cross validation. """Usage: %(program)s [-h] -n nsets Where: -h Show usage and exit. -n int Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...). This is required. In addition, an attempt is made to merge bayescustomize.ini into the options. If that exists, it can be used to change the settings in Options.options. """ import os import sys from Options import options from tokenizer import tokenize from TestDriver import Driver program = sys.argv[0] def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) class Msg(object): def __init__(self, dir, name): path = dir + "/" + name self.tag = path f = open(path, 'rb') guts = f.read() f.close() self.guts = guts def __iter__(self): return tokenize(self.guts) def __hash__(self): return hash(self.tag) def __eq__(self, other): return self.tag == other.tag def __str__(self): return self.guts class MsgStream(object): def __init__(self, tag, directories): self.tag = tag self.directories = directories def __str__(self): return self.tag def produce(self): for directory in self.directories: for fname in os.listdir(directory): yield Msg(directory, fname) def __iter__(self): return self.produce() def drive(nsets): print options.display() hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets+1)] spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)] d = Driver() # Train it on all the data. d.train(MsgStream("%s-%d" % (hamdirs[0], nsets), hamdirs), MsgStream("%s-%d" % (spamdirs[0], nsets), spamdirs)) # Now run nsets times, removing one pair per run. 
for i in range(nsets): h = hamdirs[:] s = spamdirs[:] hexclude = h.pop(i) sexclude = s.pop(i) d.forget(MsgStream(hexclude, [hexclude]), MsgStream(sexclude, [sexclude])) d.test(MsgStream("Data/Ham/*-Set%d" % (i+1), h), MsgStream("Data/Spam/*-Set%d" % (i+1), s)) d.finishtest() d.alldone() if __name__ == "__main__": import getopt try: opts, args = getopt.getopt(sys.argv[1:], 'hn:') except getopt.error, msg: usage(1, msg) nsets = None for opt, arg in opts: if opt == '-h': usage(0) elif opt == '-n': nsets = int(arg) if args: usage(1, "Positional arguments not supported") if nsets is None: usage(1, "-n is required") drive(nsets) Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** README.txt 13 Sep 2002 16:26:58 -0000 1.15 --- README.txt 13 Sep 2002 19:33:04 -0000 1.16 *************** *** 73,76 **** --- 73,81 ---- mboxtest setup. + timcv.py + A first stab at an N-fold cross-validating test driver. Assumes + "a standard" data directory setup (see below). + Subject to arbitrary change. + Test Utilities Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** TestDriver.py 13 Sep 2002 17:49:02 -0000 1.2 --- TestDriver.py 13 Sep 2002 19:33:04 -0000 1.3 *************** *** 104,112 **** import copy ! print "Forgetting", ham, "&", spam, "...", c = self.classifier nham, nspam = c.nham, c.nspam c = copy.deepcopy(c) ! t.set_classifier(c) self.tester.untrain(ham, spam) --- 104,112 ---- import copy ! print " forgetting", ham, "&", spam, "...", c = self.classifier nham, nspam = c.nham, c.nspam c = copy.deepcopy(c) ! self.tester.set_classifier(c) self.tester.untrain(ham, spam) Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** classifier.py 13 Sep 2002 16:55:17 -0000 1.7 --- classifier.py 13 Sep 2002 19:33:04 -0000 1.8 *************** *** 623,624 **** --- 623,626 ---- if record.hamcount > 0: record.hamcount -= 1 + if record.hamcount == 0 == record.spamcount: + del self.wordinfo[word] From tim_one@users.sourceforge.net Fri Sep 13 20:46:43 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 12:46:43 -0700 Subject: [Spambayes-checkins] spambayes classifier.py,1.8,1.9 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13430 Modified Files: classifier.py Log Message: Class WordInfo: Noted a subtle invariant in a comment. Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** classifier.py 13 Sep 2002 19:33:04 -0000 1.8 --- classifier.py 13 Sep 2002 19:46:41 -0000 1.9 *************** *** 185,188 **** --- 185,192 ---- 'spamprob', # prob(spam | msg contains this word) ) + + # Invariant: For use in a classifier database, at least one of + # spamcount and hamcount must be non-zero. + # # (*)atime is the last access time, a UTC time.time() value. 
It's the # most recent time this word was used by scoring (i.e., by spamprob(), From tim_one@users.sourceforge.net Fri Sep 13 20:59:37 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 12:59:37 -0700 Subject: [Spambayes-checkins] spambayes timcv.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv17794 Modified Files: timcv.py Log Message: Msg.__init__: tiny simplification. Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** timcv.py 13 Sep 2002 19:33:04 -0000 1.1 --- timcv.py 13 Sep 2002 19:59:35 -0000 1.2 *************** *** 39,45 **** self.tag = path f = open(path, 'rb') ! guts = f.read() f.close() - self.guts = guts def __iter__(self): --- 39,44 ---- self.tag = path f = open(path, 'rb') ! self.guts = f.read() f.close() def __iter__(self): From tim_one@users.sourceforge.net Fri Sep 13 21:35:40 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 13:35:40 -0700 Subject: [Spambayes-checkins] spambayes timcv.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv29041 Modified Files: timcv.py Log Message: Fixed some major brainos, but this is still hosed. Worse, thanks at least to giant pickle memos and giant deepcopy memos, running just a 3-fold c-v on 3 of my test directory pairs takes more than 3X the memory of running the 5x5 test grid over all 5 directory pairs. So this isn't at all usable yet. Luckily, it's not working right anyway . Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** timcv.py 13 Sep 2002 19:59:35 -0000 1.2 --- timcv.py 13 Sep 2002 20:35:37 -0000 1.3 *************** *** 83,94 **** # Now run nsets times, removing one pair per run. for i in range(nsets): ! h = hamdirs[:] ! s = spamdirs[:] ! hexclude = h.pop(i) ! sexclude = s.pop(i) ! d.forget(MsgStream(hexclude, [hexclude]), ! MsgStream(sexclude, [sexclude])) ! d.test(MsgStream("Data/Ham/*-Set%d" % (i+1), h), ! MsgStream("Data/Spam/*-Set%d" % (i+1), s)) d.finishtest() d.alldone() --- 83,92 ---- # Now run nsets times, removing one pair per run. for i in range(nsets): ! h = hamdirs[i] ! s = spamdirs[i] ! hamstream = MsgStream(h, [h]) ! spamstream = MsgStream(s, [s]) ! d.forget(hamstream, spamstream) ! d.test(hamstream, spamstream) d.finishtest() d.alldone() From rubiconx@users.sourceforge.net Fri Sep 13 22:27:28 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Fri, 13 Sep 2002 14:27:28 -0700 Subject: [Spambayes-checkins] spambayes cdbhammie.py,1.2,NONE cdbwrap.py,1.1,NONE Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15963 Removed Files: cdbhammie.py cdbwrap.py Log Message: Taking out the cdb stuff, as I'm not going to pursue it further. It's in the attic now if anyone wants to mess with it later.
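[Aside: the _remove_msg() fix in the classifier.py checkins above deserves a standalone illustration, since the invariant is easy to violate when unlearning. A minimal sketch of the bookkeeping -- Record, wordinfo and unlearn_word are illustrative names, not the classifier's actual code:

    class Record:
        def __init__(self):
            self.spamcount = self.hamcount = 0

    def unlearn_word(wordinfo, word, is_spam):
        record = wordinfo[word]
        if is_spam:
            if record.spamcount > 0:
                record.spamcount -= 1
        else:
            if record.hamcount > 0:
                record.hamcount -= 1
        # The subtle invariant: a record may exist in the database only
        # if at least one count is non-zero.  Without this deletion,
        # untraining leaves 0/0 corpses behind -- the bug the checkin
        # fixed.
        if record.hamcount == 0 == record.spamcount:
            del wordinfo[word]
]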
--- cdbhammie.py DELETED --- --- cdbwrap.py DELETED --- From tim_one@users.sourceforge.net Sat Sep 14 01:03:53 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 17:03:53 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.16,1.17 TestDriver.py,1.3,1.4 cmp.py,1.7,1.8 mboxtest.py,1.4,1.5 rates.py,1.3,1.4 timcv.py,1.3,1.4 timtest.py,1.25,1.26 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv23741 Modified Files: README.txt TestDriver.py cmp.py mboxtest.py rates.py timcv.py timtest.py Log Message: Lots of small changes to support N-fold cross validation properly. timcv.py now does this. The pragmatic problem with giant pickle memos and giant deepcopy memos is gone -- instead the test driver has to take more care to train and untrain appropriate pieces explicitly. This is actually easy (see timcv). TestDriver.Driver now prints statistics with a recognizable pattern at the start of the line, so that rates.py doesn't feel so arbitrary anymore. rates.py and cmp.py were changed accordingly. rates.py now puts a lot more stuff in the summary, including accounts of how many ham and spam were trained on, and predicted against, in each test run. Driver() clients have to explicitly tell Driver when they want a new classifier now; I changed timtest and mboxtest to do that, but am not set up to exercise mboxtest. Driver, rates and cmp no longer make assumptions about the *kind* of test being run, and work equally well for, e.g., NxN grids or N-fold c-v. rates.py also computes the average f-p and f-n rates now, and cmp.py displays before-and-after values for those too. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** README.txt 13 Sep 2002 19:33:04 -0000 1.16 --- README.txt 14 Sep 2002 00:03:51 -0000 1.17 *************** *** 14,18 **** later -- as is, the false positive rate has gotten too small to measure reliably across test sets with 4000 hams + 2750 spams, but the false ! negative rate is still over 1%. The code here depends in various ways on the latest Python from CVS --- 14,19 ---- later -- as is, the false positive rate has gotten too small to measure reliably across test sets with 4000 hams + 2750 spams, but the false ! negative rate is still over 1%. Later: the f-n rate has also gotten ! too small to measure reliably across that much training data. The code here depends in various ways on the latest Python from CVS *************** *** 47,56 **** TestDriver.py ! A higher layer of test helpers, building on Tester above. It's ! quite usable as-is for building simple test drivers, and more ! complicated ones up to NxN test grids. It's in the process of being ! extended to allow easy building of N-way cross validation drivers ! (the trick to that is doing so efficiently). See also rates.py ! and cmp.py below. --- 48,55 ---- TestDriver.py ! A flexible higher layer of test helpers, building on Tester above. ! For example, it's usable for building simple test drivers, NxN test ! grids, and N-fold cross validation drivers. See also rates.py and ! cmp.py below. *************** *** 71,75 **** A concrete test driver like mboxtest.py, but working with "a standard" test data setup (see below) rather than the specialized ! mboxtest setup.
timcv.py --- 70,74 ---- A concrete test driver like mboxtest.py, but working with "a standard" test data setup (see below) rather than the specialized ! mboxtest setup. This runs an NxN test grid, skipping the diagonal. timcv.py *************** *** 82,92 **** ============== rates.py ! Scans the output (so far) from timtest.py, and captures summary ! statistics. cmp.py Given two summary files produced by rates.py, displays an account of all the f-p and f-n rates side-by-side, along with who won which ! (etc), and the change in total # of f-ps and f-n. --- 81,92 ---- ============== rates.py ! Scans the output (so far) produced by TestDriver.Drive(), and captures ! summary statistics. cmp.py Given two summary files produced by rates.py, displays an account of all the f-p and f-n rates side-by-side, along with who won which ! (etc), the change in total # of unique false positives and negatives, ! and the change in average f-p and f-n rates. *************** *** 127,136 **** Standard Test Data Setup ======================== - [Caution: I'm going to switch this to support N-way cross validation, - instead of an NxN test grid. The only effect on the directory structure - here is that you'll want more directories with fewer msgs in each - (splitting the data at random into 10 pairs should work very well). - ] - Barry gave me mboxes, but the spam corpus I got off the web had one spam per file, and it only took two days of extreme pain to realize that one msg --- 127,130 ---- *************** *** 142,145 **** --- 136,142 ---- The directory structure under my spambayes directory looks like so: + [But due to a better testing infrastructure, I'm going to spread this + across 20 subdirectories under Spam and under Ham, and use groups + of 10 for 10-fold cross validation] Data/ *************** *** 159,167 **** If you use the same names and structure, huge mounds of the tedious testing ! code will work as-is. The more Set directories the merrier, although ! you'll hit a point of diminishing returns if you exceed 10. The "reservoir" ! directory contains a few thousand other random hams. When a ham is found ! that's really spam, I delete it, and then the rebal.py utility moves in a ! message at random from the reservoir to replace it. If I had it to do over again, I think I'd move such spam into a Spam set (chosen at random), instead of deleting it. --- 156,164 ---- If you use the same names and structure, huge mounds of the tedious testing ! code will work as-is. The more Set directories the merrier, although you ! want at least a few hundred messages in each one. The "reservoir" directory ! contains a few thousand other random hams. When a ham is found that's ! really spam, I delete it, and then the rebal.py utility moves in a message ! at random from the reservoir to replace it. If I had it to do over again, I think I'd move such spam into a Spam set (chosen at random), instead of deleting it. *************** *** 172,176 **** ! The sets are grouped into 5 pairs in the obvious way: Spam/Set1 with Ham/Set1, and so on. For each such pair, timtest trains a classifier on that pair, then runs predictions on each of the other 4 pairs. In effect, --- 169,173 ---- ! The sets are grouped into pairs in the obvious way: Spam/Set1 with Ham/Set1, and so on. For each such pair, timtest trains a classifier on that pair, then runs predictions on each of the other 4 pairs. 
In effect, *************** *** 178,179 **** --- 175,186 ---- to avoid predicting against the same set trained on, except that it takes more time and seems the least interesting thing to try. + + Later, support for N-fold cross validation testing was added, which allows + more accurate measurement of error rates with smaller amounts of training + data. That's recommended now. + + CAUTION: The partitioning of your corpora across directories should + be random. If it isn't, bias creeps into the test results. This is + usually screamingly obvious under the NxN grid method (rates vary by a + factor of 10 or more across training sets, and even within runs against + a single training set), but harder to spot using N-fold c-v. Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** TestDriver.py 13 Sep 2002 19:33:04 -0000 1.3 --- TestDriver.py 14 Sep 2002 00:03:51 -0000 1.4 *************** *** 1,11 **** # Loop: ! # # Set up a new base classifier for testing. ! # train(ham, spam) # # Run tests against (possibly variants of) this classifier. # Loop: ! # Optional: ! # # Forget training for some subset of ham and spam. This ! # # works against the base classifier trained at the start. ! # forget(ham, spam) # # Predict against other data. # Loop: --- 1,15 ---- # Loop: ! # Optional: ! # # Set up a new base classifier for testing. ! # new_classifier() # # Run tests against (possibly variants of) this classifier. # Loop: ! # Loop: ! # Optional: ! # # train on more ham and spam ! # train(ham, spam) ! # Optional: ! # # Forget training for some subset of ham and spam. ! # untrain(ham, spam) # # Predict against other data. # Loop: *************** *** 89,121 **** self.global_spam_hist = Hist(options.nbuckets) self.ntimes_finishtest_called = 0 ! def train(self, ham, spam): c = self.classifier = classifier.GrahamBayes() ! t = self.tester = Tester.Test(c) ! ! print "Training on", ham, "&", spam, "...", ! t.train(ham, spam) ! print c.nham, "hams &", c.nspam, "spams" ! self.trained_ham_hist = Hist(options.nbuckets) self.trained_spam_hist = Hist(options.nbuckets) ! def forget(self, ham, spam): ! import copy ! ! print " forgetting", ham, "&", spam, "...", c = self.classifier nham, nspam = c.nham, c.nspam ! c = copy.deepcopy(c) ! self.tester.set_classifier(c) self.tester.untrain(ham, spam) print nham - c.nham, "hams &", nspam - c.nspam, "spams" - self.global_ham_hist += self.trained_ham_hist - self.global_spam_hist += self.trained_spam_hist - self.trained_ham_hist = Hist(options.nbuckets) - self.trained_spam_hist = Hist(options.nbuckets) - def finishtest(self): if options.show_histograms: --- 93,118 ---- self.global_spam_hist = Hist(options.nbuckets) self.ntimes_finishtest_called = 0 + self.new_classifier() ! def new_classifier(self): c = self.classifier = classifier.GrahamBayes() ! self.tester = Tester.Test(c) self.trained_ham_hist = Hist(options.nbuckets) self.trained_spam_hist = Hist(options.nbuckets) ! def train(self, ham, spam): ! print "-> Training on", ham, "&", spam, "...", c = self.classifier nham, nspam = c.nham, c.nspam ! self.tester.train(ham, spam) !
print c.nham - nham, "hams &", c.nspam- nspam, "spams" + def untrain(self, ham, spam): + print "-> Forgetting", ham, "&", spam, "...", + c = self.classifier + nham, nspam = c.nham, c.nspam self.tester.untrain(ham, spam) print nham - c.nham, "hams &", nspam - c.nspam, "spams" def finishtest(self): if options.show_histograms: *************** *** 124,127 **** --- 121,126 ---- self.global_ham_hist += self.trained_ham_hist self.global_spam_hist += self.trained_spam_hist + self.trained_ham_hist = Hist(options.nbuckets) + self.trained_spam_hist = Hist(options.nbuckets) self.ntimes_finishtest_called += 1 *************** *** 163,177 **** t.reset_test_results() ! print " testing against", ham, "&", spam, "...", t.predict(spam, True, new_spam) t.predict(ham, False, new_ham) ! print t.nham_tested, "hams &", t.nspam_tested, "spams" ! print " false positive:", t.false_positive_rate() ! print " false negative:", t.false_negative_rate() newfpos = Set(t.false_positives()) - self.falsepos self.falsepos |= newfpos ! print " new false positives:", [e.tag for e in newfpos] if not options.show_false_positives: newfpos = () --- 162,179 ---- t.reset_test_results() ! print "-> Predicting", ham, "&", spam, "..." t.predict(spam, True, new_spam) t.predict(ham, False, new_ham) ! print "-> tested", t.nham_tested, "hams &", t.nspam_tested, \ ! "spams against", c.nham, "hams &", c.nspam, "spams" ! print "-> false positive %:", t.false_positive_rate() ! print "-> false negative %:", t.false_negative_rate() newfpos = Set(t.false_positives()) - self.falsepos self.falsepos |= newfpos ! print "-> %d new false positives" % len(newfpos) ! if newfpos: ! print " new fp:", [e.tag for e in newfpos] if not options.show_false_positives: newfpos = () *************** *** 183,187 **** newfneg = Set(t.false_negatives()) - self.falseneg self.falseneg |= newfneg ! print " new false negatives:", [e.tag for e in newfneg] if not options.show_false_negatives: newfneg = () --- 185,191 ---- newfneg = Set(t.false_negatives()) - self.falseneg self.falseneg |= newfneg ! print "-> %d new false negatives" % len(newfneg) ! if newfneg: ! print " new fn:", [e.tag for e in newfneg] if not options.show_false_negatives: newfneg = () Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** cmp.py 12 Sep 2002 19:35:14 -0000 1.7 --- cmp.py 14 Sep 2002 00:03:51 -0000 1.8 *************** *** 16,39 **** # list of all f-n rates, # total f-p, ! # total f-n) # from summary file f. def suck(f): fns = [] fps = [] while 1: ! line = f.readline() if line.startswith('total'): break ! if not line.startswith('Training'): ! # A line with an f-p rate and an f-n rate. ! p, n = map(float, line.split()) ! fps.append(p) ! fns.append(n) ! # "total false pos 8 0.04" ! # "total false neg 249 1.81090909091" ! fptot = int(line.split()[-2]) ! fntot = int(f.readline().split()[-2]) ! return fps, fns, fptot, fntot def tag(p1, p2): --- 16,49 ---- # list of all f-n rates, # total f-p, ! # total f-n, ! # average f-p rate, ! # average f-n rate) # from summary file f. def suck(f): fns = [] fps = [] + get = f.readline while 1: ! line = get() ! if line.startswith('-> tested'): ! print line, ! if line.startswith('-> '): ! continue if line.startswith('total'): break ! # A line with an f-p rate and an f-n rate. ! p, n = map(float, line.split()) ! fps.append(p) ! fns.append(n) ! # "total unique false pos 0" ! # "total unique false neg 0" ! 
# "average fp % 0.0" ! # "average fn % 0.0" ! fptot = int(line.split()[-1]) ! fntot = int(get().split()[-1]) ! fpmean = float(get().split()[-1]) ! fnmean = float(get().split()[-1]) ! return fps, fns, fptot, fntot, fpmean, fnmean def tag(p1, p2): *************** *** 60,72 **** print - fp1, fn1, fptot1, fntot1 = suck(file(f1n + '.txt')) - fp2, fn2, fptot2, fntot2 = suck(file(f2n + '.txt')) print f1n, '->', f2n print print "false positive percentages" dump(fp1, fp2) print "total unique fp went from", fptot1, "to", fptot2, tag(fptot1, fptot2) print --- 70,84 ---- print print f1n, '->', f2n + fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n + '.txt')) + fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n + '.txt')) + print print "false positive percentages" dump(fp1, fp2) print "total unique fp went from", fptot1, "to", fptot2, tag(fptot1, fptot2) + print "mean fp % went from", fpmean1, "to", fpmean2, tag(fpmean1, fpmean2) print *************** *** 74,75 **** --- 86,88 ---- dump(fn1, fn2) print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2) + print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2) Index: mboxtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** mboxtest.py 13 Sep 2002 16:26:58 -0000 1.4 --- mboxtest.py 14 Sep 2002 00:03:51 -0000 1.5 *************** *** 166,169 **** --- 166,170 ---- for iham, ispam in testsets: + driver.new_classifier() driver.train(mbox(ham, iham), mbox(spam, ispam)) for ihtest, istest in testsets: Index: rates.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rates.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** rates.py 12 Sep 2002 19:35:14 -0000 1.3 --- rates.py 14 Sep 2002 00:03:51 -0000 1.4 *************** *** 2,6 **** """ ! rates.py basename Assuming that file --- 2,6 ---- """ ! rates.py basename ... Assuming that file *************** *** 19,38 **** """ - import re import sys """ ! Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams ! testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams ! false positive: 0.025 ! false negative: 1.34545454545 ! new false positives: ['Data/Ham/Set2/66645.txt'] """ - pat1 = re.compile(r'\s*Training on ').match - pat2 = re.compile(r'\s+false (positive|negative): (.*)').match - pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match def doit(basename): ifile = file(basename + '.txt') oname = basename + 's.txt' ofile = file(oname, 'w') --- 19,38 ---- """ import sys """ ! -> Training on Data/Ham/Set2-3 & Data/Spam/Set2-3 ... 8000 hams & 5500 spams ! -> Predicting Data/Ham/Set1 & Data/Spam/Set1 ... ! -> tested 4000 hams & 2750 spams against 8000 hams & 5500 spams ! -> false positive %: 0.025 ! -> false negative %: 0.327272727273 ! -> 1 new false positives """ def doit(basename): ifile = file(basename + '.txt') + interesting = filter(lambda line: line.startswith('-> '), ifile) + ifile.close() + oname = basename + 's.txt' ofile = file(oname, 'w') *************** *** 44,83 **** print >> ofile, msg ! nfn = nfp = 0 ntrainedham = ntrainedspam = 0 ! for line in ifile: ! "Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams" ! m = pat1(line) ! if m: ! dump(line[:-1]) ! fields = line.split() ntrainedham += int(fields[-5]) ntrainedspam += int(fields[-2]) continue ! 
"false positive: 0.025" ! "false negative: 1.34545454545" ! m = pat2(line) ! if m: ! kind, guts = m.groups() ! guts = float(guts) if kind == 'positive': ! lastval = guts else: ! dump(' %7.3f %7.3f' % (lastval, guts)) continue ! "new false positives: ['Data/Ham/Set2/66645.txt']" ! m = pat3(line) ! if m: # note that it doesn't match at all if the list is "[]" ! kind, guts = m.groups() ! n = len(guts.split()) if kind == 'positives': ! nfp += n else: ! nfn += n ! dump('total false pos', nfp, nfp * 1e2 / ntrainedham) ! dump('total false neg', nfn, nfn * 1e2 / ntrainedspam) for name in sys.argv[1:]: --- 44,91 ---- print >> ofile, msg ! ntests = nfn = nfp = 0 ! sumfnrate = sumfprate = 0.0 ntrainedham = ntrainedspam = 0 ! ! for line in interesting: ! dump(line[:-1]) ! fields = line.split() ! ! # 0 1 2 3 4 5 6 -5 -4 -3 -2 -1 ! #-> tested 4000 hams & 2750 spams against 8000 hams & 5500 spams ! if line.startswith('-> tested '): ntrainedham += int(fields[-5]) ntrainedspam += int(fields[-2]) + ntests += 1 continue ! # 0 1 2 3 ! # -> false positive %: 0.025 ! # -> false negative %: 0.327272727273 ! if line.startswith('-> false '): ! kind = fields[3] ! percent = float(fields[-1]) if kind == 'positive': ! sumfprate += percent ! lastval = percent else: ! sumfnrate += percent ! dump(' %7.3f %7.3f' % (lastval, percent)) continue ! # 0 1 2 3 4 5 ! # -> 1 new false positives ! if fields[3] == 'new' and fields[4] == 'false': ! kind = fields[-1] ! count = int(fields[2]) if kind == 'positives': ! nfp += count else: ! nfn += count ! dump('total unique false pos', nfp) ! dump('total unique false neg', nfn) ! dump('average fp %', sumfprate / ntests) ! dump('average fn %', sumfnrate / ntests) for name in sys.argv[1:]: Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** timcv.py 13 Sep 2002 20:35:37 -0000 1.3 --- timcv.py 14 Sep 2002 00:03:51 -0000 1.4 *************** *** 77,85 **** d = Driver() ! # Train it on all the data. ! d.train(MsgStream("%s-%d" % (hamdirs[0], nsets), hamdirs), ! MsgStream("%s-%d" % (spamdirs[0], nsets), spamdirs)) ! # Now run nsets times, removing one pair per run. for i in range(nsets): h = hamdirs[i] --- 77,85 ---- d = Driver() ! # Train it on all sets except the first. ! d.train(MsgStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]), ! MsgStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:])) ! # Now run nsets times, predicting pair i against all except pair i. for i in range(nsets): h = hamdirs[i] *************** *** 87,93 **** hamstream = MsgStream(h, [h]) spamstream = MsgStream(s, [s]) ! d.forget(hamstream, spamstream) d.test(hamstream, spamstream) d.finishtest() d.alldone() --- 87,103 ---- hamstream = MsgStream(h, [h]) spamstream = MsgStream(s, [s]) ! ! if i > 0: ! # Forget this set. ! d.untrain(hamstream, spamstream) ! ! # Predict this set. d.test(hamstream, spamstream) d.finishtest() + + if i < nsets - 1: + # Add this set back in. + d.train(hamstream, spamstream) + d.alldone() Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.25 retrieving revision 1.26 diff -C2 -d -r1.25 -r1.26 *** timtest.py 13 Sep 2002 18:48:42 -0000 1.25 --- timtest.py 14 Sep 2002 00:03:51 -0000 1.26 *************** *** 74,78 **** random.seed(hash(directory)) random.shuffle(all) ! 
for fname in all[-1500:-1000:]: yield Msg(directory, fname) --- 74,78 ---- random.seed(hash(directory)) random.shuffle(all) ! for fname in all[-1500:-1300:]: yield Msg(directory, fname) *************** *** 89,92 **** --- 89,93 ---- d = Driver() for spamdir, hamdir in spamhamdirs: + d.new_classifier() d.train(MsgStream(hamdir), MsgStream(spamdir)) for sd2, hd2 in spamhamdirs: From tim_one@users.sourceforge.net Sat Sep 14 04:32:49 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 13 Sep 2002 20:32:49 -0700 Subject: [Spambayes-checkins] spambayes rebal.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv832 Modified Files: rebal.py Log Message: migrate(): If there's a file extension, preserve it instead of blowing up (files w/o extensions are a PITA on Windows). Also replaced the renaming strategy w/ a randomized scheme that should run much faster. Index: rebal.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rebal.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** rebal.py 12 Sep 2002 19:33:54 -0000 1.2 --- rebal.py 14 Sep 2002 03:32:47 -0000 1.3 *************** *** 18,22 **** must already exist. ! Example: rebal.py -r reservoir -s Set -n 300 --- 18,22 ---- must already exist. ! Example: rebal.py -r reservoir -s Set -n 300 *************** *** 64,83 **** -Q - be quiet and don't confirm moves """ % globals() ! def migrate(f, dir, verbose): """rename f into dir, making sure to avoid name clashes.""" base = os.path.split(f)[-1] ! if os.path.exists(os.path.join(dir,base)): ! # this path can get slow if we have a lot of name collisions ! # but we should rarely encounter that case (so he says smugly) ! reslist = [int(n) for n in os.listdir(dir)] ! reslist.sort() ! out = os.path.join(dir, "%d"%(reslist[-1]+1)) ! else: ! out = os.path.join(dir, base) if verbose: print "moving", f, "to", out os.rename(f, out) ! def main(args): nperdir = NPERDIR --- 64,80 ---- -Q - be quiet and don't confirm moves """ % globals() ! def migrate(f, dir, verbose): """rename f into dir, making sure to avoid name clashes.""" base = os.path.split(f)[-1] ! out = os.path.join(dir, base) ! while os.path.exists(out): ! basename, ext = os.path.splitext(base) ! digits = random.randrange(100000000) ! out = os.path.join(dir, str(digits) + ext) if verbose: print "moving", f, "to", out os.rename(f, out) ! def main(args): nperdir = NPERDIR *************** *** 86,90 **** verbose = VERBOSE confirm = CONFIRM ! try: opts, args = getopt.getopt(args, "r:s:n:vqcQh") --- 83,87 ---- verbose = VERBOSE confirm = CONFIRM ! try: opts, args = getopt.getopt(args, "r:s:n:vqcQh") From tim_one@users.sourceforge.net Sat Sep 14 21:08:09 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 14 Sep 2002 13:08:09 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.13,1.14 timcv.py,1.4,1.5 tokenizer.py,1.21,1.22 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11259 Modified Files: Options.py timcv.py tokenizer.py Log Message: New option [Tokenizer]ignore_redundant_html, defaulting to False. This may change results! Read the comments in tokenizer.py and Options.py. 
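To make the new option concrete before the diffs: a multipart/alternative
spam typically carries the same content twice, once as text/plain and once
as text/html.  Here is a minimal sketch of the intended effect (modern
Python; the stdlib email API has been rearranged since this checkin, and
pick_text_parts is a made-up name for illustration -- the real logic lives
in tokenizer.textparts() below, which is more careful and only drops HTML
inside a multipart/alternative that also offers text/plain):

    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    def pick_text_parts(msg, ignore_redundant_html=True):
        # Keep the text/* leaves; optionally drop the text/html twin
        # when a text/plain part is available.
        parts = [p for p in msg.walk()
                 if p.get_content_maintype() == 'text']
        if ignore_redundant_html and any(p.get_content_type() == 'text/plain'
                                         for p in parts):
            parts = [p for p in parts
                     if p.get_content_type() != 'text/html']
        return parts

    msg = MIMEMultipart('alternative')
    msg.attach(MIMEText('make money fast', 'plain'))
    msg.attach(MIMEText('<b>make money fast</b>', 'html'))
    print([p.get_content_type() for p in pick_text_parts(msg)])
    # prints ['text/plain']: the tag-heavy HTML twin never reaches
    # the classifier when the option is turned on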
Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** Options.py 13 Sep 2002 16:26:58 -0000 1.13 --- Options.py 14 Sep 2002 20:08:07 -0000 1.14 *************** *** 13,21 **** defaults = """ [Tokenizer] ! # By default, tokenizer.Tokenizer.tokenize_headers() strips HTML tags ! # from pure text/html messages. Set to True to retain HTML tags in ! # this case. retain_pure_html_tags: False # Generate tokens just counting the number of instances of each kind of # header line, in a case-sensitive way. --- 13,33 ---- defaults = """ [Tokenizer] ! # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags ! # from pure text/html messages. Set true to retain HTML tags in this ! # case. On the c.l.py corpus, it helps to set this true because any ! # sign of HTML is so despised on tech lists; however, the advantage ! # of setting it true eventually vanishes even there given enough ! # training data. If you set this true, you should almost certainly set ! # ignore_redundant_html true too. retain_pure_html_tags: False + # If true, when a multipart/alternative has both text/plain and text/html + # sections, the text/html section is ignored. That's likely a dubious + # idea in general, so false is likely a better idea here. In the c.l.py + # tests, it helped a lot when retain_pure_html_tags was true (in that case, + # keeping the HTML tags in the "redundant" HTML was almost certain to score + # the multipart/alternative as spam, regardless of content). + ignore_redundant_html: False + # Generate tokens just counting the number of instances of each kind of # header line, in a case-sensitive way. *************** *** 116,119 **** --- 128,132 ---- all_options = { 'Tokenizer': {'retain_pure_html_tags': boolean_cracker, + 'ignore_redundant_html': boolean_cracker, 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** timcv.py 14 Sep 2002 00:03:51 -0000 1.4 --- timcv.py 14 Sep 2002 20:08:07 -0000 1.5 *************** *** 67,70 **** --- 67,80 ---- yield Msg(directory, fname) + def xproduce(self): + import random + keep = 'Spam' in self.directories[0] and 300 or 300 + for directory in self.directories: + all = os.listdir(directory) + random.seed(hash(max(all)) ^ 0x12345678) # reproducible across calls + random.shuffle(all) + for fname in all[:keep]: + yield Msg(directory, fname) + def __iter__(self): return self.produce() Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** tokenizer.py 13 Sep 2002 02:40:50 -0000 1.21 --- tokenizer.py 14 Sep 2002 20:08:07 -0000 1.22 *************** *** 470,473 **** --- 470,482 ---- # of c.l.py traffic. Again, this should be revisited if the f-n rate is # slashed again. + # + # Later: As the amount of training data increased, the effect of retaining + # HTML tags decreased to insignificance. options.retain_pure_html_tags + # was introduced to control this, and it defaults to False. 
+ # + # Later: The decision to ignore "redundant" HTML is also dubious, since + # the text/plain and text/html alternatives may have entirely different + # content. options.ignore_redundant_html was introduced to control this, + # and it defaults to False. ############################################################################## *************** *** 492,531 **** ! # Find all the text components of the msg. There's no point decoding ! # binary blobs (like images). If a multipart/alternative has both plain ! # text and HTML versions of a msg, ignore the HTML part: HTML decorations ! # have monster-high spam probabilities, and innocent newbies often post ! # using HTML. ! def textparts(msg): ! text = Set() ! redundant_html = Set() ! for part in msg.walk(): ! if part.get_content_type() == 'multipart/alternative': ! # Descend this part of the tree, adding any redundant HTML text ! # part to redundant_html. ! htmlpart = textpart = None ! stack = part.get_payload()[:] ! while stack: ! subpart = stack.pop() ! ctype = subpart.get_content_type() ! if ctype == 'text/plain': ! textpart = subpart ! elif ctype == 'text/html': ! htmlpart = subpart ! elif ctype == 'multipart/related': ! stack.extend(subpart.get_payload()) ! if textpart is not None: ! text.add(textpart) ! if htmlpart is not None: ! redundant_html.add(htmlpart) ! elif htmlpart is not None: ! text.add(htmlpart) ! elif part.get_content_maintype() == 'text': ! text.add(part) ! return text - redundant_html url_re = re.compile(r""" --- 501,548 ---- + # textparts(msg) returns a set containing all the text components of msg. + # There's no point decoding binary blobs (like images). ! if options.ignore_redundant_html: ! # If a multipart/alternative has both plain text and HTML versions of a ! # msg, ignore the HTML part: HTML decorations have monster-high spam ! # probabilities, and innocent newbies often post using HTML. ! def textparts(msg): ! text = Set() ! redundant_html = Set() ! for part in msg.walk(): ! if part.get_content_type() == 'multipart/alternative': ! # Descend this part of the tree, adding any redundant HTML text ! # part to redundant_html. ! htmlpart = textpart = None ! stack = part.get_payload()[:] ! while stack: ! subpart = stack.pop() ! ctype = subpart.get_content_type() ! if ctype == 'text/plain': ! textpart = subpart ! elif ctype == 'text/html': ! htmlpart = subpart ! elif ctype == 'multipart/related': ! stack.extend(subpart.get_payload()) ! if textpart is not None: ! text.add(textpart) ! if htmlpart is not None: ! redundant_html.add(htmlpart) ! elif htmlpart is not None: ! text.add(htmlpart) ! elif part.get_content_maintype() == 'text': ! text.add(part) ! return text - redundant_html ! ! else: ! # Use all text parts. If a text/plain and text/html part happen to ! # have redundant content, so it goes. ! def textparts(msg): ! return Set(filter(lambda part: part.get_content_maintype() == 'text', ! msg.walk())) url_re = re.compile(r""" From tim_one@users.sourceforge.net Sat Sep 14 23:01:46 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 14 Sep 2002 15:01:46 -0700 Subject: [Spambayes-checkins] spambayes timcv.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8230 Modified Files: timcv.py Log Message: Introduced new optional arguments to use only part of the ham and spam in each set. This helps those with larger corpora to run tests as if they had less. 
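Before the diff, the idea in miniature (illustrative names, not from the
checkin): to test as if you had less data, shuffle each directory listing
with a seed derived from the data plus a run-wide constant, so repeated
passes over the same directory within one run select the identical subset.
That matters because the cross-validation driver streams each set several
times (train, untrain, test).  Note the scheme leaned on Python 2's
deterministic hash() for strings; modern Python randomizes string hashing
per process unless PYTHONHASHSEED is pinned.

    import os, random

    def stable_subset(directory, keep, seed):
        # Reproducible within a run: the same directory, keep count and
        # seed pick the same files on every call.
        names = os.listdir(directory)
        random.seed(hash(max(names)) ^ seed)
        random.shuffle(names)
        return sorted(names[:keep])

This is why MsgStream below can hand train(), untrain() and test() the very
same --ham-keep/--spam-keep messages without ever storing the chosen subset
anywhere.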
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** timcv.py 14 Sep 2002 20:08:07 -0000 1.5
--- timcv.py 14 Sep 2002 22:01:42 -0000 1.6
***************
*** 4,8 ****
  # A driver for N-fold cross validation.
! """Usage: %(program)s [-h] -n nsets
  Where:
--- 4,8 ----
  # A driver for N-fold cross validation.
! """Usage: %(program)s [options] -n nsets
  Where:
***************
*** 13,16 ****
--- 13,31 ----
      This is required.
+ If you only want to use some of the messages in each set,
+
+ --ham-keep int
+     The maximum number of msgs to use from each Ham set.  The msgs are
+     chosen randomly.  See also the -s option.
+
+ --spam-keep int
+     The maximum number of msgs to use from each Spam set.  The msgs are
+     chosen randomly.  See also the -s option.
+
+ -s int
+     A seed for the random number generator.  Has no effect unless
+     at least one of {--ham-keep, --spam-keep} is specified.  If -s
+     isn't specified, the seed is taken from current time.
+
  In addition, an attempt is made to merge bayescustomize.ini into the
  options.  If that exists, it can be used to change the settings in
  Options.options.
***************
*** 19,26 ****
  import os
  import sys
  from Options import options
  from tokenizer import tokenize
! from TestDriver import Driver
  program = sys.argv[0]
--- 34,46 ----
  import os
  import sys
+ import random
  from Options import options
  from tokenizer import tokenize
! import TestDriver
!
! HAMKEEP = None
! SPAMKEEP = None
! SEED = random.randrange(2000000000)
  program = sys.argv[0]
***************
*** 35,38 ****
--- 55,60 ----
  class Msg(object):
+     __slots__ = 'tag', 'guts'
+
      def __init__(self, dir, name):
          path = dir + "/" + name
***************
*** 45,48 ****
--- 67,71 ----
          return tokenize(self.guts)
+     # Compare msgs by their paths; this is appropriate for sets of msgs.
      def __hash__(self):
          return hash(self.tag)
***************
*** 55,61 ****
  class MsgStream(object):
!     def __init__(self, tag, directories):
          self.tag = tag
          self.directories = directories
      def __str__(self):
--- 78,87 ----
  class MsgStream(object):
!     __slots__ = 'tag', 'directories', 'keep'
!
!     def __init__(self, tag, directories, keep=None):
          self.tag = tag
          self.directories = directories
+         self.keep = keep
      def __str__(self):
***************
*** 63,78 ****
      def produce(self):
!         for directory in self.directories:
!             for fname in os.listdir(directory):
!                 yield Msg(directory, fname)
!
!     def xproduce(self):
!         import random
!         keep = 'Spam' in self.directories[0] and 300 or 300
          for directory in self.directories:
              all = os.listdir(directory)
!             random.seed(hash(max(all)) ^ 0x12345678) # reproducible across calls
              random.shuffle(all)
!             for fname in all[:keep]:
                  yield Msg(directory, fname)
--- 89,107 ----
      def produce(self):
!         if self.keep is None:
!             for directory in self.directories:
!                 for fname in os.listdir(directory):
!                     yield Msg(directory, fname)
!             return
!         # We only want part of the msgs.  Shuffle each directory list, but
!         # in such a way that we'll get the same result each time this is
!         # called on the same directory list.
          for directory in self.directories:
              all = os.listdir(directory)
!             random.seed(hash(max(all)) ^ SEED) # reproducible across calls
              random.shuffle(all)
!             del all[self.keep:]
!             all.sort()  # seems to speed access on Win98!
!
for fname in all: yield Msg(directory, fname) *************** *** 80,83 **** --- 109,120 ---- return self.produce() + class HamStream(MsgStream): + def __init__(self, tag, directories): + MsgStream.__init__(self, tag, directories, HAMKEEP) + + class SpamStream(MsgStream): + def __init__(self, tag, directories): + MsgStream.__init__(self, tag, directories, SPAMKEEP) + def drive(nsets): print options.display() *************** *** 86,93 **** spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)] ! d = Driver() # Train it on all sets except the first. ! d.train(MsgStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]), ! MsgStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:])) # Now run nsets times, predicting pair i against all except pair i. --- 123,130 ---- spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)] ! d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]), ! SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:])) # Now run nsets times, predicting pair i against all except pair i. *************** *** 95,100 **** h = hamdirs[i] s = spamdirs[i] ! hamstream = MsgStream(h, [h]) ! spamstream = MsgStream(s, [s]) if i > 0: --- 132,137 ---- h = hamdirs[i] s = spamdirs[i] ! hamstream = HamStream(h, [h]) ! spamstream = SpamStream(s, [s]) if i > 0: *************** *** 112,124 **** d.alldone() ! if __name__ == "__main__": import getopt try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:') except getopt.error, msg: usage(1, msg) ! nsets = None for opt, arg in opts: if opt == '-h': --- 149,163 ---- d.alldone() ! def main(): ! global SEED, HAMKEEP, SPAMKEEP import getopt try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:s:', ! ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) ! nsets = seed = None for opt, arg in opts: if opt == '-h': *************** *** 126,129 **** --- 165,174 ---- elif opt == '-n': nsets = int(arg) + elif opt == '-s': + seed = int(arg) + elif opt == '--ham-keep': + HAMKEEP = int(arg) + elif opt == '--spam-keep': + SPAMKEEP = int(arg) if args: *************** *** 131,134 **** --- 176,184 ---- if nsets is None: usage(1, "-n is required") + if seed is not None: + SEED = seed drive(nsets) + + if __name__ == "__main__": + main() From tim_one@users.sourceforge.net Sat Sep 14 23:18:27 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 14 Sep 2002 15:18:27 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.17,1.18 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12179 Modified Files: README.txt Log Message: Various comment updates. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** README.txt 14 Sep 2002 00:03:51 -0000 1.17 --- README.txt 14 Sep 2002 22:18:24 -0000 1.18 *************** *** 105,108 **** --- 105,112 ---- the script for an operational definition of "loose". + rebal.py + Evens out the number of messages in "standard" test data folders (see + below). Needs generalization (e.g., Ham and 4000 are hardcoded now). + mboxcount.py Count the number of messages (both parseable and unparseable) in *************** *** 117,127 **** Like splitn.py (above), but splits an mbox into one message per file in "the standard" directory structure (see below). This does an ! 
approximate split; rebal.by (below) can be used afterwards to even out
  the number of messages per folder.
- rebal.py
-     Evens out the number of messages in "standard" test data folders (see
-     below).  Needs generalization (e.g., Ham and 4000 are hardcoded now).
-
  Standard Test Data Setup
--- 121,127 ----
  Like splitn.py (above), but splits an mbox into one message per file in
  "the standard" directory structure (see below).  This does an
! approximate split; rebal.py (above) can be used afterwards to even out
  the number of messages per folder.
  Standard Test Data Setup
***************
*** 133,156 ****
  random when testing reveals spam mistakenly called ham (and vice versa),
  etc -- even pasting examples into email is much easier when it's one msg
! per file (and the test driver makes it easy to print a msg's file path).
  The directory structure under my spambayes directory looks like so:
- [But due to a better testing infrastructure, I'm going to spread this
- across 20 subdirectories under Spam and under Ham, and use groups
- of 10 for 10-fold cross validation]
  Data/
      Spam/
!         Set1/        (contains 2750 spam .txt files)
          Set2/        ""
          Set3/        ""
          Set4/        ""
          Set5/        ""
      Ham/
!         Set1/        (contains 4000 ham .txt files)
          Set2/        ""
          Set3/        ""
          Set4/        ""
          Set5/        ""
      reservoir/       (contains "backup ham")
--- 133,163 ----
  random when testing reveals spam mistakenly called ham (and vice versa),
  etc -- even pasting examples into email is much easier when it's one msg
! per file (and the test drivers make it easy to print a msg's file path).
  The directory structure under my spambayes directory looks like so:
  Data/
      Spam/
!         Set1/        (contains 1375 spam .txt files)
          Set2/        ""
          Set3/        ""
          Set4/        ""
          Set5/        ""
+         Set6/        ""
+         Set7/        ""
+         Set8/        ""
+         Set9/        ""
+         Set10/       ""
      Ham/
!         Set1/        (contains 2000 ham .txt files)
          Set2/        ""
          Set3/        ""
          Set4/        ""
          Set5/        ""
+         Set6/        ""
+         Set7/        ""
+         Set8/        ""
+         Set9/        ""
+         Set10/       ""
      reservoir/       (contains "backup ham")
***************
*** 159,166 ****
  want at least a few hundred messages in each one.  The "reservoir" directory
  contains a few thousand other random hams.  When a ham is found that's
! really spam, I delete it, and then the rebal.py utility moves in a message
! at random from the reservoir to replace it.  If I had it to do over
! again, I think I'd move such spam into a Spam set (chosen at random),
! instead of deleting it.
  The hams are 20,000 msgs selected at random from a python-list archive.
--- 166,171 ----
  want at least a few hundred messages in each one.  The "reservoir" directory
  contains a few thousand other random hams.  When a ham is found that's
! really spam, move it into a spam directory, and then the rebal.py utility
! moves in a random message from the reservoir to replace it.
  The hams are 20,000 msgs selected at random from a python-list archive.
***************
*** 171,176 ****
  The sets are grouped into pairs in the obvious way: Spam/Set1 with
  Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
! that pair, then runs predictions on each of the other 4 pairs.  In effect,
! it's a 5x5 test grid, skipping the diagonal.
  There's no particular reason to avoid predicting against the same set
  trained on, except that it takes more time and seems the least interesting
  thing to try.
--- 176,181 ----
  The sets are grouped into pairs in the obvious way: Spam/Set1 with
  Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
! that pair, then runs predictions on each of the other pairs.  In effect,
! it's a NxN test grid, skipping the diagonal.
There's no particular reason to avoid predicting against the same set
  trained on, except that it takes more time and seems the least interesting
  thing to try.
***************
*** 178,182 ****
  Later, support for N-fold cross validation testing was added, which allows
  more accurate measurement of error rates with smaller amounts of training
! data.  That's recommended now.
  CAUTION: The partitioning of your corpora across directories should
--- 183,189 ----
  Later, support for N-fold cross validation testing was added, which allows
  more accurate measurement of error rates with smaller amounts of training
! data.  That's recommended now.  timcv.py is to cross-validation testing
! as the older timtest.py is to grid testing.  timcv.py has grown additional
! arguments to allow using only a random subset of messages in each Set.
  CAUTION: The partitioning of your corpora across directories should

From tim_one@users.sourceforge.net  Sun Sep 15 01:01:50 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sat, 14 Sep 2002 17:01:50 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.14,1.15 classifier.py,1.9,1.10
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6591

Modified Files:
	Options.py classifier.py
Log Message:
New bool option [Classifier]adjust_probs_by_evidence_mass.  See the
mailing list for details.  By default, this is turned off.

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** Options.py 14 Sep 2002 20:08:07 -0000 1.14
--- Options.py 15 Sep 2002 00:01:48 -0000 1.15
***************
*** 119,122 ****
--- 119,126 ----
  max_discriminators: 16
+
+ # Speculative change to allow giving probabilities more weight the more
+ # messages went into computing them.
+ adjust_probs_by_evidence_mass: False
  """
***************
*** 152,155 ****
--- 156,160 ----
  'unknown_spamprob': float_cracker,
  'max_discriminators': int_cracker,
+ 'adjust_probs_by_evidence_mass': boolean_cracker,
  },
  }

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** classifier.py 13 Sep 2002 19:46:41 -0000 1.9
--- classifier.py 15 Sep 2002 00:01:48 -0000 1.10
***************
*** 547,550 ****
--- 547,551 ----
  nham = float(self.nham or 1)
  nspam = float(self.nspam or 1)
+ fiddle = options.adjust_probs_by_evidence_mass
  for word,record in self.wordinfo.iteritems():
      # Compute prob(msg is spam | msg contains word).
***************
*** 560,570 ****
  prob = MAX_SPAMPROB
!
! ## if prob != 0.5:
! ##     confbias = 0.01 / (record.hamcount + record.spamcount)
! ##     if prob > 0.5:
! ##         prob = max(0.5, prob - confbias)
! ##     else:
! ##         prob = min(0.5, prob + confbias)
  if record.spamprob != prob:
--- 561,581 ----
  prob = MAX_SPAMPROB
! if fiddle:
!     # Suppose two clues have spamprob 0.99.  Which one is better?
!     # One reasonable guess is that it's the one derived from the
!     # most data.  This code fiddles non-0.5 probabilities by
!     # shrinking their distance to 0.5, but shrinking less the
!     # more evidence went into computing them.  Note that if this
!     # proves to work, it should allow getting rid of the
!     # "cancelling evidence" complications in spamprob()
!     # (two probs exactly the same distance from 0.5 are far
!     # less common after this transformation; instead, spamprob()
!
# will pick up on the clues with the most evidence backing ! # them up). ! dist = prob - 0.5 ! if dist: ! sum = float(record.hamcount + record.spamcount) ! dist *= sum / (sum + 1.0) ! prob = 0.5 + dist if record.spamprob != prob: From tim_one@users.sourceforge.net Sun Sep 15 08:45:33 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 15 Sep 2002 00:45:33 -0700 Subject: [Spambayes-checkins] spambayes classifier.py,1.10,1.11 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8018 Modified Files: classifier.py Log Message: update_probabilities: rearranged the base computation to make more sense, and refined the optional "evidence mass" fiddling. To try this as intended, you have to change *four* classifier options at the same time: [Classifier] adjust_probs_by_evidence_mass: True min_spamprob: 0.001 max_spamprob: 0.999 hambias: 1.5 See discussion on the mailing list. Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** classifier.py 15 Sep 2002 00:01:48 -0000 1.10 --- classifier.py 15 Sep 2002 07:45:31 -0000 1.11 *************** *** 550,557 **** for word,record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). ! hamcount = HAMBIAS * record.hamcount ! spamcount = SPAMBIAS * record.spamcount ! hamratio = min(1.0, hamcount / nham) ! spamratio = min(1.0, spamcount / nspam) prob = spamratio / (hamratio + spamratio) --- 550,557 ---- for word,record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). ! hamcount = min(HAMBIAS * record.hamcount, nham) ! spamcount = min(SPAMBIAS * record.spamcount, nspam) ! hamratio = hamcount / nham ! spamratio = spamcount / nspam prob = spamratio / (hamratio + spamratio) *************** *** 574,581 **** # them up). dist = prob - 0.5 ! if dist: ! sum = float(record.hamcount + record.spamcount) ! dist *= sum / (sum + 1.0) ! prob = 0.5 + dist if record.spamprob != prob: --- 574,580 ---- # them up). dist = prob - 0.5 ! sum = hamcount + spamcount ! dist *= sum / (sum + 0.1) ! prob = 0.5 + dist if record.spamprob != prob: From richiehindle@users.sourceforge.net Mon Sep 16 08:57:22 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Mon, 16 Sep 2002 00:57:22 -0700 Subject: [Spambayes-checkins] spambayes pop3proxy.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5459 Added Files: pop3proxy.py Log Message: pop3proxy.py is a spam-classifying POP3 proxy, plus associated test code. --- NEW FILE: pop3proxy.py --- #!/usr/bin/env python # pop3proxy is released under the terms of the following MIT-style license: # # Copyright (c) Entrian Solutions 2002 # # Permission is hereby granted, free of charge, to any person obtaining a # copy of this software and associated documentation files (the "Software"), # to deal in the Software without restriction, including without limitation # the rights to use, copy, modify, merge, publish, distribute, sublicense, # and/or sell copies of the Software, and to permit persons to whom the # Software is furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in # all copies or substantial portions of the Software. 
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

"""A POP3 proxy designed to work with classifier.py, to add an
X-Bayes-Score header to each incoming email.  The header gives a floating
point number between 0.00 and 1.00, to two decimal places.

You point pop3proxy at your POP3 server, and configure your email client
to collect mail from the proxy and filter on the X-Bayes-Score header.

Usage:

    pop3proxy.py [options] <server> [<port>]

        <server> is the name of your real POP3 server
        <port> is the port number of your real POP3 server, which
            defaults to 110.

        options (the same as hammie):
            -p FILE : use the named data file
            -d      : the file is a DBM file rather than a pickle

    pop3proxy -t
        Runs a test POP3 server on port 8110; useful for testing.

    pop3proxy -h
        Displays this help message.

For safety, and to help debugging, the whole POP3 conversation is
written out to _pop3proxy.log for each run.
"""

import sys, re, operator, errno, getopt, cPickle, socket, asyncore, asynchat
import classifier, tokenizer, hammie
from classifier import GrahamBayes, WordInfo   # So we can unpickle these.

HEADER_FORMAT = 'X-Bayes-Score: %1.2f\r\n'
HEADER_EXAMPLE = 'X-Bayes-Score: 0.12\r\n'


class Listener( asyncore.dispatcher ):
    """Listens for incoming socket connections and spins off dispatchers
    created by a factory callable."""

    def __init__( self, port, factory, factoryArgs=(),
                  socketMap=asyncore.socket_map ):
        asyncore.dispatcher.__init__( self, map=socketMap )
        self.socketMap = socketMap
        self.factory = factory
        self.factoryArgs = factoryArgs
        s = socket.socket( socket.AF_INET, socket.SOCK_STREAM )
        s.setblocking( False )
        self.set_socket( s, socketMap )
        self.set_reuse_addr()
        self.bind( ( '', port ) )
        self.listen( 5 )

    def handle_accept( self ):
        clientSocket, clientAddress = self.accept()
        args = [ clientSocket ] + list( self.factoryArgs )
        if self.socketMap != asyncore.socket_map:
            self.factory( *args, **{ 'socketMap': self.socketMap } )
        else:
            self.factory( *args )


class POP3ProxyBase( asynchat.async_chat ):
    """An async dispatcher that understands POP3 and proxies to a POP3
    server, calling `self.onTransaction( request, response )` for each
    transaction.  Responses are not un-byte-stuffed before reaching
    self.onTransaction() (they probably should be for a totally generic
    POP3ProxyBase class, but BayesProxy doesn't need it and it would mean
    re-stuffing them afterwards).  self.onTransaction() should return the
    response to pass back to the email client - the response can be the
    verbatim response or a processed version of it.  The special command
    'KILL' kills it (passing a 'QUIT' command to the server)."""

    def __init__( self, clientSocket, serverName, serverPort ):
        asynchat.async_chat.__init__( self, clientSocket )
        self.request = ''
        self.isClosing = False
        self.set_terminator( '\r\n' )
        serverSocket = socket.socket( socket.AF_INET, socket.SOCK_STREAM )
        serverSocket.connect( ( serverName, serverPort ) )
        self.serverFile = serverSocket.makefile()
        self.push( self.serverFile.readline() )

    def handle_connect( self ):
        """Suppress the asyncore "unhandled connect event" warning."""
        pass

    def onTransaction( self, command, args, response ):
        """Override this.
Takes the raw request and the response, and returns the (possibly processed) response to pass back to the email client.""" raise NotImplementedError def isMultiline( self, command, args ): """Returns True if the given request should get a multiline response (assuming the response is positive).""" if command in [ 'USER', 'PASS', 'APOP', 'QUIT', 'STAT', 'DELE', 'NOOP', 'RSET', 'KILL' ]: return False elif command in [ 'RETR', 'TOP' ]: return True elif command in [ 'LIST', 'UIDL' ]: return len( args ) == 0 else: # Assume that unknown commands will get an error response. return False def readResponse( self, command, args ): """Reads the POP3 server's response. Also sets self.isClosing to True if the server closes the socket, which tells found_terminator() to close when the response has been sent.""" isMulti = self.isMultiline( command, args ) responseLines = [] isFirstLine = True while True: line = self.serverFile.readline() if not line: # The socket has been closed by the server, probably by QUIT. self.isClosing = True break elif not isMulti or ( isFirstLine and line.startswith( '-ERR' ) ): # A single-line response. responseLines.append( line ) break elif line == '.\r\n': # The termination line. responseLines.append( line ) break else: # A normal line - append it to the response and carry on. responseLines.append( line ) isFirstLine = False return ''.join( responseLines ) def collect_incoming_data( self, data ): """Asynchat override.""" self.request = self.request + data def found_terminator( self ): """Asynchat override.""" # Send the request to the server and read the reply. # XXX When the response is huge, the email client can time out. # It should read as much as it can from the server, then if the # response is still coming after say 30 seconds, it should classify # the message based on that and send back the headers and the body # so far. Then it should become a simple one-packet-at-a-time proxy # for the rest of the response. if self.request.strip().upper() == 'KILL': self.serverFile.write( 'QUIT\r\n' ) self.serverFile.flush() self.send( "+OK, dying.\r\n" ) self.shutdown( 2 ) self.close() raise SystemExit self.serverFile.write( self.request + '\r\n' ) self.serverFile.flush() if self.request.strip() == '': # Someone just hit the Enter key. command, args = ( '', '' ) else: splitCommand = self.request.strip().split( None, 1 ) command = splitCommand[ 0 ].upper() args = splitCommand[ 1: ] rawResponse = self.readResponse( command, args ) # Pass the request/reply to the subclass and send back its response. cookedResponse = self.onTransaction( command, args, rawResponse ) self.push( cookedResponse ) self.request = '' # If readResponse() decided that the server had closed its socket, # close this one when the response has been sent. if self.isClosing: self.close_when_done() def handle_error( self ): """Let SystemExit cause an exit.""" type, v, t = sys.exc_info() if type == SystemExit: raise else: asynchat.async_chat.handle_error( self ) class BayesProxyListener( Listener ): """Listens for incoming email client connections and spins off BayesProxy objects to serve them.""" def __init__( self, serverName, serverPort, proxyPort, bayes ): proxyArgs = ( serverName, serverPort, bayes ) Listener.__init__( self, proxyPort, BayesProxy, proxyArgs ) class BayesProxy( POP3ProxyBase ): """Proxies between an email client and a POP3 server, inserting X-Bayes-Score headers. It acts on the following POP3 commands: o STAT: o Adds the size of all the X-Bayes-Score headers to the maildrop size. 
o LIST: o With no message number: adds the size of an X-Bayes-Score header to the message size for each message in the scan listing. o With a message number: adds the size of an X-Bayes-Score header to the message size. o RETR: o Adds the X-Bayes-Score header based on the raw headers and body of the message. o TOP: o Adds the X-Bayes-Score header based on the raw headers and as much of the body as the TOP command retrieves. This can mean that the header might have a different value for different calls to TOP, or for calls to TOP vs. calls to RETR. I'm assuming that the email client will either not make multiple calls, or will cope with the headers being different. """ def __init__( self, clientSocket, serverName, serverPort, bayes ): # Open the log file *before* calling __init__ for the base class, # 'cos that might call send or recv. self.bayes = bayes self.logFile = open( '_pop3proxy.log', 'wb' ) POP3ProxyBase.__init__( self, clientSocket, serverName, serverPort ) self.handlers = { 'STAT': self.onStat, 'LIST': self.onList, 'RETR': self.onRetr, 'TOP': self.onTop } def send( self, data ): """Logs the data to the log file.""" self.logFile.write( data ) self.logFile.flush() return POP3ProxyBase.send( self, data ) def recv( self, size ): """Logs the data to the log file.""" data = POP3ProxyBase.recv( self, size ) self.logFile.write( data ) self.logFile.flush() return data def onTransaction( self, command, args, response ): """Takes the raw request and response, and returns the (possibly processed) response to pass back to the email client.""" handler = self.handlers.get( command, self.onUnknown ) return handler( command, args, response ) def onStat( self, command, args, response ): """Adds the size of all the X-Bayes-Score headers to the maildrop size.""" match = re.search( r'^\+OK\s+(\d+)\s+(\d+)(.*)\r\n', response ) if match: count = int( match.group( 1 ) ) size = int( match.group( 2 ) ) + len( HEADER_EXAMPLE ) * count return '+OK %d %d%s\r\n' % ( count, size, match.group( 3 ) ) else: return response def onList( self, command, args, response ): """Adds the size of an X-Bayes-Score header to the message size(s).""" if response.count( '\r\n' ) > 1: # Multiline: all lines but the first contain a message size. lines = response.split( '\r\n' ) outputLines = [ lines[ 0 ] ] for line in lines[ 1: ]: match = re.search( '^(\d+)\s+(\d+)', line ) if match: number = int( match.group( 1 ) ) size = int( match.group( 2 ) ) + len( HEADER_EXAMPLE ) line = "%d %d" % ( number, size ) outputLines.append( line ) return '\r\n'.join( outputLines ) else: # Single line. match = re.search( '^\+OK\s+(\d+)(.*)\r\n', response ) if match: size = int( match.group( 1 ) ) + len( HEADER_EXAMPLE ) return "+OK %d%s\r\n" % ( size, match.group( 2 ) ) else: return response def onRetr( self, command, args, response ): """Adds the X-Bayes-Score header based on the raw headers and body of the message.""" # Use '\n\r?\n' to detect the end of the headers in case of broken # emails that don't use the proper line separators. if re.search( r'\n\r?\n', response ): # Break off the first line, which will be '+OK'. ok, message = response.split( '\n', 1 ) # Now find the spam probability and add the header. prob = self.bayes.spamprob( tokenizer.tokenize( message ) ) headers, body = re.split( r'\n\r?\n', response, 1 ) headers = headers + '\r\n' + HEADER_FORMAT % prob + '\r\n' return headers + body else: # Must be an error response. 
return response def onTop( self, command, args, response ): """Adds the X-Bayes-Score header based on the raw headers and as much of the body as the TOP command retrieves.""" # Easy (but see the caveat in BayesProxy.__doc__). return self.onRetr( command, args, response ) def onUnknown( self, command, args, response ): """Default handler - just returns the server's response verbatim.""" return response def createBayes( pickleName=None, useDB=False ): """Create a GrahamBayes object to score the emails.""" bayes = None if useDB: bayes = hammie.PersistentGrahamBayes( pickleName ) elif pickleName: try: fp = open( pickleName, 'rb' ) except IOError, e: if e.errno <> errno.ENOENT: raise else: print "Loading database...", bayes = cPickle.load( fp ) fp.close() print "Done." if bayes is None: bayes = GrahamBayes() return bayes def main( serverName, serverPort, proxyPort, pickleName, useDB ): """Runs the proxy forever or until a 'KILL' command is received or someone hits Ctrl+Break.""" bayes = createBayes( pickleName, useDB ) BayesProxyListener( serverName, serverPort, proxyPort, bayes ) asyncore.loop() # =================================================================== # Test code. # =================================================================== # One example of spam and one of ham - both are used to train, and are then # classified. Not a good test of the classifier, but a perfectly good test # of the POP3 proxy. The bodies of these came from the spambayes project, # and I added the headers myself because the originals had no headers. spam1 = """From: friend@public.com Subject: Make money fast Hello tim_chandler , Want to save money ? Now is a good time to consider refinancing. Rates are low so you can cut your current payments and save money. http://64.251.22.101/interest/index%38%30%300%2E%68t%6D Take off list on site [s5] """ good1 = """From: chris@example.com Subject: ZPT and DTML Jean Jordaan wrote: > 'Fraid so ;> It contains a vintage dtml-calendar tag. > http://www.zope.org/Members/teyc/CalendarTag > > Hmm I think I see what you mean: one needn't manually pass on the > namespace to a ZPT? Yeah, Page Templates are a bit more clever, sadly, DTML methods aren't :-( Chris """ class TestListener( Listener ): """Listener for TestPOP3Server. Works on port 8110, to co-exist with real POP3 servers.""" def __init__( self, socketMap=asyncore.socket_map ): Listener.__init__( self, 8110, TestPOP3Server, socketMap=socketMap ) class TestPOP3Server( asynchat.async_chat ): """Minimal POP3 server, for testing purposes. Doesn't support TOP or UIDL. USER, PASS, APOP, DELE and RSET simply return "+OK" without doing anything. Also understands the 'KILL' command, to kill it. The mail content is the example messages in classifier.py.""" def __init__( self, clientSocket, socketMap=asyncore.socket_map ): # Grumble: asynchat.__init__ doesn't take a 'map' argument, hence # the two-stage construction. 
asynchat.async_chat.__init__( self ) asynchat.async_chat.set_socket( self, clientSocket, socketMap ) self.maildrop = [ spam1, good1 ] self.set_terminator( '\r\n' ) self.okCommands = [ 'USER', 'PASS', 'APOP', 'NOOP', 'DELE', 'RSET', 'QUIT', 'KILL' ] self.handlers = { 'STAT': self.onStat, 'LIST': self.onList, 'RETR': self.onRetr } self.push( "+OK ready\r\n" ) self.request = '' def handle_connect( self ): """Suppress the asyncore "unhandled connect event" warning.""" pass def collect_incoming_data( self, data ): """Asynchat override.""" self.request = self.request + data def found_terminator( self ): """Asynchat override.""" if ' ' in self.request: command, args = self.request.split( None, 1 ) else: command, args = self.request, '' command = command.upper() if command in self.okCommands: self.push( "+OK (we hope)\r\n" ) if command == 'QUIT': self.close_when_done() if command == 'KILL': raise SystemExit else: handler = self.handlers.get( command, self.onUnknown ) self.push( handler( command, args ) ) self.request = '' def handle_error( self ): """Let SystemExit cause an exit.""" type, v, t = sys.exc_info() if type == SystemExit: raise else: asynchat.async_chat.handle_error( self ) def onStat( self, command, args ): maildropSize = reduce( operator.add, map( len, self.maildrop ) ) maildropSize += len( self.maildrop ) * len( HEADER_EXAMPLE ) return "+OK %d %d\r\n" % ( len( self.maildrop ), maildropSize ) def onList( self, command, args ): if args: number = int( args ) if 0 < number <= len( self.maildrop ): return "+OK %d\r\n" % len( self.maildrop[ number - 1 ] ) else: return "-ERR no such message\r\n" else: returnLines = [ "+OK" ] for messageIndex in range( len( self.maildrop ) ): size = len( self.maildrop[ messageIndex ] ) returnLines.append( "%d %d" % ( messageIndex + 1, size ) ) returnLines.append( "." ) return '\r\n'.join( returnLines ) + '\r\n' def onRetr( self, command, args ): number = int( args ) if 0 < number <= len( self.maildrop ): message = self.maildrop[ number - 1 ] return "+OK\r\n%s\r\n.\r\n" % message else: return "-ERR no such message\r\n" def onUnknown( self, command, args ): return "-ERR Unknown command: '%s'\r\n" % command def test(): """Runs a self-test using TestPOP3Server, a minimal POP3 server that serves the example emails above.""" # Run a proxy and a test server in separate threads with separate # asyncore environments. import threading testServerReady = threading.Event() def runTestServer(): testSocketMap = {} TestListener( socketMap=testSocketMap ) testServerReady.set() asyncore.loop( map=testSocketMap ) def runProxy(): bayes = createBayes() BayesProxyListener( 'localhost', 8110, 8111, bayes ) bayes.learn( tokenizer.tokenize( spam1 ), True ) bayes.learn( tokenizer.tokenize( good1 ), False ) asyncore.loop() threading.Thread( target=runTestServer ).start() testServerReady.wait() threading.Thread( target=runProxy ).start() # Connect to the proxy. proxy = socket.socket( socket.AF_INET, socket.SOCK_STREAM ) proxy.connect( ( 'localhost', 8111 ) ) assert proxy.recv( 100 ) == "+OK ready\r\n" # Stat the mailbox to get the number of messages. proxy.send( "stat\r\n" ) response = proxy.recv( 100 ) count, totalSize = map( int, response.split()[ 1:3 ] ) print "%d messages in test mailbox" % count assert count == 2 # Loop through the messages ensuring that they have X-Bayes-Score # headers. 
for i in range( 1, count+1 ): response = "" proxy.send( "retr %d\r\n" % i ) while response.find( '\n.\r\n' ) == -1: response = response + proxy.recv( 1000 ) headerOffset = response.find( 'X-Bayes-Score' ) assert headerOffset != -1 headerEnd = headerOffset + len( HEADER_EXAMPLE ) header = response[ headerOffset:headerEnd ].strip() print "Message %d: %s" % ( i, header ) # Kill the proxy and the test server. proxy.sendall( "kill\r\n" ) server = socket.socket( socket.AF_INET, socket.SOCK_STREAM ) server.connect( ( 'localhost', 8110 ) ) server.sendall( "kill\r\n" ) # =================================================================== # __main__ driver. # =================================================================== if __name__ == '__main__': # Read the arguments. try: opts, args = getopt.getopt( sys.argv[ 1: ], 'htdp:' ) except getopt.error, msg: print >>sys.stderr, str( msg ) + '\n\n' + __doc__ sys.exit() pickleName = hammie.DEFAULTDB useDB = False runTestServer = False for opt, arg in opts: if opt == '-h': print >>sys.stderr, __doc__ sys.exit() elif opt == '-t': runTestServer = True elif opt == '-d': useDB = True elif opt == '-p': pickleName = arg # Do whatever we've been asked to do... if not opts and not args: print "Running a self-test (use 'pop3proxy -h' for help)" test() print "Self-test passed." # ...else it would have asserted. elif runTestServer: print "Running a test POP3 server on port 8110..." TestListener() asyncore.loop() elif len( args ) == 1: # Named POP3 server, default port. main( args[ 0 ], 110, 110, pickleName, useDB ) elif len( args ) == 2: # Named POP3 server, named port. main( args[ 0 ], int( args[ 1 ] ), 110, pickleName, useDB ) else: print >>sys.stderr, __doc__ From montanaro@users.sourceforge.net Mon Sep 16 18:28:48 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Mon, 16 Sep 2002 10:28:48 -0700 Subject: [Spambayes-checkins] spambayes loosecksum.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25859 Modified Files: loosecksum.py Log Message: fix typo Index: loosecksum.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/loosecksum.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** loosecksum.py 9 Sep 2002 19:23:18 -0000 1.1 --- loosecksum.py 16 Sep 2002 17:28:45 -0000 1.2 *************** *** 79,84 **** return flatten(obj.get_payload()) if isinstance(obj, list): ! return "\n".join([flatten(b) for b in body]) ! raise TypeError, ("unrecognized body type: %s" % type(body)) def generate_checksum(f): --- 79,84 ---- return flatten(obj.get_payload()) if isinstance(obj, list): ! return "\n".join([flatten(b) for b in obj]) ! raise TypeError, ("unrecognized body type: %s" % type(obj)) def generate_checksum(f): From rubiconx@users.sourceforge.net Tue Sep 17 05:49:18 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Mon, 16 Sep 2002 21:49:18 -0700 Subject: [Spambayes-checkins] spambayes runtest.sh,NONE,1.1 README.txt,1.18,1.19 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv26378 Modified Files: README.txt Added Files: runtest.sh Log Message: Added the runtest.sh script, which is supposed to make it easier for rubes like myself to submit useful test results. --- NEW FILE: runtest.sh --- #! /bin/sh -x ## ## runtest.sh -- run some tests for Tim ## ## This does everything you need to test yer data. 
You may want to skip
## the rebal steps if you've recently moved some of your messages
## (because they were in the wrong corpus) or you may suffer my fate and
## get stuck forever re-categorizing email.
##
## Just set up your messages as detailed in README.txt; put them all in
## the reservoir directories, and this script will take care of the
## rest.  Paste the output (also in results.txt) to the mailing list for
## good karma.
##
## Neale Pickett
##

# Number of messages per rebalanced set
RNUM=200

# Number of sets
SETS=5

# Put them all into reservoirs
python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q
python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q

# Rebalance
python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q
python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q

# Clear out .ini file
rm -f bayescustomize.ini

# Run 1
python timcv.py -n $SETS > run1.txt

# New .ini file
cat > bayescustomize.ini <<EOF
...
EOF

# Run 2
python timcv.py -n $SETS > run2.txt

# Generate rates
python rates.py run1 run2 > runrates.txt

# Compare rates
python cmp.py run1s run2s | tee results.txt

Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** README.txt 14 Sep 2002 22:18:24 -0000 1.18
--- README.txt 17 Sep 2002 04:49:16 -0000 1.19
***************
*** 124,127 ****
--- 124,134 ----
  the number of messages per folder.
+ runtest.sh
+     A bourne shell script (for Unix) which will run some test or other.
+     I (Neale) will try to keep this updated to test whatever Tim is
+     currently asking for.  The idea is, if you have a standard directory
+     structure (below), you can run this thing, go have some tea while it
+     works, then paste the output to the spambayes list for good karma.
+
  Standard Test Data Setup

From jhylton@users.sourceforge.net  Tue Sep 17 16:29:48 2002
From: jhylton@users.sourceforge.net (Jeremy Hylton)
Date: Tue, 17 Sep 2002 08:29:48 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.15,1.16 mboxtest.py,1.5,1.6
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29116

Modified Files:
	Options.py mboxtest.py
Log Message:
Add three options for MboxTest.

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** Options.py 15 Sep 2002 00:01:48 -0000 1.15
--- Options.py 17 Sep 2002 15:29:45 -0000 1.16
***************
*** 68,71 ****
--- 68,86 ----
  mine_received_headers: False
+ [MboxTest]
+ # If tokenize_header_words is true, then the header values are
+ # tokenized using the default text tokenize.  The words are tagged
+ # with "header:" where header is the name of the header.
+ tokenize_header_words: False
+ # If tokenize_header_default is True, use the base header tokenization
+ # logic described in the Tokenizer section.
+ tokenize_header_default: True
+
+ # skip_headers is a set of regular expressions describing headers that
+ # should not be tokenized if tokenize_header is True.
+ skip_headers: received
+     date
+     x-.*
+
  [TestDriver]
  # These control various displays in class TestDriver.Driver.
*************** *** 158,161 **** --- 173,180 ---- 'adjust_probs_by_evidence_mass': boolean_cracker, }, + 'MboxTest': {'tokenize_header_words': boolean_cracker, + 'tokenize_header_default': boolean_cracker, + 'skip_headers': ('get', lambda s: Set(s.split())), + }, } Index: mboxtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** mboxtest.py 14 Sep 2002 00:03:51 -0000 1.5 --- mboxtest.py 17 Sep 2002 15:29:45 -0000 1.6 *************** *** 22,25 **** --- 22,26 ---- import mailbox import random + import re from sets import Set import sys *************** *** 28,31 **** --- 29,33 ---- from TestDriver import Driver from timtest import Msg + from Options import options mbox_fmts = {"unix": mailbox.PortableUnixMailbox, *************** *** 37,53 **** class MyTokenizer(Tokenizer): ! skip = {'received': 1, ! 'date': 1, ! 'x-from_': 1, ! } def tokenize_headers(self, msg): ! for k, v in msg.items(): ! k = k.lower() ! if k in self.skip or k.startswith('x-vm'): ! continue ! for w in subject_word_re.findall(v): ! for t in tokenize_word(w): ! yield "%s:%s" % (k, t) class MboxMsg(Msg): --- 39,57 ---- class MyTokenizer(Tokenizer): ! skip = [re.compile(rx) for rx in options.skip_headers] def tokenize_headers(self, msg): ! if options.tokenize_header_words: ! for k, v in msg.items(): ! k = k.lower() ! for rx in self.skip: ! if rx.match(k): ! continue ! for w in subject_word_re.findall(v): ! for t in tokenize_word(w): ! yield "%s:%s" % (k, t) ! if options.tokenize_header_default: ! for tok in Tokenizer.tokenize_headers(self, msg): ! yield tok class MboxMsg(Msg): *************** *** 74,81 **** return "\n".join(lines) ! ## tokenize = MyTokenizer().tokenize def __iter__(self): ! return tokenize(self.guts) class mbox(object): --- 78,85 ---- return "\n".join(lines) ! tokenize = MyTokenizer().tokenize def __iter__(self): ! return self.tokenize(self.guts) class mbox(object): *************** *** 130,134 **** FMT = "unix" ! NSETS = 5 SEED = 101 MAXMSGS = None --- 134,138 ---- FMT = "unix" ! NSETS = 10 SEED = 101 MAXMSGS = None *************** *** 158,176 **** print "spam", spam, nspam ! testsets = [] ! for iham in randindices(nham, NSETS): ! for ispam in randindices(nspam, NSETS): ! testsets.append((sort(iham), sort(ispam))) driver = Driver() ! for iham, ispam in testsets: ! driver.new_classifier() ! driver.train(mbox(ham, iham), mbox(spam, ispam)) ! for ihtest, istest in testsets: ! if (iham, ispam) == (ihtest, istest): ! continue ! driver.test(mbox(ham, ihtest), mbox(spam, istest)) driver.finishtest() driver.alldone() --- 162,188 ---- print "spam", spam, nspam ! ihams = map(tuple, randindices(nham, NSETS)) ! ispams = map(tuple, randindices(nspam, NSETS)) driver = Driver() ! for i in range(1, NSETS): ! driver.train(mbox(ham, ihams[i]), mbox(spam, ispams[i])) ! ! i = 0 ! for iham, ispam in zip(ihams, ispams): ! hams = mbox(ham, iham) ! spams = mbox(spam, ispam) ! ! if i > 0: ! driver.untrain(hams, spams) ! ! 
driver.test(hams, spams) driver.finishtest() + + if i < NSETS - 1: + driver.train(hams, spams) + + i += 1 driver.alldone() From jhylton@users.sourceforge.net Tue Sep 17 18:57:42 2002 From: jhylton@users.sourceforge.net (Jeremy Hylton) Date: Tue, 17 Sep 2002 10:57:42 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.16,1.17 mboxtest.py,1.6,1.7 tokenizer.py,1.22,1.23 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv22911 Modified Files: Options.py mboxtest.py tokenizer.py Log Message: Merge the simple tokenizer from mboxtest.MyTokenizer into the default tokenizer, controlled by the basic_header_tokenize options. This gives good results for my ham/spam collection, cutting the number of false negatives in half without changing the total number of false positives. false positive percentages 3.030 1.527 won -49.60% 0.758 3.053 lost +302.77% 3.030 1.527 won -49.60% 1.515 1.527 lost +0.79% 0.758 0.000 won -100.00% 1.515 2.290 lost +51.16% 1.515 1.527 lost +0.79% 3.030 2.290 won -24.42% 0.758 0.763 lost +0.66% 0.000 1.527 lost +(was 0) won 4 times tied 0 times lost 6 times total unique fp went from 21 to 21 tied mean fp % went from 1.59090909091 to 1.60305343511 lost +0.76% false negative percentages 4.511 4.511 tied 9.023 3.759 won -58.34% 8.271 3.759 won -54.55% 9.023 5.263 won -41.67% 7.519 2.256 won -70.00% 8.271 3.759 won -54.55% 9.774 4.511 won -53.85% 5.263 3.759 won -28.58% 4.511 3.759 won -16.67% 3.759 3.759 tied won 8 times tied 2 times lost 0 times total unique fn went from 93 to 52 won -44.09% mean fn % went from 6.99248120301 to 3.90977443609 won -44.09% Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** Options.py 17 Sep 2002 15:29:45 -0000 1.16 --- Options.py 17 Sep 2002 17:57:39 -0000 1.17 *************** *** 13,16 **** --- 13,36 ---- defaults = """ [Tokenizer] + # If true, tokenizer.Tokenizer.tokenize_headers() will tokenize the + # contents of each header field just like the text of the message + # body, using the name of the header as a tag. Tokens look like + # "header:word". The basic approach is simple and effective, but also + # very sensitive to biases in the ham and spam collections. For + # example, if the ham and spam were collected at different times, + # several headers with date/time information will become the best + # discriminators. (Not just Date, but Received and X-From_.) + basic_header_tokenize: False + + # If true and basic_header_tokenize is also true, then + # basic_header_tokenize is the only action performed. + basic_header_tokenize_only: False + + # If basic_header_tokenize is true, then basic_header_skip is a set of + # headers that should be skipped. + basic_header_skip: received + date + x-.* + # If false, tokenizer.Tokenizer.tokenize_body() strips HTML tags # from pure text/html messages. Set true to retain HTML tags in this *************** *** 68,86 **** mine_received_headers: False - [MboxTest] - # If tokenize_header_words is true, then the header values are - # tokenized using the default text tokenize. The words are tagged - # with "header:" where header is the name of the header. - tokenize_header_words: False - # If tokenize_header_default is True, use the base header tokenization - # logic described in the Tokenizer section. 
- tokenize_header_default: True - - # skip_headers is a set of regular expressions describing headers that - # should not be tokenized if tokenize_header is True. - skip_headers: received - date - x-.* - [TestDriver] # These control various displays in class TestDriver.Driver. --- 88,91 ---- *************** *** 151,154 **** --- 156,162 ---- 'count_all_header_lines': boolean_cracker, 'mine_received_headers': boolean_cracker, + 'basic_header_tokenize': boolean_cracker, + 'basic_header_tokenize_only': boolean_cracker, + 'basic_header_skip': ('get', lambda s: Set(s.split())), }, 'TestDriver': {'nbuckets': int_cracker, *************** *** 173,180 **** 'adjust_probs_by_evidence_mass': boolean_cracker, }, - 'MboxTest': {'tokenize_header_words': boolean_cracker, - 'tokenize_header_default': boolean_cracker, - 'skip_headers': ('get', lambda s: Set(s.split())), - }, } --- 181,184 ---- *************** *** 222,226 **** return output.getvalue() - options = OptionsClass() --- 226,229 ---- *************** *** 230,231 **** --- 233,235 ---- options.mergefiles(['bayescustomize.ini']) + Index: mboxtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** mboxtest.py 17 Sep 2002 15:29:45 -0000 1.6 --- mboxtest.py 17 Sep 2002 17:57:39 -0000 1.7 *************** *** 26,30 **** import sys ! from tokenizer import Tokenizer, subject_word_re, tokenize_word, tokenize from TestDriver import Driver from timtest import Msg --- 26,30 ---- import sys ! from tokenizer import tokenize from TestDriver import Driver from timtest import Msg *************** *** 37,58 **** } - class MyTokenizer(Tokenizer): - - skip = [re.compile(rx) for rx in options.skip_headers] - - def tokenize_headers(self, msg): - if options.tokenize_header_words: - for k, v in msg.items(): - k = k.lower() - for rx in self.skip: - if rx.match(k): - continue - for w in subject_word_re.findall(v): - for t in tokenize_word(w): - yield "%s:%s" % (k, t) - if options.tokenize_header_default: - for tok in Tokenizer.tokenize_headers(self, msg): - yield tok - class MboxMsg(Msg): --- 37,40 ---- *************** *** 78,85 **** return "\n".join(lines) - tokenize = MyTokenizer().tokenize - def __iter__(self): ! return self.tokenize(self.guts) class mbox(object): --- 60,65 ---- return "\n".join(lines) def __iter__(self): ! return tokenize(self.guts) class mbox(object): *************** *** 132,135 **** --- 112,117 ---- def main(args): global FMT + + print options.display() FMT = "unix" Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** tokenizer.py 14 Sep 2002 20:08:07 -0000 1.22 --- tokenizer.py 17 Sep 2002 17:57:39 -0000 1.23 *************** *** 829,832 **** --- 829,837 ---- class Tokenizer: + def __init__(self): + if options.basic_header_tokenize: + self.basic_skip = [re.compile(s) + for s in options.basic_header_skip] + def get_message(self, obj): if isinstance(obj, email.Message.Message): *************** *** 857,860 **** --- 862,890 ---- # Special tagging of header lines. + # Basic header tokenization + # Tokenize the contents of each header field just like the + # text of the message body, using the name of the header as a + # tag. Tokens look like "header:word". 
The basic approach is + # simple and effective, but also very sensitive to biases in + # the ham and spam collections. For example, if the ham and + # spam were collected at different times, several headers with + # date/time information will become the best discriminators. + # (Not just Date, but Received and X-From_.) + if options.basic_header_tokenize: + for k, v in msg.items(): + k = k.lower() + match = False + for rx in self.basic_skip: + if rx.match(k) is not None: + match = True + continue + if match: + continue + for w in subject_word_re.findall(v): + for t in tokenize_word(w): + yield "%s:%s" % (k, t) + if options.basic_header_tokenize_only: + return + # XXX TODO Neil Schemenauer has gotten a good start on this # XXX (pvt email). The headers in my spam and ham corpora are *************** *** 863,868 **** # XXX some "safe" header lines are included here, where "safe" # XXX is specific to my sorry corpora. - # XXX Jeremy Hylton also reported good results from the general - # XXX header-mining in mboxtest.MyTokenizer.tokenize_headers. # Content-{Type, Disposition} and their params, and charsets. --- 893,896 ---- From tim_one@users.sourceforge.net Wed Sep 18 02:42:00 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 17 Sep 2002 18:42:00 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.17,1.18 classifier.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8672 Modified Files: Options.py classifier.py Log Message: adjust_probs_by_evidence_mass is history -- the reported results weren't strong and consistent enough to justify keeping it. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** Options.py 17 Sep 2002 17:57:39 -0000 1.17 --- Options.py 18 Sep 2002 01:41:58 -0000 1.18 *************** *** 139,146 **** max_discriminators: 16 - - # Speculative change to allow giving probabilities more weight the more - # messages went into computing them. - adjust_probs_by_evidence_mass: False """ --- 139,142 ---- Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** classifier.py 15 Sep 2002 07:45:31 -0000 1.11 --- classifier.py 18 Sep 2002 01:41:58 -0000 1.12 *************** *** 547,551 **** nham = float(self.nham or 1) nspam = float(self.nspam or 1) - fiddle = options.adjust_probs_by_evidence_mass for word,record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). --- 547,550 ---- *************** *** 560,580 **** elif prob > MAX_SPAMPROB: prob = MAX_SPAMPROB - - if fiddle: - # Suppose two clues have spamprob 0.99. Which one is better? - # One reasonable guess is that it's the one derived from the - # most data. This code fiddles non-0.5 probabilities by - # shrinking their distance to 0.5, but shrinking less the - # more evidence went into computing them. Note that if this - # proves to work, it should allow getting rid of the - # "cancelling evidence" complications in spamprob() - # (two probs exactly the same distance from 0.5 are far - # less common after this transformation; instead, spamprob() - # will pick up on the clues with the most evidence backing - # them up). 
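      - # (Illustrative arithmetic, not part of the original checkin: under
      - # this fiddle a clue with prob 0.99 backed by sum == 2 counts shrinks
      - # to 0.5 + 0.49 * 2/2.1 ~= 0.967, while the same prob backed by
      - # sum == 100 counts barely moves: 0.5 + 0.49 * 100/100.1 ~= 0.9895.)
      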
      - dist = prob - 0.5 - sum = hamcount + spamcount - dist *= sum / (sum + 0.1) - prob = 0.5 + dist if record.spamprob != prob: --- 559,562 ---- From rubiconx@users.sourceforge.net Wed Sep 18 18:44:27 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Wed, 18 Sep 2002 10:44:27 -0700 Subject: [Spambayes-checkins] spambayes runtest.sh,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv22127 Modified Files: runtest.sh Log Message: * Modified runtest.sh for Tim's request to test Robinson's changes. Index: runtest.sh =================================================================== RCS file: /cvsroot/spambayes/spambayes/runtest.sh,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** runtest.sh 17 Sep 2002 04:49:16 -0000 1.1 --- runtest.sh 18 Sep 2002 17:44:25 -0000 1.2 *************** *** 16,19 **** --- 16,22 ---- ## + # Test to run + TEST=${1:-robinson1} + # Number of messages per rebalanced set RNUM=200 *************** *** 28,37 **** python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q ! # Clear out .ini file ! rm -f bayescustomize.ini ! # Run 1 ! python timcv.py -n $SETS > run1.txt ! # New .ini file ! cat > bayescustomize.ini < ! ! python timcv.py -n $SETS > run1.txt ! ! mv Tester.py Tester.py.orig ! cp Tester.py.new Tester.py ! mv classifier.py classifier.py.orig ! cp classifier.py.new classifier.py ! python timcv.py -n $SETS > run2.txt ! ! python rates.py run1 run2 > runrates.txt ! ! python cmp.py run1s run2s | tee results.txt ! ! mv Tester.py.orig Tester.py ! mv classifier.py.orig classifier.py ! ;; ! mass) ! ## Tim took this code out, don't run this test. I'm leaving ! ## this stuff in here for the time being so I can refer to it ! ## later when I need to do this sort of thing again :) ! ! # Clear out .ini file ! rm -f bayescustomize.ini ! # Run 1 ! python timcv.py -n $SETS > run1.txt ! # New .ini file ! cat > bayescustomize.ini < run2.txt ! # Generate rates ! python rates.py run1 run2 > runrates.txt ! # Compare rates ! python cmp.py run1s run2s | tee results.txt --- 70,79 ---- hambias: 1.5 EOF ! # Run 2 ! python timcv.py -n $SETS > run2.txt ! # Generate rates ! python rates.py run1 run2 > runrates.txt ! # Compare rates ! python cmp.py run1s run2s | tee results.txt ! ;; ! esac From richiehindle@users.sourceforge.net Wed Sep 18 23:01:42 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Wed, 18 Sep 2002 15:01:42 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.19,1.20 pop3proxy.py,1.1,1.2 hammie.py,1.16,1.17 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20824 Modified Files: README.txt pop3proxy.py hammie.py Log Message: Added SPAM_THRESHOLD and createbayes() to hammie, so that pop3proxy can use them. Made pop3proxy add simple X-Hammie-Disposition headers rather than using its own header format. Made pop3proxy.py obey the Python style guide. Removed the copyright and license from pop3proxy.py - I've assigned copyright to the PSF. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.19 retrieving revision 1.20 diff -C2 -d -r1.19 -r1.20 *** README.txt 17 Sep 2002 04:49:16 -0000 1.19 --- README.txt 18 Sep 2002 22:01:39 -0000 1.20 *************** *** 60,63 **** --- 60,69 ---- Needs to be made faster, especially for writes.
      
+ pop3proxy.py + A spam-classifying POP3 proxy. It adds a spam-judgement header to + each mail as it's retrieved, so you can use your email client's + filters to deal with them without needing to fiddle with your email + delivery system. + Concrete Test Drivers Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** pop3proxy.py 16 Sep 2002 07:57:20 -0000 1.1 --- pop3proxy.py 18 Sep 2002 22:01:39 -0000 1.2 *************** *** 1,31 **** #!/usr/bin/env python ! # pop3proxy is released under the terms of the following MIT-style license: ! # ! # Copyright (c) Entrian Solutions 2002 ! # ! # Permission is hereby granted, free of charge, to any person obtaining a ! # copy of this software and associated documentation files (the "Software"), ! # to deal in the Software without restriction, including without limitation ! # the rights to use, copy, modify, merge, publish, distribute, sublicense, [...1035 lines suppressed...] # Named POP3 server, default port. ! main( args[ 0 ], 110, 110, pickleName, useDB ) ! elif len( args ) == 2: # Named POP3 server, named port. ! main( args[ 0 ], int( args[ 1 ] ), 110, pickleName, useDB ) else: --- 571,581 ---- asyncore.loop() ! elif len(args) == 1: # Named POP3 server, default port. ! main(args[0], 110, 110, pickleName, useDB) ! elif len(args) == 2: # Named POP3 server, named port. ! main(args[0], int(args[1]), 110, pickleName, useDB) else: Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** hammie.py 12 Sep 2002 05:10:02 -0000 1.16 --- hammie.py 18 Sep 2002 22:01:39 -0000 1.17 *************** *** 47,50 **** --- 47,53 ---- DEFAULTDB = "hammie.db" + # Probability at which a message is considered spam + SPAM_THRESHOLD = 0.9 + # Tim's tokenizer kicks far more booty than anything I would have # written. Score one for analysis ;) *************** *** 232,236 **** msg = email.message_from_file(input) prob, clues = bayes.spamprob(tokenize(msg), True) ! if prob < 0.9: disp = "No" else: --- 235,239 ---- msg = email.message_from_file(input) prob, clues = bayes.spamprob(tokenize(msg), True) ! if prob < SPAM_THRESHOLD: disp = "No" else: *************** *** 250,254 **** i += 1 prob, clues = bayes.spamprob(tokenize(msg), True) ! isspam = prob >= 0.9 if hasattr(msg, '_mh_msgno'): msgno = msg._mh_msgno --- 253,257 ---- i += 1 prob, clues = bayes.spamprob(tokenize(msg), True) ! isspam = prob >= SPAM_THRESHOLD if hasattr(msg, '_mh_msgno'): msgno = msg._mh_msgno *************** *** 263,266 **** --- 266,288 ---- print "Total %d spam, %d ham" % (spams, hams) + def createbayes(pck=DEFAULTDB, usedb=False): + """Create a GrahamBayes instance for the given pickle (which + doesn't have to exist). Create a PersistentGrahamBayes if + usedb is True.""" + if usedb: + bayes = PersistentGrahamBayes(pck) + else: + bayes = None + try: + fp = open(pck, 'rb') + except IOError, e: + if e.errno <> errno.ENOENT: raise + else: + bayes = pickle.load(fp) + fp.close() + if bayes is None: + bayes = classifier.GrahamBayes() + return bayes + def usage(code, msg=''): """Print usage message and sys.exit(code).""" *************** *** 304,320 **** save = False ! if usedb: ! bayes = PersistentGrahamBayes(pck) ! else: ! bayes = None ! try: ! fp = open(pck, 'rb') ! except IOError, e: ! 
      if e.errno <> errno.ENOENT: raise ! else: ! bayes = pickle.load(fp) ! fp.close() ! if bayes is None: ! bayes = classifier.GrahamBayes() if good: --- 326,330 ---- save = False ! bayes = createbayes(pck, usedb) if good: From rubiconx@users.sourceforge.net Thu Sep 19 01:17:43 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Wed, 18 Sep 2002 17:17:43 -0700 Subject: [Spambayes-checkins] spambayes hammiesrv.py,NONE,1.1 runtest.sh,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv26022 Modified Files: runtest.sh Added Files: hammiesrv.py Log Message: * runtest now supports targets and a -r option to force re-rebal-ing * new hammiesrv, with a nice clean Hammie class I probably ought to move to hammie.py before any new code imports it --- NEW FILE: hammiesrv.py --- #! /usr/bin/env python # A server version of hammie.py # Server code import SimpleXMLRPCServer import email import errno import cPickle as pickle import hammie from tokenizer import tokenize # Default header to add DFL_HEADER = "X-Hammie-Disposition" # Default spam cutoff DFL_CUTOFF = 0.9 class Hammie: def __init__(self, bayes): self.bayes = bayes def _scoremsg(self, msg, evidence=False): """Score an email.Message. Returns the probability the message is spam. If evidence is true, returns a tuple: (probability, clues), where clues is a list of the words which contributed to the score. """ return self.bayes.spamprob(tokenize(msg), evidence) def score(self, msg, evidence=False): """Score (judge) a message. Pass in a message as a string. Returns the probability the message is spam. If evidence is true, returns a tuple: (probability, clues), where clues is a list of the words which contributed to the score. """ return self._scoremsg(email.message_from_string(msg), evidence) def filter(self, msg, header=DFL_HEADER, cutoff=DFL_CUTOFF): """Score (judge) a message and add a disposition header. Pass in a message as a string. Optionally, set header to the name of the header to add, and/or cutoff to the probability value which must be met or exceeded for a message to get a 'Yes' disposition. Returns the same message with a new disposition header. """ msg = email.message_from_string(msg) prob, clues = self._scoremsg(msg, True) if prob < cutoff: disp = "No" else: disp = "Yes" disp += "; %.2f" % prob disp += "; " + hammie.formatclues(clues) msg.add_header(header, disp) return msg.as_string(unixfrom=(msg.get_unixfrom() is not None)) def train(self, msg, is_spam): """Train bayes with a message. msg should be the message as a string, and is_spam should be 1 if the message is spam, 0 if not. Probabilities are not updated after this call is made; to do that, call update_probabilities(). """ self.bayes.learn(tokenize(msg), is_spam, False) def train_ham(self, msg): """Train bayes with ham. msg should be the message as a string. Probabilities are not updated after this call is made; to do that, call update_probabilities(). """ self.train(msg, False) def train_spam(self, msg): """Train bayes with spam. msg should be the message as a string. Probabilities are not updated after this call is made; to do that, call update_probabilities(). """ self.train(msg, True) def update_probabilities(self): """Update probability values. You would want to call this after a training session. It's pretty slow, so if you have a lot of messages to train, wait until you're all done before calling this.
      
""" self.bayes.update_probabilites() def main(): usedb = True pck = "/home/neale/lib/hammie.db" if usedb: bayes = hammie.PersistentGrahamBayes(pck) else: bayes = None try: fp = open(pck, 'rb') except IOError, e: if e.errno <> errno.ENOENT: raise else: bayes = pickle.load(fp) fp.close() if bayes is None: import classifier bayes = classifier.GrahamBayes() server = SimpleXMLRPCServer.SimpleXMLRPCServer(("localhost", 7732)) server.register_instance(Hammie(bayes)) server.serve_forever() if __name__ == "__main__": main() Index: runtest.sh =================================================================== RCS file: /cvsroot/spambayes/spambayes/runtest.sh,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** runtest.sh 18 Sep 2002 17:44:25 -0000 1.2 --- runtest.sh 19 Sep 2002 00:17:41 -0000 1.3 *************** *** 16,20 **** ## ! # Test to run TEST=${1:-robinson1} --- 16,25 ---- ## ! if [ "$1" = "-r" ]; then ! REBAL=1 ! shift ! fi ! ! # Which test to run TEST=${1:-robinson1} *************** *** 25,36 **** SETS=5 ! # Put them all into reservoirs ! python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q ! python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q ! # Rebalance ! python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q ! python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q case "$TEST" in robinson1) # This test requires you have an appropriately-modified --- 30,50 ---- SETS=5 ! if [ -n "$REBAL" ]; then ! # Put them all into reservoirs ! python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q ! python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q ! # Rebalance ! python rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q ! python rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q ! fi case "$TEST" in + run2|useold) + python timcv.py -n $SETS > run2.txt + + python rates.py run1 run2 > runrates.txt + + python cmp.py run1s run2s | tee results.txt + ;; robinson1) # This test requires you have an appropriately-modified From tim_one@users.sourceforge.net Thu Sep 19 07:30:18 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Wed, 18 Sep 2002 23:30:18 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.18,1.19 Tester.py,1.3,1.4 classifier.py,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25375 Modified Files: Options.py Tester.py classifier.py Log Message: Making it easy to try Gary Robinson's probability combining scheme. Just set: [Classifier] use_robinson_probability: True [TestDriver] spam_cutoff: 0.50 as a pair. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** Options.py 18 Sep 2002 01:41:58 -0000 1.18 --- Options.py 19 Sep 2002 06:30:15 -0000 1.19 *************** *** 89,93 **** [TestDriver] ! # These control various displays in class TestDriver.Driver. # Number of buckets in histograms. --- 89,103 ---- [TestDriver] ! # These control various displays in class TestDriver.Driver, and Tester.Test. ! ! # A message is considered spam iff it scores greater than spam_cutoff. ! # If using Graham's combining scheme, 0.90 seems to work best for "small" ! # training sets. As the size of the training sets increase, there's not ! # yet any bound in sight for how low this can go (0.075 would work as ! # well as 0.90 on Tim's large c.l.py data). ! 
      # For Gary Robinson's scheme, 0.50 works best for *us*. Other people ! # who have implemented Graham's scheme, and stuck to it in most respects, ! # report values closer to 0.70 working best for them. ! spam_cutoff: 0.90 # Number of buckets in histograms. *************** *** 139,142 **** --- 149,155 ---- max_discriminators: 16 + + # Use Gary Robinson's scheme for combining probabilities. + use_robinson_probability: False """ *************** *** 168,171 **** --- 181,185 ---- 'pickle_basename': string_cracker, 'show_charlimit': int_cracker, + 'spam_cutoff': float_cracker, }, 'Classifier': {'hambias': float_cracker, *************** *** 175,179 **** 'unknown_spamprob': float_cracker, 'max_discriminators': int_cracker, ! 'adjust_probs_by_evidence_mass': boolean_cracker, }, } --- 189,193 ---- 'unknown_spamprob': float_cracker, 'max_discriminators': int_cracker, ! 'use_robinson_probability': boolean_cracker, }, } Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** Tester.py 13 Sep 2002 17:49:02 -0000 1.3 --- Tester.py 19 Sep 2002 06:30:15 -0000 1.4 *************** *** 1,2 **** --- 1,4 ---- + from Options import options + class Test: # Pass a classifier instance (an instance of GrahamBayes). *************** *** 83,87 **** if callback: callback(example, prob) ! is_spam_guessed = prob > 0.90 correct = is_spam_guessed == is_spam if is_spam: --- 85,89 ---- if callback: callback(example, prob) ! is_spam_guessed = prob > options.spam_cutoff correct = is_spam_guessed == is_spam if is_spam: Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** classifier.py 18 Sep 2002 01:41:58 -0000 1.12 --- classifier.py 19 Sep 2002 06:30:15 -0000 1.13 *************** *** 314,329 **** heapreplace(nbest, x) ! prob_product = inverse_prob_product = 1.0 ! for distance, prob, word, record in nbest: ! if prob is None: # it's one of the dummies nbest started with ! continue ! if record is not None: # else wordinfo doesn't know about it ! record.killcount += 1 ! if evidence: ! clues.append((word, prob)) ! prob_product *= prob ! inverse_prob_product *= 1.0 - prob - prob = prob_product / (prob_product + inverse_prob_product) if evidence: clues.sort(lambda a, b: cmp(a[1], b[1])) --- 314,358 ---- heapreplace(nbest, x) ! if options.use_robinson_probability: ! # This combination method is due to Gary Robinson. ! # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html ! # In preliminary tests, it did just as well as Graham's scheme, ! # but creates a definite "middle ground" around 0.5 where false ! # negatives and false positives can actually be found in non-trivial ! # numbers. ! P = Q = 1.0 ! num_clues = 0 ! for distance, prob, word, record in nbest: ! if prob is None: # it's one of the dummies nbest started with ! continue ! if record is not None: # else wordinfo doesn't know about it ! record.killcount += 1 ! if evidence: ! clues.append((word, prob)) ! num_clues += 1 ! P *= 1.0 - prob ! Q *= prob ! ! if num_clues: ! P = 1.0 - P**(1./num_clues) ! Q = 1.0 - Q**(1./num_clues) ! prob = (P-Q)/(P+Q) # in -1 .. 1 ! prob = 0.5 + prob/2 # shift to 0 .. 1 ! else: ! prob = 0.5 ! else: ! prob_product = inverse_prob_product = 1.0 ! for distance, prob, word, record in nbest: ! 
      
if prob is None: # it's one of the dummies nbest started with ! continue ! if record is not None: # else wordinfo doesn't know about it ! record.killcount += 1 ! if evidence: ! clues.append((word, prob)) ! prob_product *= prob ! inverse_prob_product *= 1.0 - prob ! ! prob = prob_product / (prob_product + inverse_prob_product) if evidence: clues.sort(lambda a, b: cmp(a[1], b[1])) From anthonybaxter@users.sourceforge.net Thu Sep 19 09:58:00 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 19 Sep 2002 01:58:00 -0700 Subject: [Spambayes-checkins] website developer.ht,1.1.1.1,1.2 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv7560 Modified Files: developer.ht Log Message: duh. Index: developer.ht =================================================================== RCS file: /cvsroot/spambayes/website/developer.ht,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** developer.ht 19 Sep 2002 08:40:55 -0000 1.1.1.1 --- developer.ht 19 Sep 2002 08:57:58 -0000 1.2 *************** *** 25,29 **** or even most cases.

There's a bunch of documentation on things that have already been tried ! available as links from the documentation page.

Collecting training data

--- 25,29 ---- or even most cases.

There's a bunch of documentation on things that have already been tried ! available as links from the documentation page.

Collecting training data

From anthonybaxter@users.sourceforge.net Thu Sep 19 10:34:59 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 19 Sep 2002 02:34:59 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.19,1.20 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20143 Modified Files: Options.py Log Message: if it exists, load options from file(s) specified in env var BAYESCUSTOMIZE rather than bayescustomize.ini. Much more convenient. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.19 retrieving revision 1.20 diff -C2 -d -r1.19 -r1.20 *** Options.py 19 Sep 2002 06:30:15 -0000 1.19 --- Options.py 19 Sep 2002 09:34:56 -0000 1.20 *************** *** 4,8 **** # XXX and must not conflict with OptionsClass method names. ! import sys import StringIO import ConfigParser --- 4,8 ---- # XXX and must not conflict with OptionsClass method names. ! import sys, os import StringIO import ConfigParser *************** *** 242,245 **** del d ! options.mergefiles(['bayescustomize.ini']) --- 242,249 ---- del d ! alternate = os.getenv('BAYESCUSTOMIZE') ! if alternate: ! options.mergefiles(alternate.split()) ! else: ! options.mergefiles(['bayescustomize.ini']) From anthonybaxter@users.sourceforge.net Thu Sep 19 11:25:33 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 19 Sep 2002 03:25:33 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.8,1.9 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv3207 Modified Files: cmp.py Log Message: I got sick of filename completion resulting in 'no such file foo.txt.txt', so cmp.py now looks for the provided filename if "filename".txt doesn't exist. Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** cmp.py 14 Sep 2002 00:03:51 -0000 1.8 --- cmp.py 19 Sep 2002 10:25:31 -0000 1.9 *************** *** 70,78 **** print print f1n, '->', f2n ! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n + '.txt')) ! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n + '.txt')) print --- 70,88 ---- print + def windowsfy(fn): + import os + if os.path.exists(fn + '.txt'): + return fn + '.txt' + else: + return fn print f1n, '->', f2n ! ! f1n = windowsfy(f1n) ! f2n = windowsfy(f2n) ! ! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n)) ! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n)) print From rubiconx@users.sourceforge.net Thu Sep 19 19:15:25 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Thu, 19 Sep 2002 11:15:25 -0700 Subject: [Spambayes-checkins] website/pics - New directory Message-ID: Update of /cvsroot/spambayes/website/pics In directory usw-pr-cvs1:/tmp/cvs-serv553/pics Log Message: Directory /cvsroot/spambayes/website/pics added to the repository From rubiconx@users.sourceforge.net Thu Sep 19 19:16:13 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Thu, 19 Sep 2002 11:16:13 -0700 Subject: [Spambayes-checkins] website/pics banner.png,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/website/pics In directory usw-pr-cvs1:/tmp/cvs-serv707/pics Added Files: banner.png Log Message: * Fixed the little picture in the corner. --- NEW FILE: banner.png --- (This appears to be a binary file; contents omitted.) 
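      A concrete illustration of the BAYESCUSTOMIZE hook checked in above (the ini file names here are made up; the authoritative logic is the Options.py diff itself). The variable holds a whitespace-separated list of option files, each merged over the built-in defaults:

          # Shell:  BAYESCUSTOMIZE='tuning.ini local.ini' python timcv.py -n 5
          # Options.py then does the equivalent of:
          alternate = os.getenv('BAYESCUSTOMIZE')    # 'tuning.ini local.ini'
          if alternate:
              options.mergefiles(alternate.split())  # ['tuning.ini', 'local.ini']
          else:
              options.mergefiles(['bayescustomize.ini'])

      This makes it easy to stack several experiment-specific .ini files without editing bayescustomize.ini in place.
      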
From rubiconx@users.sourceforge.net Thu Sep 19 19:16:13 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Thu, 19 Sep 2002 11:16:13 -0700 Subject: [Spambayes-checkins] website/scripts/ht2html SpamBayesGenerator.py,1.1.1.1,1.2 Message-ID: Update of /cvsroot/spambayes/website/scripts/ht2html In directory usw-pr-cvs1:/tmp/cvs-serv707/scripts/ht2html Modified Files: SpamBayesGenerator.py Log Message: * Fixed the little picture in the corner. Index: SpamBayesGenerator.py =================================================================== RCS file: /cvsroot/spambayes/website/scripts/ht2html/SpamBayesGenerator.py,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** SpamBayesGenerator.py 19 Sep 2002 08:40:56 -0000 1.1.1.1 --- SpamBayesGenerator.py 19 Sep 2002 18:16:11 -0000 1.2 *************** *** 1,2 **** --- 1,3 ---- + #! /usr/bin/env python """Generates the www.python.org website style """ *************** *** 50,60 **** sitelink_fixer.massage(sitelinks, self.__d, aboves=1) Banner.__init__(self, sitelinks) - # calculate the random corner - # XXX Should really do a list of the pics directory... - NBANNERS = 64 - i = whrandom.randint(0, NBANNERS-1) - s = "PyBanner%03d.gif" % i - self.__d['banner'] = s - self.__whichbanner = i def get_meta(self): --- 51,54 ---- *************** *** 99,126 **** return '''
! !
''' % \ self.__d def get_corner_bgcolor(self): ! # this may not be 100% correct. it uses PIL to get the RGB values at ! # the corners of the image and then takes a vote as to the most likely ! # value. Some images may be `bizarre'. See .../pics/backgrounds.py ! return [ ! '#3399ff', '#6699cc', '#3399ff', '#0066cc', '#3399ff', ! '#0066cc', '#0066cc', '#3399ff', '#3399ff', '#3399ff', ! '#3399ff', '#6699cc', '#3399ff', '#3399ff', '#ffffff', ! '#6699cc', '#0066cc', '#3399ff', '#0066cc', '#3399ff', ! '#6699cc', '#0066cc', '#6699cc', '#3399ff', '#3399ff', ! '#6699cc', '#3399ff', '#3399ff', '#6699cc', '#6699cc', ! '#0066cc', '#6699cc', '#0066cc', '#6699cc', '#0066cc', ! '#0066cc', '#6699cc', '#3399ff', '#0066cc', '#bbd6f1', ! '#0066cc', '#6699cc', '#3399ff', '#3399ff', '#0066cc', ! '#0066cc', '#0066cc', '#6699cc', '#6699cc', '#3399ff', ! '#3399ff', '#6699cc', '#0066cc', '#0066cc', '#6699cc', ! '#0066cc', '#6699cc', '#3399ff', '#6699cc', '#3399ff', ! '#d6ebff', '#6699cc', '#3399ff', '#0066cc', ! ][self.__whichbanner] def get_body(self): --- 93,102 ---- return '''
! !
''' % \ self.__d def get_corner_bgcolor(self): ! return "#ffffff" def get_body(self): From rubiconx@users.sourceforge.net Thu Sep 19 23:10:09 2002 From: rubiconx@users.sourceforge.net (Neale Pickett) Date: Thu, 19 Sep 2002 15:10:09 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.23,1.24 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15839 Modified Files: tokenizer.py Log Message: * In case of MessageParseError, just tokenize everything in the message (including headers) as though it were the body of the message. Thanks for the numerous tips, Tim! Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** tokenizer.py 17 Sep 2002 17:57:39 -0000 1.23 --- tokenizer.py 19 Sep 2002 22:10:07 -0000 1.24 *************** *** 1,2 **** --- 1,3 ---- + #! /usr/bin/env python """Module to tokenize email messages for spam filtering.""" *************** *** 840,856 **** # Create an email Message object. try: ! if hasattr(obj, "readline"): ! return email.message_from_file(obj) ! else: ! return email.message_from_string(obj) except email.Errors.MessageParseError: ! return None def tokenize(self, obj): msg = self.get_message(obj) - if msg is None: - yield 'control: MessageParseError' - # XXX Fall back to the raw body text? - return for tok in self.tokenize_headers(msg): --- 841,855 ---- # Create an email Message object. try: ! if hasattr(obj, "read"): ! obj = obj.read() ! return email.message_from_string(obj) except email.Errors.MessageParseError: ! # XXX: This puts the headers in the payload... ! msg = email.Message.Message() ! msg.set_payload(obj) ! return msg def tokenize(self, obj): msg = self.get_message(obj) for tok in self.tokenize_headers(msg): From gward@users.sourceforge.net Fri Sep 20 00:29:34 2002 From: gward@users.sourceforge.net (Greg Ward) Date: Thu, 19 Sep 2002 16:29:34 -0700 Subject: [Spambayes-checkins] website docs.ht,1.1.1.1,1.2 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv5197 Modified Files: docs.ht Log Message: Spell Tim's name right. Beef up the glossary -- tighter (and more standard, IMHO) definition of spam. Index: docs.ht =================================================================== RCS file: /cvsroot/spambayes/website/docs.ht,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** docs.ht 19 Sep 2002 08:40:55 -0000 1.1.1.1 --- docs.ht 19 Sep 2002 23:29:32 -0000 1.2 *************** *** 32,36 ****

CVS commit messages

!

Tim Peter's has whacked a whole lot of useful information into CVS commit messages. As the project was moved from an obscure corner of the python CVS tree, there's actually two sources of CVS commits.

--- 32,36 ----

CVS commit messages

!

Tim Peters has whacked a whole lot of useful information into CVS commit messages. As the project was moved from an obscure corner of the python CVS tree, there's actually two sources of CVS commits.

*************** *** 52,62 ****

A useful(?) glossary of terminology

!
ham
a non-spam. an email that is wanted by the user. !
f-n
false negative !
f-p
false positive
false negative
a spam that's incorrectly classified as ham.
false positive
a ham that's incorrectly classified as spam. -
spam
an email that's not wanted by the end-user.
--- 52,69 ----

A useful(?) glossary of terminology

!
spam
broadly speaking: any email that's not wanted by the ! end-user. More specifically: unsolicited bulk email; email ! that you do not want and did not ask for, and was sent to ! a whole bunch of people by automated means at the same time ! it was sent to you. This definition deliberately excludes viruses ! and those stupid jokes sent to you by your Aunt Tillie. ! !
ham
the opposite of spam; not necessarily email that you want or ! that you asked for, just anything that's not unsolicited bulk email.
false negative
a spam that's incorrectly classified as ham.
false positive
a ham that's incorrectly classified as spam. +
f-n, FN
(abbrev.) false negative +
f-p, FP
(abbrev.) false positive
From gward@users.sourceforge.net Fri Sep 20 00:39:26 2002 From: gward@users.sourceforge.net (Greg Ward) Date: Thu, 19 Sep 2002 16:39:26 -0700 Subject: [Spambayes-checkins] website background.ht,NONE,1.1 docs.ht,1.2,1.3 links.h,1.1.1.1,1.2 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv7867 Modified Files: docs.ht links.h Added Files: background.ht Log Message: Moved a big chunk of docs.ht to new file background.ht. --- NEW FILE: background.ht --- Title: SpamBayes: Background Reading Author-Email: spambayes@python.org

Background Reading

Theory

Sharpen your pencils, this is the mathematical background (such as it is).
  • The paper that started the ball rolling: Paul Graham's A Plan for Spam.
  • Gary Robinson has an interesting essay suggesting some improvements to Graham's original approach.

more links? mail anthony at interlink.com.au

Mailing list archives

There's a lot of background on what's been tried available from the mailing list archives. Initially, the discussion started on the python-dev list, but then moved to the spambayes list.

CVS commit messages

Tim Peters has whacked a whole lot of useful information into CVS commit messages. As the project was moved from an obscure corner of the python CVS tree, there's actually two sources of CVS commits.

  • The older CVS repository via view CVS, or the entire changelog. Development here stopped on the 6th of September 2002.
  • After that, work moved to this project's CVS tree
Index: docs.ht =================================================================== RCS file: /cvsroot/spambayes/website/docs.ht,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** docs.ht 19 Sep 2002 23:29:32 -0000 1.2 --- docs.ht 19 Sep 2002 23:39:24 -0000 1.3 *************** *** 3,44 **** Author: spambayes -

Background reading

-
    -
  • The paper that started the ball rolling: - Paul Graham's A Plan for Spam. -
  • Gary Robinson has an - interesting essay - suggesting some improvements to Graham's original approach. -
-

more links? mail anthony at interlink.com.au

- -

Mailing list archives

-

There's a lot of background on what's been tried available from - the mailing list archives. Initially, the discussion started on - the python-dev list, but then moved to the - spambayes list. - -

- -

CVS commit messages

-

Tim Peters has whacked a whole lot of useful information into CVS - commit messages. As the project was moved from an obscure corner of the - python CVS tree, there's actually two sources of CVS commits.

- -
    -
  • The older CVS repository via view CVS, or the entire changelog. Development here stopped on the 6th of September 2002. -
  • After that, work moved to this project's CVS tree -
-

Project documentation

    --- 3,6 ---- Index: links.h =================================================================== RCS file: /cvsroot/spambayes/website/links.h,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** links.h 19 Sep 2002 08:40:55 -0000 1.1.1.1 --- links.h 19 Sep 2002 23:39:24 -0000 1.2 *************** *** 1,4 **** --- 1,5 ----

    SpamBayes

  • Home page +
  • Background
  • Documentation
  • Developers From nascheme@users.sourceforge.net Fri Sep 20 04:14:44 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Thu, 19 Sep 2002 20:14:44 -0700 Subject: [Spambayes-checkins] spambayes neilfilter.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25139 Modified Files: neilfilter.py Log Message: implement Maildir delivery. This allows the script to be used in a .qmail or .forward file without a wrapper script. Index: neilfilter.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/neilfilter.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** neilfilter.py 9 Sep 2002 21:21:54 -0000 1.1 --- neilfilter.py 20 Sep 2002 03:14:42 -0000 1.2 *************** *** 1,21 **** #! /usr/bin/env python ! """Usage: %(program)s wordprobs.cdb """ import sys import os import email from heapq import heapreplace from sets import Set from classifier import MIN_SPAMPROB, MAX_SPAMPROB, UNKNOWN_SPAMPROB, \ MAX_DISCRIMINATORS - import cdb program = sys.argv[0] # For usage(); referenced by docstring above ! from tokenizer import tokenize ! def spamprob(wordprobs, wordstream, evidence=False): """Return best-guess probability that wordstream is spam. --- 1,27 ---- #! /usr/bin/env python ! """Usage: %(program)s wordprobs.cdb Maildir Spamdir """ import sys import os + import time + import signal + import socket import email from heapq import heapreplace from sets import Set + import cdb + from tokenizer import tokenize from classifier import MIN_SPAMPROB, MAX_SPAMPROB, UNKNOWN_SPAMPROB, \ MAX_DISCRIMINATORS program = sys.argv[0] # For usage(); referenced by docstring above ! BLOCK_SIZE = 10000 ! SIZE_LIMIT = 5000000 # messages larger are not analyzed ! SPAM_THRESHOLD = 0.9 ! def spamprob(wordprobs, wordstream): """Return best-guess probability that wordstream is spam. *************** *** 24,31 **** wordstream is an iterable object producing words. The return value is a float in [0.0, 1.0]. - - If optional arg evidence is True, the return value is a pair - probability, evidence - where evidence is a list of (word, probability) pairs. """ --- 30,33 ---- *************** *** 70,74 **** # to tend in part to cancel out distortions introduced earlier by # HAMBIAS. Experiments will decide the issue. - clues = [] # First cancel out competing extreme clues (see comment block at --- 72,75 ---- *************** *** 83,89 **** # initial clues from the longer list into the probability # computation. - for dist, prob, word in shorter + longer[tokeep:]: - if evidence: - clues.append((word, prob)) for x in longer[:tokeep]: heapreplace(nbest, x) --- 84,87 ---- *************** *** 93,121 **** if prob is None: # it's one of the dummies nbest started with continue - if evidence: - clues.append((word, prob)) prob_product *= prob inverse_prob_product *= 1.0 - prob prob = prob_product / (prob_product + inverse_prob_product) ! if evidence: ! clues.sort(lambda a, b: cmp(a[1], b[1])) ! return prob, clues ! else: ! return prob ! ! def formatclues(clues, sep="; "): ! """Format the clues into something readable.""" ! return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) ! def is_spam(wordprobs, input): ! """Filter (judge) a message""" ! msg = email.message_from_file(input) ! prob, clues = spamprob(wordprobs, tokenize(msg), True) ! #print "%.2f;" % prob, formatclues(clues) ! if prob < 0.9: ! return False ! else: ! 
      return True def usage(code, msg=''): --- 91,118 ---- if prob is None: # it's one of the dummies nbest started with continue prob_product *= prob inverse_prob_product *= 1.0 - prob prob = prob_product / (prob_product + inverse_prob_product) ! return prob ! def maketmp(dir): ! hostname = socket.gethostname() ! pid = os.getpid() ! fd = -1 ! for x in xrange(200): ! filename = "%d.%d.%s" % (time.time(), pid, hostname) ! pathname = "%s/tmp/%s" % (dir, filename) ! try: ! fd = os.open(pathname, os.O_WRONLY|os.O_CREAT|os.O_EXCL, 0600) ! except OSError, exc: ! if exc.errno not in (errno.EINTR, errno.EEXIST): ! raise ! else: ! break ! time.sleep(2) ! if fd == -1: ! raise SystemExit, "could not create a mail file" ! return (os.fdopen(fd, "wb"), pathname, filename) def usage(code, msg=''): *************** *** 128,139 **** def main(): ! if len(sys.argv) != 2: usage(2) ! wordprobs = cdb.Cdb(open(sys.argv[1], 'rb')) ! if is_spam(wordprobs, sys.stdin): ! sys.exit(1) ! else: ! sys.exit(0) if __name__ == "__main__": --- 125,171 ---- def main(): ! if len(sys.argv) != 4: usage(2) ! wordprobfilename = sys.argv[1] ! hamdir = sys.argv[2] ! spamdir = sys.argv[3] ! ! signal.signal(signal.SIGALRM, lambda s, f: sys.exit(1)) ! signal.alarm(24 * 60 * 60) ! ! # write message to temporary file (must be on same partition) ! tmpfile, pathname, filename = maketmp(hamdir) ! try: ! tmpfile.write(os.environ.get("DTLINE", "")) # delivered-to line ! bytes = 0 ! blocks = [] ! while 1: ! block = sys.stdin.read(BLOCK_SIZE) ! if not block: ! break ! bytes += len(block) ! if bytes < SIZE_LIMIT: ! blocks.append(block) ! tmpfile.write(block) ! tmpfile.close() ! ! if bytes < SIZE_LIMIT: ! msgdata = ''.join(blocks) ! del blocks ! msg = email.message_from_string(msgdata) ! del msgdata ! wordprobs = cdb.Cdb(open(wordprobfilename, 'rb')) ! prob = spamprob(wordprobs, tokenize(msg)) ! else: ! prob = 0.0 ! ! if prob > SPAM_THRESHOLD: ! os.rename(pathname, "%s/new/%s" % (spamdir, filename)) ! else: ! os.rename(pathname, "%s/new/%s" % (hamdir, filename)) ! except: ! os.unlink(pathname) ! raise if __name__ == "__main__": From nascheme@users.sourceforge.net Fri Sep 20 04:15:16 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Thu, 19 Sep 2002 20:15:16 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.20,1.21 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25271 Modified Files: README.txt Log Message: Add a short description of my scripts. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** README.txt 18 Sep 2002 22:01:39 -0000 1.20 --- README.txt 20 Sep 2002 03:15:13 -0000 1.21 *************** *** 66,69 **** --- 66,82 ---- delivery system. + neiltrain.py + Builds a CDB (constant database) file of word probabilities using + spam and non-spam mail. The database is intended for use with + neilfilter.py. + + neilfilter.py + A delivery agent that uses the CDB created by neiltrain.py and + delivers a message to one of two Maildir message folders, depending + on the classifier score. Note that both Maildirs must be on the + same device.
      
An example .qmail or .forward file would be: + + |python2.3 spambayes/neilfilter.py wordprobs.cdb Maildir/ Mail/Spam/ + Concrete Test Drivers From tim_one@users.sourceforge.net Fri Sep 20 06:55:10 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 19 Sep 2002 22:55:10 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.24,1.25 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24875 Modified Files: tokenizer.py Log Message: tokenize_headers(): Rearranged for better sanity, updated some comments, simplified overly tortured logic in basic_header_tokenize. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** tokenizer.py 19 Sep 2002 22:10:07 -0000 1.24 --- tokenizer.py 20 Sep 2002 05:55:08 -0000 1.25 *************** *** 859,900 **** def tokenize_headers(self, msg): ! # Special tagging of header lines. # Basic header tokenization ! # Tokenize the contents of each header field just like the ! # text of the message body, using the name of the header as a ! # tag. Tokens look like "header:word". The basic approach is ! # simple and effective, but also very sensitive to biases in ! # the ham and spam collections. For example, if the ham and ! # spam were collected at different times, several headers with ! # date/time information will become the best discriminators. # (Not just Date, but Received and X-From_.) if options.basic_header_tokenize: for k, v in msg.items(): k = k.lower() - match = False for rx in self.basic_skip: ! if rx.match(k) is not None: ! match = True ! continue ! if match: ! continue ! for w in subject_word_re.findall(v): ! for t in tokenize_word(w): ! yield "%s:%s" % (k, t) if options.basic_header_tokenize_only: return - - # XXX TODO Neil Schemenauer has gotten a good start on this - # XXX (pvt email). The headers in my spam and ham corpora are - # XXX so different (they came from different sources) that if - # XXX I include them the classifier's job is trivial. Only - # XXX some "safe" header lines are included here, where "safe" - # XXX is specific to my sorry corpora. - - # Content-{Type, Disposition} and their params, and charsets. - for x in msg.walk(): - for w in crack_content_xyz(x): - yield w # Subject: --- 859,904 ---- def tokenize_headers(self, msg): ! # Special tagging of header lines and MIME metadata. ! ! # Content-{Type, Disposition} and their params, and charsets. ! # This is done for all MIME sections. ! for x in msg.walk(): ! for w in crack_content_xyz(x): ! yield w ! ! # The rest is solely tokenization of header lines. ! # XXX The headers in my (Tim's) spam and ham corpora are so different ! # XXX (they came from different sources) that including several kinds ! # XXX of header analysis renders the classifier's job trivial. So ! # XXX lots of this is crippled now, controlled by an ever-growing ! # XXX collection of funky options. # Basic header tokenization ! # Tokenize the contents of each header field in the way Subject lines ! # are tokenized later. ! # XXX Different kinds of tokenization have gotten better results on ! # XXX different header lines. No experiments have been run on ! # XXX whether the best choice is being made for each of the header ! # XXX lines tokenized by this section. ! # The name of the header is used as a tag. Tokens look like ! # "header:word". The basic approach is simple and effective, but ! 
# also very sensitive to biases in the ham and spam collections. ! # For example, if the ham and spam were collected at different ! # times, several headers with date/time information will become ! # the best discriminators. # (Not just Date, but Received and X-From_.) if options.basic_header_tokenize: for k, v in msg.items(): k = k.lower() for rx in self.basic_skip: ! if rx.match(k): ! break # do nothing -- we're supposed to skip this ! else: ! # Never found a match -- don't skip this. ! for w in subject_word_re.findall(v): ! for t in tokenize_word(w): ! yield "%s:%s" % (k, t) if options.basic_header_tokenize_only: return # Subject: From tim_one@users.sourceforge.net Fri Sep 20 07:00:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 19 Sep 2002 23:00:08 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.25,1.26 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25864 Modified Files: tokenizer.py Log Message: crack_uuencode(): Added a note about an obscure efficiency gimmick I relied on but didn't think to mention before. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.25 retrieving revision 1.26 diff -C2 -d -r1.25 -r1.26 *** tokenizer.py 20 Sep 2002 05:55:08 -0000 1.25 --- tokenizer.py 20 Sep 2002 06:00:06 -0000 1.26 *************** *** 769,773 **** # is (new_text, sequence_of_tokens), where new_text no longer contains # uuencoded stuff. Note that we're not bothering to decode it! Maybe ! # we should. def crack_uuencode(text): new_text = [] --- 769,781 ---- # is (new_text, sequence_of_tokens), where new_text no longer contains # uuencoded stuff. Note that we're not bothering to decode it! Maybe ! # we should. One of my persistent false negatives is a spam containing ! # nothing but a uuencoded money.txt; OTOH, uuencoded seems to be on ! # its way out (that's an old spam). ! # ! # Efficiency note: This is cheaper than it looks if there aren't any ! # uuencoded sections. Under the covers, string[0:] is optimized to ! # return string (no new object is built), and likewise ''.join([string]) ! # is optimized to return string. It would actually slow this code down ! # to special-case these "do nothing" special cases at the Python level! def crack_uuencode(text): new_text = [] From tim_one@users.sourceforge.net Fri Sep 20 07:03:14 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 19 Sep 2002 23:03:14 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.26,1.27 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv26473 Modified Files: tokenizer.py Log Message: Removed the code in support of tokenizing src= thingies. It was all commented out because it made no difference when enabled. Note that we pick up all http:// thingies regardless of their context anyway. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.26 retrieving revision 1.27 diff -C2 -d -r1.26 -r1.27 *** tokenizer.py 20 Sep 2002 06:00:06 -0000 1.26 --- tokenizer.py 20 Sep 2002 06:03:12 -0000 1.27 *************** *** 578,593 **** subject_word_re = re.compile(r"[\w\x80-\xff$.%]+") - # Anthony Baxter reported goodness from cracking src params. 
- # Finding a src= thingie is complicated if we insist it appear in an - # img or iframe tag, so this approximates reality with a fast and - # non-stack-blowing simple regexp. - src_re = re.compile(r""" - \s - src=['"] - (?!https?:) # we suck out http thingies via a different gimmick - ([^'"]{1,128}) # capture the guts, but don't go wild - ['"] - """, re.VERBOSE) - fname_sep_re = re.compile(r'[/\\:]') --- 578,581 ---- *************** *** 1012,1026 **** for t in tokens: yield t - - # Anthony Baxter reported goodness from tokenizing src= params. - # XXX This made no difference in my tests: both error rates - # XXX across 20 runs were identical before and after. I suspect - # XXX this is because Anthony got most good out of the http - # XXX thingies in , but we - # XXX picked those up in the last step (in src params and - # XXX everywhere else). So this code is commented out. - ## for fname in src_re.findall(text): - ## for x in crack_filename(fname): - ## yield "src:" + x # Remove HTML/XML tags. --- 1000,1003 ---- From tim_one@users.sourceforge.net Fri Sep 20 07:06:15 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 19 Sep 2002 23:06:15 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.27,1.28 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv27328 Modified Files: tokenizer.py Log Message: tokenize_body(): Brought the docstring into line with current reality. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** tokenizer.py 20 Sep 2002 06:03:12 -0000 1.27 --- tokenizer.py 20 Sep 2002 06:06:13 -0000 1.28 *************** *** 965,976 **** """Generate a stream of tokens from an email Message. - If a multipart/alternative section has both text/plain and text/html - sections, the text/html section is ignored. This may not be a good - idea (e.g., the sections may have different content). - HTML tags are always stripped from text/plain sections. - options.retain_pure_html_tags controls whether HTML tags are ! also stripped from text/html sections. """ --- 965,977 ---- """Generate a stream of tokens from an email Message. HTML tags are always stripped from text/plain sections. options.retain_pure_html_tags controls whether HTML tags are ! also stripped from text/html sections. Except in special cases, ! it's recommended to leave that at its default of false. ! ! If a multipart/alternative section has both text/plain and text/html ! sections, options.ignore_redundant_html controls whether the HTML ! part is ignored. Except in special cases, it's recommended to ! leave that at its default of false. """ From tim_one@users.sourceforge.net Fri Sep 20 07:18:26 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 19 Sep 2002 23:18:26 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.28,1.29 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv29244 Modified Files: tokenizer.py Log Message: get_message(): Added docstring. Reduced useless nesting. Moved inappropriate code out of a try block. In case of a message parse error, used a cheap trick to try to get rid of the (probably malformed) headers before wrapping the text in a bare Message object. 
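      In outline, the parse-failure fallback that log message describes amounts to this (a distilled sketch; the authoritative version is the diff below):

          try:
              msg = email.message_from_string(text)
          except email.Errors.MessageParseError:
              # The headers are the likely culprits -- toss everything
              # through the first blank line and keep the rest as payload.
              i = text.find('\n\n')
              if i >= 0:
                  text = text[i+2:]
              msg = email.Message.Message()
              msg.set_payload(text)
      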
Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** tokenizer.py 20 Sep 2002 06:06:13 -0000 1.28 --- tokenizer.py 20 Sep 2002 06:18:24 -0000 1.29 *************** *** 832,848 **** def get_message(self, obj): if isinstance(obj, email.Message.Message): return obj ! else: ! # Create an email Message object. ! try: ! if hasattr(obj, "read"): ! obj = obj.read() ! return email.message_from_string(obj) ! except email.Errors.MessageParseError: ! # XXX: This puts the headers in the payload... ! msg = email.Message.Message() ! msg.set_payload(obj) ! return msg def tokenize(self, obj): --- 832,864 ---- def get_message(self, obj): + """Return an email Message object. + + The argument may be a Message object already, in which case it's + returned as-is. + + If the argument is a string or file-like object (supports read()), + the email package is used to create a Message object from it. This + can fail if the message is malformed. In that case, the headers + (everything through the first blank line) are thrown out, and the + rest of the text is wrapped in a bare email.Message.Message. + """ + if isinstance(obj, email.Message.Message): return obj ! # Create an email Message object. ! if hasattr(obj, "read"): ! obj = obj.read() ! try: ! msg = email.message_from_string(obj) ! except email.Errors.MessageParseError: ! # Wrap the raw text in a bare Message object. Since the ! # headers are most likely damaged, we can't use the email ! # package to parse them, so just get rid of them first. ! i = obj.find('\n\n') ! if i >= 0: ! obj = obj[i+2:] # strip headers ! msg = email.Message.Message() ! msg.set_payload(obj) ! return msg def tokenize(self, obj): From montanaro@users.sourceforge.net Fri Sep 20 16:24:57 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Fri, 20 Sep 2002 08:24:57 -0700 Subject: [Spambayes-checkins] spambayes .cvsignore,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16668 Modified Files: .cvsignore Log Message: ignore the Data directory Index: .cvsignore =================================================================== RCS file: /cvsroot/spambayes/spambayes/.cvsignore,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** .cvsignore 7 Sep 2002 05:53:12 -0000 1.2 --- .cvsignore 20 Sep 2002 15:24:54 -0000 1.3 *************** *** 5,6 **** --- 5,7 ---- *.zip build + Data From gvanrossum@users.sourceforge.net Fri Sep 20 20:30:54 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 20 Sep 2002 12:30:54 -0700 Subject: [Spambayes-checkins] spambayes mboxutils.py,NONE,1.1 hammie.py,1.17,1.18 splitndirs.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv3545 Modified Files: hammie.py splitndirs.py Added Files: mboxutils.py Log Message: Moved hammie's getmbox out to a separate module, mboxutils, and enhanced it to support a syntax to designate multiple MH mailboxes. Augmented splitndirs.py to use this so it can work on MH mailbox directories as well as on Unix mailboxes. --- NEW FILE: mboxutils.py --- """Utilities for dealing with various types of mailboxes. This is mostly a wrapper around the various useful classes in the standard mailbox module, to do some intelligent guessing of the mailbox type given a mailbox argument. 
+foo -- MH mailbox +foo +foo,bar -- MH mailboxes +foo and +bar concatenated +ALL -- a shortcut for *all* MH mailboxes /foo/bar -- (existing file) a Unix-style mailbox /foo/bar/ -- (existing directory) a directory full of .txt and .lorien files /foo/Mail/bar/ -- (existing directory with /Mail/ in its path) alternative way of spelling an MH mailbox """ from __future__ import generators import os import glob import email import mailbox class DirOfTxtFileMailbox: """Mailbox directory consisting of .txt and .lorien files.""" def __init__(self, dirname, factory): self.names = (glob.glob(os.path.join(dirname, "*.txt")) + glob.glob(os.path.join(dirname, "*.lorien"))) self.names.sort() self.factory = factory def __iter__(self): for name in self.names: try: f = open(name) except IOError: continue yield self.factory(f) f.close() def _factory(fp): # Helper for getmbox try: return email.message_from_file(fp) except email.Errors.MessageParseError: return '' def _cat(seqs): for seq in seqs: for item in seq: yield item def getmbox(name): """Return an mbox iterator given a file/directory/folder name.""" if name.startswith("+"): # MH folder name: +folder, +f1,f2,f2, or +ALL name = name[1:] import mhlib mh = mhlib.MH() if name == "ALL": names = mh.listfolders() elif ',' in name: names = name.split(',') else: names = [name] mboxes = [] mhpath = mh.getpath() for name in names: filename = os.path.join(mhpath, name) mbox = mailbox.MHMailbox(filename, _factory) mboxes.append(mbox) if len(mboxes) == 1: return iter(mboxes[0]) else: return _cat(mboxes) if os.path.isdir(name): # XXX Bogus: use an MHMailbox if the pathname contains /Mail/, # else a DirOfTxtFileMailbox. if name.find("/Mail/") >= 0: mbox = mailbox.MHMailbox(name, _factory) else: mbox = DirOfTxtFileMailbox(name, _factory) else: fp = open(name) mbox = mailbox.PortableUnixMailbox(fp, _factory) return iter(mbox) Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** hammie.py 18 Sep 2002 22:01:39 -0000 1.17 --- hammie.py 20 Sep 2002 19:30:52 -0000 1.18 *************** *** 34,42 **** import glob import email - import classifier import errno import anydbm import cPickle as pickle program = sys.argv[0] # For usage(); referenced by docstring above --- 34,44 ---- import glob import email import errno import anydbm import cPickle as pickle + import mboxutils + import classifier + program = sys.argv[0] # For usage(); referenced by docstring above *************** *** 171,220 **** - class DirOfTxtFileMailbox: - - """Mailbox directory consisting of .txt files.""" - - def __init__(self, dirname, factory): - self.names = glob.glob(os.path.join(dirname, "*.txt")) - self.factory = factory - - def __iter__(self): - for name in self.names: - try: - f = open(name) - except IOError: - continue - yield self.factory(f) - f.close() - - - def getmbox(msgs): - """Return an iterable mbox object given a file/directory/folder name.""" - def _factory(fp): - try: - return email.message_from_file(fp) - except email.Errors.MessageParseError: - return '' - - if msgs.startswith("+"): - import mhlib - mh = mhlib.MH() - mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]), - _factory) - elif os.path.isdir(msgs): - # XXX Bogus: use an MHMailbox if the pathname contains /Mail/, - # else a DirOfTxtFileMailbox. 
- if msgs.find("/Mail/") >= 0: - mbox = mailbox.MHMailbox(msgs, _factory) - else: - mbox = DirOfTxtFileMailbox(msgs, _factory) - else: - fp = open(msgs) - mbox = mailbox.PortableUnixMailbox(fp, _factory) - return mbox - def train(bayes, msgs, is_spam): """Train bayes with all messages from a mailbox.""" ! mbox = getmbox(msgs) i = 0 for msg in mbox: --- 173,179 ---- def train(bayes, msgs, is_spam): """Train bayes with all messages from a mailbox.""" ! mbox = mboxutils.getmbox(msgs) i = 0 for msg in mbox: *************** *** 247,251 **** """Score (judge) all messages from a mailbox.""" # XXX The reporting needs work! ! mbox = getmbox(msgs) i = 0 spams = hams = 0 --- 206,210 ---- """Score (judge) all messages from a mailbox.""" # XXX The reporting needs work! ! mbox = mboxutils.getmbox(msgs) i = 0 spams = hams = 0 Index: splitndirs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** splitndirs.py 8 Sep 2002 12:55:33 -0000 1.1 --- splitndirs.py 20 Sep 2002 19:30:52 -0000 1.2 *************** *** 47,50 **** --- 47,52 ---- import getopt + import mboxutils + program = sys.argv[0] *************** *** 86,90 **** inputpath, outputbasepath = args - infile = file(inputpath, 'rb') outdirs = [outputbasepath + ("%d" % i) for i in range(1, n+1)] for dir in outdirs: --- 88,91 ---- *************** *** 92,96 **** os.makedirs(dir) ! mbox = mailbox.PortableUnixMailbox(infile, _factory) counter = 0 for msg in mbox: --- 93,97 ---- os.makedirs(dir) ! mbox = mboxutils.getmbox(inputpath) counter = 0 for msg in mbox: *************** *** 104,113 **** if verbose: if counter % 100 == 0: ! print '.', if verbose: print print counter, "messages split into", n, "directories" - infile.close() if __name__ == '__main__': --- 105,114 ---- if verbose: if counter % 100 == 0: ! sys.stdout.write('.') ! sys.stdout.flush() if verbose: print print counter, "messages split into", n, "directories" if __name__ == '__main__': From gvanrossum@users.sourceforge.net Fri Sep 20 20:32:28 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 20 Sep 2002 12:32:28 -0700 Subject: [Spambayes-checkins] spambayes neiltrain.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv4565 Modified Files: neiltrain.py Log Message: Use mboxutils instead of a copy of getmbox(). Index: neiltrain.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/neiltrain.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** neiltrain.py 9 Sep 2002 21:21:54 -0000 1.1 --- neiltrain.py 20 Sep 2002 19:32:26 -0000 1.2 *************** *** 8,13 **** --- 8,15 ---- import mailbox import email + import classifier import cdb + import mboxutils program = sys.argv[0] # For usage(); referenced by docstring above *************** *** 15,46 **** from tokenizer import tokenize - def getmbox(msgs): - """Return an iterable mbox object""" - def _factory(fp): - try: - return email.message_from_file(fp) - except email.Errors.MessageParseError: - return '' - - if msgs.startswith("+"): - import mhlib - mh = mhlib.MH() - mbox = mailbox.MHMailbox(os.path.join(mh.getpath(), msgs[1:]), - _factory) - elif os.path.isdir(msgs): - # XXX Bogus: use an MHMailbox if the pathname contains /Mail/, - # else a DirOfTxtFileMailbox. 
- if msgs.find("/Mail/") >= 0: - mbox = mailbox.MHMailbox(msgs, _factory) - else: - mbox = DirOfTxtFileMailbox(msgs, _factory) - else: - fp = open(msgs) - mbox = mailbox.PortableUnixMailbox(fp, _factory) - return mbox - def train(bayes, msgs, is_spam): """Train bayes with all messages from a mailbox.""" ! mbox = getmbox(msgs) for msg in mbox: bayes.learn(tokenize(msg), is_spam, False) --- 17,23 ---- from tokenizer import tokenize def train(bayes, msgs, is_spam): """Train bayes with all messages from a mailbox.""" ! mbox = mboxutils.getmbox(msgs) for msg in mbox: bayes.learn(tokenize(msg), is_spam, False) From gvanrossum@users.sourceforge.net Fri Sep 20 21:00:48 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 20 Sep 2002 13:00:48 -0700 Subject: [Spambayes-checkins] spambayes splitndirs.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13883 Modified Files: splitndirs.py Log Message: Another refinement: in order to make nice training sets out of Bruce G's spam collections, this script now supports multiple input mboxes. Index: splitndirs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** splitndirs.py 20 Sep 2002 19:30:52 -0000 1.2 --- splitndirs.py 20 Sep 2002 20:00:45 -0000 1.3 *************** *** 3,7 **** """Split an mbox into N random directories of files. ! Usage: %(program)s [-h] [-s seed] [-v] -n N sourcembox outdirbase Options: --- 3,7 ---- """Split an mbox into N random directories of files. ! Usage: %(program)s [-h] [-s seed] [-v] -n N sourcembox ... outdirbase Options: *************** *** 84,90 **** usage(1, "an -n value > 1 is required") ! if len(args) != 2: usage(1, "input mbox name and output base path are required") ! inputpath, outputbasepath = args outdirs = [outputbasepath + ("%d" % i) for i in range(1, n+1)] --- 84,90 ---- usage(1, "an -n value > 1 is required") ! if len(args) < 2: usage(1, "input mbox name and output base path are required") ! inputpaths, outputbasepath = args[:-1], args[-1] outdirs = [outputbasepath + ("%d" % i) for i in range(1, n+1)] *************** *** 93,110 **** os.makedirs(dir) - mbox = mboxutils.getmbox(inputpath) counter = 0 ! for msg in mbox: ! i = random.randrange(n) ! astext = str(msg) ! #assert astext.endswith('\n') ! counter += 1 ! msgfile = open('%s/%d' % (outdirs[i], counter), 'wb') ! msgfile.write(astext) ! msgfile.close() ! if verbose: ! if counter % 100 == 0: ! sys.stdout.write('.') ! sys.stdout.flush() if verbose: --- 93,111 ---- os.makedirs(dir) counter = 0 ! for inputpath in inputpaths: ! mbox = mboxutils.getmbox(inputpath) ! for msg in mbox: ! i = random.randrange(n) ! astext = str(msg) ! #assert astext.endswith('\n') ! counter += 1 ! msgfile = open('%s/%d' % (outdirs[i], counter), 'wb') ! msgfile.write(astext) ! msgfile.close() ! if verbose: ! if counter % 100 == 0: ! sys.stdout.write('.') ! sys.stdout.flush() if verbose: From tim_one@users.sourceforge.net Sat Sep 21 01:15:18 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 20 Sep 2002 17:15:18 -0700 Subject: [Spambayes-checkins] spambayes classifier.py,1.13,1.14 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24783 Modified Files: classifier.py Log Message: Removed xspamprob() -- it's unused. 
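
For reference before the diff below removes it: xspamprob()'s distinguishing twist was folding the class priors P(spam) and P(not-spam) into Graham's combining rule. A minimal sketch of just that computation (a hypothetical helper, simplified from the removed code; clue selection and the rest of the machinery are omitted):

    def prior_adjusted_spamprob(clue_probs, nham, nspam):
        # clue_probs is a list of per-word spam probabilities.
        sp = float(nspam) / (nham + nspam)   # P(spam)
        hp = 1.0 - sp                        # P(not-spam)
        prob_product = sp
        inverse_prob_product = hp
        for p in clue_probs:
            prob_product *= p / sp
            inverse_prob_product *= (1.0 - p) / hp
        return prob_product / (prob_product + inverse_prob_product)
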
Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** classifier.py 19 Sep 2002 06:30:15 -0000 1.13 --- classifier.py 21 Sep 2002 00:15:16 -0000 1.14 *************** *** 361,535 **** return prob - # The same as spamprob(), except uses a corrected probability computation - # accounting for P(spam) and P(not-spam). Since my training corpora had - # a ham/spam ratio of 4000/2750, I'm in a good position to test this. - # Using xspamprob() clearly made a major reduction in the false negative - # rate, cutting it in half on some runs (this is after the f-n rate had - # already been cut by a factor of 5 via other refinements). It also - # uncovered two more very brief spams hiding in the ham corpora. - # - # OTOH, the # of fps increased. Especially vulnerable are extremely - # short msgs of the "subscribe me"/"unsubscribe me" variety (while these - # don't belong on a mailing list, they're not spam), and brief reasonable - # msgs that simply don't have much evidence (to the human eye) to go on. - # These were boderline before, and it's easy to push them over the edge. - # For example, one f-p had subject - # - # Any Interest in EDIFACT Parser/Generator? - # - # and the body just - # - # Just curious. - # --jim - # - # "Interest" in the subject line had spam prob 0.99, "curious." 0.01, - # and nothing else was strong. Since my ham/spam ratio is bigger than - # 1, any clue favoring spam favors spam more strongly under xspamprob() - # than under spamprob(). - # - # XXX Somewhat like spamprob(), learn() also computes probabilities as - # XXX if the # of hams and spams were the same. If that were also - # XXX fiddled to take nham and nspam into account (nb: I realize it - # XXX already *looks* like it does -- but it doesn't), it would reduce - # XXX the spam probabilities in my test run, and *perhaps* xspamprob - # XXX wouldn't have such bad effect on the f-p story. - # - # Here are the comparative stats, with spamprob() in the left column and - # xspamprob() in the right, across 20 runs: - # - # false positive percentages - # 0.000 0.000 tied - # 0.000 0.050 lost - # 0.050 0.100 lost - # 0.000 0.075 lost - # 0.025 0.050 lost - # 0.025 0.100 lost - # 0.050 0.150 lost - # 0.025 0.050 lost - # 0.025 0.050 lost - # 0.000 0.050 lost - # 0.075 0.150 lost - # 0.050 0.075 lost - # 0.025 0.050 lost - # 0.000 0.050 lost - # 0.050 0.125 lost - # 0.025 0.075 lost - # 0.025 0.025 tied - # 0.000 0.025 lost - # 0.025 0.100 lost - # 0.050 0.150 lost - # - # won 0 times - # tied 2 times - # lost 18 times - # - # total unique fp went from 8 to 30 - # - # false negative percentages - # 0.945 0.473 won - # 0.836 0.582 won - # 1.200 0.618 won - # 1.418 0.836 won - # 1.455 0.836 won - # 1.091 0.691 won - # 1.091 0.618 won - # 1.236 0.691 won - # 1.564 1.018 won - # 1.236 0.618 won - # 1.563 0.981 won - # 1.563 0.800 won - # 1.236 0.618 won - # 0.836 0.400 won - # 0.873 0.400 won - # 1.236 0.545 won - # 1.273 0.691 won - # 1.018 0.327 won - # 1.091 0.473 won - # 1.490 0.618 won - # - # won 20 times - # tied 0 times - # lost 0 times - # - # total unique fn went from 292 to 162 - # - # XXX This needs to be updated to incorporate the "cancel out competing - # XXX extreme clues" twist. - def xspamprob(self, wordstream, evidence=False): - """Return best-guess probability that wordstream is spam. - - wordstream is an iterable object producing words. 
- The return value is a float in [0.0, 1.0]. - - If optional arg evidence is True, the return value is a pair - probability, evidence - where evidence is a list of (word, probability) pairs. - """ - - # A priority queue to remember the MAX_DISCRIMINATORS best - # probabilities, where "best" means largest distance from 0.5. - # The tuples are (distance, prob, word, wordinfo[word]). - nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS - smallest_best = -1.0 - - # Counting a unique word multiple times hurts, although counting one - # at most two times had some benefit whan UNKNOWN_SPAMPROB was 0.2. - # When that got boosted to 0.5, counting more than once became - # counterproductive. - unique_words = {} - - wordinfoget = self.wordinfo.get - now = time.time() - - for word in wordstream: - if word in unique_words: - continue - unique_words[word] = 1 - - record = wordinfoget(word) - if record is None: - prob = UNKNOWN_SPAMPROB - else: - record.atime = now - prob = record.spamprob - - distance = abs(prob - 0.5) - if distance > smallest_best: - # Subtle: we didn't use ">" instead of ">=" just to save - # calls to heapreplace(). The real intent is that if - # there are many equally strong indicators throughout the - # message, we want to favor the ones that appear earliest: - # it's expected that spam headers will often have smoking - # guns, and, even when not, spam has to grab your attention - # early (& note that when spammers generate large blocks of - # random gibberish to throw off exact-match filters, it's - # always at the end of the msg -- if they put it at the - # start, *nobody* would read the msg). - heapreplace(nbest, (distance, prob, word, record)) - smallest_best = nbest[0][0] - - # Compute the probability. - if evidence: - clues = [] - sp = float(self.nspam) / (self.nham + self.nspam) - hp = 1.0 - sp - prob_product = sp - inverse_prob_product = hp - for distance, prob, word, record in nbest: - if prob is None: # it's one of the dummies nbest started with - continue - if record is not None: # else wordinfo doesn't know about it - record.killcount += 1 - if evidence: - clues.append((word, prob)) - prob_product *= prob / sp - inverse_prob_product *= (1.0 - prob) / hp - - prob = prob_product / (prob_product + inverse_prob_product) - if evidence: - return prob, clues - else: - return prob - def learn(self, wordstream, is_spam, update_probabilities=True): --- 361,364 ---- From tim_one@users.sourceforge.net Sat Sep 21 03:46:23 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 20 Sep 2002 19:46:23 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.20,1.21 classifier.py,1.14,1.15 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20258 Modified Files: Options.py classifier.py Log Message: Added some speculative options for more of Gary Robinson's ideas. Will explain on the spambayes list. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** Options.py 19 Sep 2002 09:34:56 -0000 1.20 --- Options.py 21 Sep 2002 02:46:20 -0000 1.21 *************** *** 150,155 **** max_discriminators: 16 ! # Use Gary Robinson's scheme for combining probabilities. use_robinson_probability: False """ --- 150,168 ---- max_discriminators: 16 ! ########################################################################### ! # Speculative options for Gary Robinson's ideas. 
These may go away, or ! # a bunch of incompatible stuff above may go away. ! ! # Use Gary's scheme for combining probabilities. ! use_robinson_combining: False ! ! # Use Gary's scheme for computing probabilities, along with its "a" and ! # "x" parameters. use_robinson_probability: False + robinson_probability_a: 1.0 + robinson_probability_x: 0.5 + + # Use Gary's scheme for ranking probabilities. + use_robinson_ranking: False """ *************** *** 189,193 **** --- 202,210 ---- 'unknown_spamprob': float_cracker, 'max_discriminators': int_cracker, + 'use_robinson_combining': boolean_cracker, 'use_robinson_probability': boolean_cracker, + 'robinson_probability_a': float_cracker, + 'robinson_probability_x': float_cracker, + 'use_robinson_ranking': boolean_cracker, }, } Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** classifier.py 21 Sep 2002 00:15:16 -0000 1.14 --- classifier.py 21 Sep 2002 02:46:20 -0000 1.15 *************** *** 314,357 **** heapreplace(nbest, x) ! if options.use_robinson_probability: ! # This combination method is due to Gary Robinson. ! # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html ! # In preliminary tests, it did just as well as Graham's scheme, ! # but creates a definite "middle ground" around 0.5 where false ! # negatives and false positives can actually found in non-trivial ! # number. ! P = Q = 1.0 ! num_clues = 0 ! for distance, prob, word, record in nbest: ! if prob is None: # it's one of the dummies nbest started with ! continue ! if record is not None: # else wordinfo doesn't know about it ! record.killcount += 1 ! if evidence: ! clues.append((word, prob)) ! num_clues += 1 ! P *= 1.0 - prob ! Q *= prob ! ! if num_clues: ! P = 1.0 - P**(1./num_clues) ! Q = 1.0 - Q**(1./num_clues) ! prob = (P-Q)/(P+Q) # in -1 .. 1 ! prob = 0.5 + prob/2 # shift to 0 .. 1 ! else: ! prob = 0.5 ! else: ! prob_product = inverse_prob_product = 1.0 ! for distance, prob, word, record in nbest: ! if prob is None: # it's one of the dummies nbest started with ! continue ! if record is not None: # else wordinfo doesn't know about it ! record.killcount += 1 ! if evidence: ! clues.append((word, prob)) ! prob_product *= prob ! inverse_prob_product *= 1.0 - prob ! prob = prob_product / (prob_product + inverse_prob_product) if evidence: --- 314,329 ---- heapreplace(nbest, x) ! prob_product = inverse_prob_product = 1.0 ! for distance, prob, word, record in nbest: ! if prob is None: # it's one of the dummies nbest started with ! continue ! if record is not None: # else wordinfo doesn't know about it ! record.killcount += 1 ! if evidence: ! clues.append((word, prob)) ! prob_product *= prob ! inverse_prob_product *= 1.0 - prob ! prob = prob_product / (prob_product + inverse_prob_product) if evidence: *************** *** 361,365 **** return prob - def learn(self, wordstream, is_spam, update_probabilities=True): """Teach the classifier by example. --- 333,336 ---- *************** *** 479,480 **** --- 450,601 ---- if record.hamcount == 0 == record.spamcount: del self.wordinfo[word] + + + #************************************************************************ + # Some options change so much behavior that it's better to write a + # different method. + # CAUTION: These end up overwriting methods of the same name above. 
+ # A subclass would be cleaner, but experiments will soon enough lead + # to only one of the alternatives surviving. + + def robinson_spamprob(self, wordstream, evidence=False): + """Return best-guess probability that wordstream is spam. + + wordstream is an iterable object producing words. + The return value is a float in [0.0, 1.0]. + + If optional arg evidence is True, the return value is a pair + probability, evidence + where evidence is a list of (word, probability) pairs. + """ + + from math import frexp + + # A priority queue to remember the MAX_DISCRIMINATORS best + # probabilities, where "best" means largest distance from 0.5. + # The tuples are (distance, prob, word, wordinfo[word]). + nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS + smallest_best = -1.0 + + wordinfoget = self.wordinfo.get + now = time.time() + for word in Set(wordstream): + record = wordinfoget(word) + if record is None: + prob = UNKNOWN_SPAMPROB + else: + record.atime = now + prob = record.spamprob + + distance = abs(prob - 0.5) + if distance > smallest_best: + heapreplace(nbest, (distance, prob, word, record)) + smallest_best = nbest[0][0] + + # Compute the probability. + clues = [] + + # This combination method is due to Gary Robinson. + # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html + # In preliminary tests, it did just as well as Graham's scheme, + # but creates a definite "middle ground" around 0.5 where false + # negatives and false positives can actually found in non-trivial + # number. + + # The real P = this P times 2**Pexp. Likewise for Q. We're + # simulating unbounding dynamic float range by hand. If this pans + # out, *maybe* we should store logarithms in the database instead + # and just add them here. + P = Q = 1.0 + Pexp = Qexp = 0 + num_clues = 0 + for distance, prob, word, record in nbest: + if prob is None: # it's one of the dummies nbest started with + continue + if record is not None: # else wordinfo doesn't know about it + record.killcount += 1 + if evidence: + clues.append((word, prob)) + num_clues += 1 + P *= 1.0 - prob + Q *= prob + if P < 1e-200: # move back into range + P, e = frexp(P) + Pexp += e + if Q < 1e-200: # move back into range + Q, e = frexp(Q) + Qexp += e + + P, e = frexp(P) + Pexp += e + Q, e = frexp(Q) + Qexp += e + + if num_clues: + #P = 1.0 - P**(1./num_clues) + #Q = 1.0 - Q**(1./num_clues) + # + # (x*2**e)**n = x**n * 2**(e*n) + n = 1.0 / num_clues + P = 1.0 - P**n * 2.0**(Pexp * n) + Q = 1.0 - P**n * 2.0**(Qexp * n) + + prob = (P-Q)/(P+Q) # in -1 .. 1 + prob = 0.5 + prob/2 # shift to 0 .. 1 + else: + prob = 0.5 + + if evidence: + clues.sort(lambda a, b: cmp(a[1], b[1])) + return prob, clues + else: + return prob + + if options.use_robinson_combining: + spamprob = robinson_spamprob + + def robinson_update_probabilities(self): + """Update the word probabilities in the spam database. + + This computes a new probability for every word in the database, + so can be expensive. learn() and unlearn() update the probabilities + each time by default. Thay have an optional argument that allows + to skip this step when feeding in many messages, and in that case + you should call update_probabilities() after feeding the last + message and before calling spamprob(). + """ + + nham = float(self.nham or 1) + nspam = float(self.nspam or 1) + A = options.robinson_probability_a + X = options.robinson_probability_x + AoverX = A/X + for word, record in self.wordinfo.iteritems(): + # Compute prob(msg is spam | msg contains word). 
+ # This is the Graham calculation, but stripped of biases, and + # of clamping into 0.01 thru 0.99. + hamcount = min(record.hamcount, nham) + hamratio = hamcount / nham + + spamcount = min(record.spamcount, nspam) + spamratio = spamcount / nspam + + prob = spamratio / (hamratio + spamratio) + + # Now do Robinson's Bayesian adjustment. + # + # a + (n * p(w)) + # f(w) = --------------- + # (a / x) + n + n = hamcount + spamratio + prob = (A + n * prob) / (AoverX + n) + + if record.spamprob != prob: + record.spamprob = prob + # The next seemingly pointless line appears to be a hack + # to allow a persistent db to realize the record has changed. + self.wordinfo[word] = record + + + if options.use_robinson_probability: + update_probabilities = robinson_update_probabilities From tim_one@users.sourceforge.net Sat Sep 21 04:43:15 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 20 Sep 2002 20:43:15 -0700 Subject: [Spambayes-checkins] spambayes classifier.py,1.15,1.16 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30766 Modified Files: classifier.py Log Message: Fixed two egregious typos in the code (one a cut 'n paste screwup, the other a word-completion snafu). Curiously, I don't think that repairing the math is actually going to make much difference! Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** classifier.py 21 Sep 2002 02:46:20 -0000 1.15 --- classifier.py 21 Sep 2002 03:43:13 -0000 1.16 *************** *** 504,508 **** # The real P = this P times 2**Pexp. Likewise for Q. We're ! # simulating unbounding dynamic float range by hand. If this pans # out, *maybe* we should store logarithms in the database instead # and just add them here. --- 504,508 ---- # The real P = this P times 2**Pexp. Likewise for Q. We're ! # simulating unbounded dynamic float range by hand. If this pans # out, *maybe* we should store logarithms in the database instead # and just add them here. *************** *** 539,543 **** n = 1.0 / num_clues P = 1.0 - P**n * 2.0**(Pexp * n) ! Q = 1.0 - P**n * 2.0**(Qexp * n) prob = (P-Q)/(P+Q) # in -1 .. 1 --- 539,543 ---- n = 1.0 / num_clues P = 1.0 - P**n * 2.0**(Pexp * n) ! Q = 1.0 - Q**n * 2.0**(Qexp * n) prob = (P-Q)/(P+Q) # in -1 .. 1 *************** *** 588,592 **** # f(w) = --------------- # (a / x) + n ! n = hamcount + spamratio prob = (A + n * prob) / (AoverX + n) --- 588,593 ---- # f(w) = --------------- # (a / x) + n ! ! 
n = hamcount + spamcount prob = (A + n * prob) / (AoverX + n) From montanaro@users.sourceforge.net Sat Sep 21 15:15:34 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Sat, 21 Sep 2002 07:15:34 -0700 Subject: [Spambayes-checkins] spambayes rebal.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv27918 Modified Files: rebal.py Log Message: provide a weak check against mixing ham and spam Index: rebal.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rebal.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** rebal.py 14 Sep 2002 03:32:47 -0000 1.3 --- rebal.py 21 Sep 2002 14:15:30 -0000 1.4 *************** *** 127,130 **** --- 127,138 ---- return 1 + # weak check against mixing ham and spam + if ("Ham" in setpfx and "Spam" in resdir or + "Spam" in setpfx and "Ham" in resdir): + yn = raw_input("Reservoir and Set dirs appear not to match. " + "Continue? (y/n) ") + if yn.lower()[0:1] != 'y': + return 1 + # if necessary, migrate random files to the reservoir for (dir, fs) in stuff: From tim_one@users.sourceforge.net Sat Sep 21 21:25:52 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 21 Sep 2002 13:25:52 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.21,1.22 classifier.py,1.16,1.17 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18775 Modified Files: Options.py classifier.py Log Message: New option robinson_minimum_prob_strength. On my large test, and on small random-subset tests, setting this to 0.1 yields (and max_discriminators to 1500) a remarkable improvement in the f-n rate, even over what the all-default (Graham-like) scheme does. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** Options.py 21 Sep 2002 02:46:20 -0000 1.21 --- Options.py 21 Sep 2002 20:25:49 -0000 1.22 *************** *** 96,102 **** # yet any bound in sight for how low this can go (0.075 would work as # well as 0.90 on Tim's large c.l.py data). ! # For Gary Robinson's scheme, 0.50 works best for *us*. Other people ! # who have implemented Graham's scheme, and stuck to it in most respects, ! # report values closer to 0.70 working best for them. spam_cutoff: 0.90 --- 96,103 ---- # yet any bound in sight for how low this can go (0.075 would work as # well as 0.90 on Tim's large c.l.py data). ! # For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked ! # best in all reports so far. Note that you can easily deduce the effect ! # of setting spam_cutoff to any particular value by studying the score ! # histograms -- there's no need to run a test again to see what would happen. spam_cutoff: 0.90 *************** *** 153,156 **** --- 154,158 ---- # Speculative options for Gary Robinson's ideas. These may go away, or # a bunch of incompatible stuff above may go away. + # CAUTION: evidence to date suggest setting spam_cutoff # Use Gary's scheme for combining probabilities. *************** *** 165,168 **** --- 167,184 ---- # Use Gary's scheme for ranking probabilities. use_robinson_ranking: False + + # When scoring a message, ignore all words with + # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength. + # By default (0.0), nothing is ignored. 
+ # Tim got a pretty clear improvement in f-n rate on his hasn't-improved-in- + # a-long-time large c.l.py test by using 0.1. No other values have been + # tried yet. + # Neil Schemenauer also reported good results from 0.1, making the all- + # Robinson scheme match the all-default Graham-like scheme on a smaller + # and different corpus. + # NOTE: Changing this may change the best spam_cutoff value for your + # corpus. Since one effect is to separate the means more, you'll probably + # want a higher spam_cutoff. + robinson_minimum_prob_strength: 0.0 """ *************** *** 207,210 **** --- 223,227 ---- 'robinson_probability_x': float_cracker, 'use_robinson_ranking': boolean_cracker, + 'robinson_minimum_prob_strength': float_cracker, }, } Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** classifier.py 21 Sep 2002 03:43:13 -0000 1.16 --- classifier.py 21 Sep 2002 20:25:49 -0000 1.17 *************** *** 471,474 **** --- 471,475 ---- from math import frexp + mindist = options.robinson_minimum_prob_strength # A priority queue to remember the MAX_DISCRIMINATORS best *************** *** 489,493 **** distance = abs(prob - 0.5) ! if distance > smallest_best: heapreplace(nbest, (distance, prob, word, record)) smallest_best = nbest[0][0] --- 490,494 ---- distance = abs(prob - 0.5) ! if distance >= mindist and distance > smallest_best: heapreplace(nbest, (distance, prob, word, record)) smallest_best = nbest[0][0] From tim_one@users.sourceforge.net Sat Sep 21 22:11:52 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 21 Sep 2002 14:11:52 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.22,1.23 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30007 Modified Files: Options.py Log Message: Nuked a stray sentence fragment in a comment. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** Options.py 21 Sep 2002 20:25:49 -0000 1.22 --- Options.py 21 Sep 2002 21:11:50 -0000 1.23 *************** *** 154,158 **** # Speculative options for Gary Robinson's ideas. These may go away, or # a bunch of incompatible stuff above may go away. - # CAUTION: evidence to date suggest setting spam_cutoff # Use Gary's scheme for combining probabilities. --- 154,157 ---- From tim_one@users.sourceforge.net Sat Sep 21 22:19:43 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 21 Sep 2002 14:19:43 -0700 Subject: [Spambayes-checkins] spambayes rebal.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv31474 Modified Files: rebal.py Log Message: Stopped making -Q imply -q: these are very different kinds of messages, and it wasn't at all clear from the docs that -Q would imply -q. Index: rebal.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rebal.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** rebal.py 21 Sep 2002 14:15:30 -0000 1.4 --- rebal.py 21 Sep 2002 21:19:40 -0000 1.5 *************** *** 9,17 **** -r res - specify an alternate reservoir [%(RESDIR)s] -s set - specify an alternate Set pfx [%(SETPFX)s] ! 
-n num - specify number of files per dir [%(NPERDIR)s] -v - tell user what's happening [%(VERBOSE)s] -q - be quiet about what's happening [not %(VERBOSE)s] -c - confirm file moves into Set directory [%(CONFIRM)s] ! -Q - be quiet and don't confirm moves The script will work with a variable number of Set directories, but they --- 9,17 ---- -r res - specify an alternate reservoir [%(RESDIR)s] -s set - specify an alternate Set pfx [%(SETPFX)s] ! -n num - specify number of files per Set dir desired [%(NPERDIR)s] -v - tell user what's happening [%(VERBOSE)s] -q - be quiet about what's happening [not %(VERBOSE)s] -c - confirm file moves into Set directory [%(CONFIRM)s] ! -Q - don't confirm moves; this is independent of -v/-q The script will work with a variable number of Set directories, but they *************** *** 104,108 **** verbose = False elif opt == "-Q": ! verbose = confirm = False elif opt == "-h": usage() --- 104,108 ---- verbose = False elif opt == "-Q": ! confirm = False elif opt == "-h": usage() From gvanrossum@users.sourceforge.net Sun Sep 22 01:19:01 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Sat, 21 Sep 2002 17:19:01 -0700 Subject: [Spambayes-checkins] spambayes rates.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2806 Modified Files: rates.py Log Message: When basename.txt doesn't exist, try basename. Index: rates.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rates.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** rates.py 14 Sep 2002 00:03:51 -0000 1.4 --- rates.py 22 Sep 2002 00:18:58 -0000 1.5 *************** *** 31,35 **** def doit(basename): ! ifile = file(basename + '.txt') interesting = filter(lambda line: line.startswith('-> '), ifile) ifile.close() --- 31,38 ---- def doit(basename): ! try: ! ifile = file(basename + '.txt') ! except IOError: ! ifile = file(basename) interesting = filter(lambda line: line.startswith('-> '), ifile) ifile.close() From tim_one@users.sourceforge.net Sun Sep 22 05:19:10 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 21 Sep 2002 21:19:10 -0700 Subject: [Spambayes-checkins] spambayes rates.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9977 Modified Files: rates.py Log Message: Brought the module docstring back into line with the truth. Got rid of some unused computations. Index: rates.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rates.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** rates.py 22 Sep 2002 00:18:58 -0000 1.5 --- rates.py 22 Sep 2002 04:19:08 -0000 1.6 *************** *** 7,18 **** basename + '.txt' ! contains output from timtest.py, scans that file for summary statistics, ! displays them to stdout, and also writes them to file basename + 's.txt' ! (where the 's' means 'summary'). This doesn't need a full output file, and ! will display stuff for as far as the output file has gotten so far. Two of these summary files can later be fed to cmp.py. --- 7,22 ---- basename + '.txt' + or + basename ! contains output from one of the test drivers (timcv, mboxtest, timtest), ! scans that file for summary statistics, displays them to stdout, and also ! writes them to file basename + 's.txt' ! (where the 's' means 'summary'). This doesn't need a full output file ! 
from a test run, and will display stuff for as far as the output file ! has gotten so far. Two of these summary files can later be fed to cmp.py. *************** *** 49,53 **** ntests = nfn = nfp = 0 sumfnrate = sumfprate = 0.0 - ntrainedham = ntrainedspam = 0 for line in interesting: --- 53,56 ---- *************** *** 58,63 **** #-> tested 4000 hams & 2750 spams against 8000 hams & 5500 spams if line.startswith('-> tested '): - ntrainedham += int(fields[-5]) - ntrainedspam += int(fields[-2]) ntests += 1 continue --- 61,64 ---- From tim_one@users.sourceforge.net Sun Sep 22 05:59:56 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 21 Sep 2002 21:59:56 -0700 Subject: [Spambayes-checkins] spambayes LICENSE.txt,NONE,1.1 README.txt,1.21,1.22 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16347 Modified Files: README.txt Added Files: LICENSE.txt Log Message: Added a simplified version of the PSF license to the project, and asserted copyright for the PSF. "Simplified" means got rid of references to Python, and dropped the stack of BeOpen/CNRI/CWI licenses (they clearly have no claim on *this* software). --- NEW FILE: LICENSE.txt --- Copyright (C) 2002 Python Software Foundation; All Rights Reserved The Python Software Foundation (PSF) holds copyright on all material in this project. You may use it under the terms of the PSF license: PSF LICENSE AGREEMENT FOR THE SPAMBAYES PROJECT ----------------------------------------------- 1. This LICENSE AGREEMENT is between the Python Software Foundation ("PSF"), and the Individual or Organization ("Licensee") accessing and otherwise using the spambayes software ("Software") in source or binary form and its associated documentation. 2. Subject to the terms and conditions of this License Agreement, PSF hereby grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce, analyze, test, perform and/or display publicly, prepare derivative works, distribute, and otherwise use the Software alone or in any derivative version, provided, however, that PSF's License Agreement and PSF's notice of copyright, i.e., "Copyright (c) 2002 Python Software Foundation; All Rights Reserved" are retained the Software alone or in any derivative version prepared by Licensee. 3. In the event Licensee prepares a derivative work that is based on or incorporates the Software or any part thereof, and wants to make the derivative work available to others as provided herein, then Licensee hereby agrees to include in any such work a brief summary of the changes made to the Software. 4. PSF is making the Software available to Licensee on an "AS IS" basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT INFRINGE ANY THIRD PARTY RIGHTS. 5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING THE SOFTWARE, OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF. 6. This License Agreement will automatically terminate upon a material breach of its terms and conditions. 7. Nothing in this License Agreement shall be deemed to create any relationship of agency, partnership, or joint venture between PSF and Licensee. 
This License Agreement does not grant permission to use PSF trademarks or trade name in a trademark sense to endorse or promote products or services of Licensee, or any third party. 8. By copying, installing or otherwise using the Software, Licensee agrees to be bound by the terms and conditions of this License Agreement. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** README.txt 20 Sep 2002 03:15:13 -0000 1.21 --- README.txt 22 Sep 2002 04:59:54 -0000 1.22 *************** *** 1,2 **** --- 1,9 ---- + Copyright (C) 2002 Python Software Foundation; All Rights Reserved + + The Python Software Foundation (PSF) holds copyright on all material + in this project. You may use it under the terms of the PSF license; + see LICENSE.txt. + + Assorted clues. *************** *** 70,74 **** spam and non-spam mail. The database in intended for use with neilfilter.py. ! neilfilter.py A delivery agent that uses the CDB created by neiltrain.py and --- 77,81 ---- spam and non-spam mail. The database in intended for use with neilfilter.py. ! neilfilter.py A delivery agent that uses the CDB created by neiltrain.py and From gvanrossum@users.sourceforge.net Sun Sep 22 07:58:38 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Sat, 21 Sep 2002 23:58:38 -0700 Subject: [Spambayes-checkins] spambayes heapq.py,NONE,1.1 sets.py,NONE,1.1 TestDriver.py,1.4,1.5 cdb.py,1.2,1.3 hammie.py,1.18,1.19 mboxtest.py,1.7,1.8 timcv.py,1.6,1.7 timtest.py,1.26,1.27 tokenizer.py,1.29,1.30 unheader.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv32550 Modified Files: TestDriver.py cdb.py hammie.py mboxtest.py timcv.py timtest.py tokenizer.py unheader.py Added Files: heapq.py sets.py Log Message: Make this Python 2.2.1-compatible[*], by: - adding "from __future__ import generators" to all files using 'yield' in 6 files; - spelling out "for i, x in enumerate(s)" using range(len(s)) etc. in 2 files; - in tokenizer.py, changing get_content_maintype() and get_content_type() into get_main_type('text') and get_type('text/plain'), respectively (the defaults are necessary because these older APIs default to None rather than to text/plain as they should in most contexts. [**] I haven't tried to run all tools, but I've tried timcv.py, rates.py and cmp.py. This invokes most code that Tim wrote. I grepped for enumerate() and yield. [*] But not Python 2.2-compatible. There are too many places using True or False (none using bool() though). [**] XXX should a 'text/plain' default be added to other uses of get_type() in tokenizer.py? The default is None, and I see one place that asks "if part.get_type() == 'text/plain'". --- NEW FILE: heapq.py --- # -*- coding: Latin-1 -*- """Heap queue algorithm (a.k.a. priority queue). Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for all k, counting elements from 0. For the sake of comparison, non-existing elements are considered to be infinite. The interesting property of a heap is that a[0] is always its smallest element. 
Usage: heap = [] # creates an empty heap heappush(heap, item) # pushes a new item on the heap item = heappop(heap) # pops the smallest item from the heap item = heap[0] # smallest item on the heap without popping it heapify(x) # transforms list into a heap, in-place, in linear time item = heapreplace(heap, item) # pops and returns smallest item, and adds # new item; the heap size is unchanged Our API differs from textbook heap algorithms as follows: - We use 0-based indexing. This makes the relationship between the index for a node and the indexes for its children slightly less obvious, but is more suitable since Python uses 0-based indexing. - Our heappop() method returns the smallest item, not the largest. These two make it possible to view the heap as a regular Python list without surprises: heap[0] is the smallest item, and heap.sort() maintains the heap invariant! """ # Original code by Kevin O'Connor, augmented by Tim Peters __about__ = """Heap queues [explanation by François Pinard] Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for all k, counting elements from 0. For the sake of comparison, non-existing elements are considered to be infinite. The interesting property of a heap is that a[0] is always its smallest element. The strange invariant above is meant to be an efficient memory representation for a tournament. The numbers below are `k', not a[k]: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 In the tree above, each cell `k' is topping `2*k+1' and `2*k+2'. In an usual binary tournament we see in sports, each cell is the winner over the two cells it tops, and we can trace the winner down the tree to see all opponents s/he had. However, in many computer applications of such tournaments, we do not need to trace the history of a winner. To be more memory efficient, when a winner is promoted, we try to replace it by something else at a lower level, and the rule becomes that a cell and the two cells it tops contain three different items, but the top cell "wins" over the two topped cells. If this heap invariant is protected at all time, index 0 is clearly the overall winner. The simplest algorithmic way to remove it and find the "next" winner is to move some loser (let's say cell 30 in the diagram above) into the 0 position, and then percolate this new 0 down the tree, exchanging values, until the invariant is re-established. This is clearly logarithmic on the total number of items in the tree. By iterating over all items, you get an O(n ln n) sort. A nice feature of this sort is that you can efficiently insert new items while the sort is going on, provided that the inserted items are not "better" than the last 0'th element you extracted. This is especially useful in simulation contexts, where the tree holds all incoming events, and the "win" condition means the smallest scheduled time. When an event schedule other events for execution, they are scheduled into the future, so they can easily go into the heap. So, a heap is a good structure for implementing schedulers (this is what I used for my MIDI sequencer :-). Various structures for implementing schedulers have been extensively studied, and heaps are good for this, as they are reasonably speedy, the speed is almost constant, and the worst case is not much different than the average case. However, there are other representations which are more efficient overall, yet the worst cases might be terrible. Heaps are also very useful in big disk sorts. 
You most probably all know that a big sort implies producing "runs"
(which are pre-sorted sequences, which size is usually related to the
amount of CPU memory), followed by a merging passes for these runs,
which merging is often very cleverly organised[1]. It is very
important that the initial sort produces the longest runs possible.
Tournaments are a good way to that. If, using all the memory
available to hold a tournament, you replace and percolate items that
happen to fit the current run, you'll produce runs which are twice
the size of the memory for random input, and much better for input
fuzzily ordered.

Moreover, if you output the 0'th item on disk and get an input which
may not fit in the current tournament (because the value "wins" over
the last output value), it cannot fit in the heap, so the size of the
heap decreases. The freed memory could be cleverly reused immediately
for progressively building a second heap, which grows at exactly the
same rate the first heap is melting. When the first heap completely
vanishes, you switch heaps and start a new run. Clever and quite
effective!

In a word, heaps are useful memory structures to know. I use them in
a few applications, and I think it is good to keep a `heap' module
around. :-)

--------------------
[1] The disk balancing algorithms which are current, nowadays, are
more annoying than clever, and this is a consequence of the seeking
capabilities of the disks. On devices which cannot seek, like big
tape drives, the story was quite different, and one had to be very
clever to ensure (far in advance) that each tape movement will be the
most effective possible (that is, will best participate at
"progressing" the merge). Some tapes were even able to read
backwards, and this was also used to avoid the rewinding time.
Believe me, real good tape sorts were quite spectacular to watch!
From all times, sorting has always been a Great Art! :-)
"""

def heappush(heap, item):
    """Push item onto heap, maintaining the heap invariant."""
    heap.append(item)
    _siftdown(heap, 0, len(heap)-1)

def heappop(heap):
    """Pop the smallest item off the heap, maintaining the heap invariant."""
    lastelt = heap.pop()    # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup(heap, 0)
    else:
        returnitem = lastelt
    return returnitem

def heapreplace(heap, item):
    """Pop and return the current smallest value, and add the new item.

    This is more efficient than heappop() followed by heappush(), and can be
    more appropriate when using a fixed-size heap. Note that the value
    returned may be larger than item! That constrains reasonable uses of
    this routine.
    """
    returnitem = heap[0]    # raises appropriate IndexError if heap is empty
    heap[0] = item
    _siftup(heap, 0)
    return returnitem

def heapify(x):
    """Transform list into a heap, in-place, in O(len(heap)) time."""
    n = len(x)
    # Transform bottom-up. The largest index there's any point to looking at
    # is the largest with a child index in-range, so must have 2*i + 1 < n,
    # or i < (n-1)/2. If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
    # j-1 is the largest, which is n//2 - 1. If n is odd = 2*j+1, this is
    # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
    for i in xrange(n//2 - 1, -1, -1):
        _siftup(x, i)

# 'heap' is a heap at all indices >= startpos, except possibly for pos. pos
# is the index of a leaf with a possibly out-of-order value. Restore the
# heap invariant.
def _siftdown(heap, startpos, pos): newitem = heap[pos] # Follow the path to the root, moving parents down until finding a place # newitem fits. while pos > startpos: parentpos = (pos - 1) >> 1 parent = heap[parentpos] if parent <= newitem: break heap[pos] = parent pos = parentpos heap[pos] = newitem # The child indices of heap index pos are already heaps, and we want to make # a heap at index pos too. We do this by bubbling the smaller child of # pos up (and so on with that child's children, etc) until hitting a leaf, # then using _siftdown to move the oddball originally at index pos into place. # # We *could* break out of the loop as soon as we find a pos where newitem <= # both its children, but turns out that's not a good idea, and despite that # many books write the algorithm that way. During a heap pop, the last array # element is sifted in, and that tends to be large, so that comparing it # against values starting from the root usually doesn't pay (= usually doesn't # get us out of the loop early). See Knuth, Volume 3, where this is # explained and quantified in an exercise. # # Cutting the # of comparisons is important, since these routines have no # way to extract "the priority" from an array element, so that intelligence # is likely to be hiding in custom __cmp__ methods, or in array elements # storing (priority, record) tuples. Comparisons are thus potentially # expensive. # # On random arrays of length 1000, making this change cut the number of # comparisons made by heapify() a little, and those made by exhaustive # heappop() a lot, in accord with theory. Here are typical results from 3 # runs (3 just to demonstrate how small the variance is): # # Compares needed by heapify Compares needed by 1000 heapppops # -------------------------- --------------------------------- # 1837 cut to 1663 14996 cut to 8680 # 1855 cut to 1659 14966 cut to 8678 # 1847 cut to 1660 15024 cut to 8703 # # Building the heap by using heappush() 1000 times instead required # 2198, 2148, and 2219 compares: heapify() is more efficient, when # you can use it. # # The total compares needed by list.sort() on the same lists were 8627, # 8627, and 8632 (this should be compared to the sum of heapify() and # heappop() compares): list.sort() is (unsurprisingly!) more efficient # for sorting. def _siftup(heap, pos): endpos = len(heap) startpos = pos newitem = heap[pos] # Bubble up the smaller child until hitting a leaf. childpos = 2*pos + 1 # leftmost child position while childpos < endpos: # Set childpos to index of smaller child. rightpos = childpos + 1 if rightpos < endpos and heap[rightpos] <= heap[childpos]: childpos = rightpos # Move the smaller child up. heap[pos] = heap[childpos] pos = childpos childpos = 2*pos + 1 # The leaf at pos is empty now. Put newitem there, and and bubble it up # to its final resting place (by sifting its parents down). heap[pos] = newitem _siftdown(heap, startpos, pos) if __name__ == "__main__": # Simple sanity test heap = [] data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0] for item in data: heappush(heap, item) sort = [] while heap: sort.append(heappop(heap)) print sort --- NEW FILE: sets.py --- """Classes to represent arbitrary sets (including sets of sets). This module implements sets using dictionaries whose values are ignored. The usual operations (union, intersection, deletion, etc.) are provided as both methods and operators. Important: sets are not sequences! 
While they support 'x in s', 'len(s)', and 'for x in s', none of those operations are unique for sequences; for example, mappings support all three as well. The characteristic operation for sequences is subscripting with small integers: s[i], for i in range(len(s)). Sets don't support subscripting at all. Also, sequences allow multiple occurrences and their elements have a definite order; sets on the other hand don't record multiple occurrences and don't remember the order of element insertion (which is why they don't support s[i]). The following classes are provided: BaseSet -- All the operations common to both mutable and immutable sets. This is an abstract class, not meant to be directly instantiated. Set -- Mutable sets, subclass of BaseSet; not hashable. ImmutableSet -- Immutable sets, subclass of BaseSet; hashable. An iterable argument is mandatory to create an ImmutableSet. _TemporarilyImmutableSet -- Not a subclass of BaseSet: just a wrapper around a Set, hashable, giving the same hash value as the immutable set equivalent would have. Do not use this class directly. Only hashable objects can be added to a Set. In particular, you cannot really add a Set as an element to another Set; if you try, what is actually added is an ImmutableSet built from it (it compares equal to the one you tried adding). When you ask if `x in y' where x is a Set and y is a Set or ImmutableSet, x is wrapped into a _TemporarilyImmutableSet z, and what's tested is actually `z in y'. """ # Code history: # # - Greg V. Wilson wrote the first version, using a different approach # to the mutable/immutable problem, and inheriting from dict. # # - Alex Martelli modified Greg's version to implement the current # Set/ImmutableSet approach, and make the data an attribute. # # - Guido van Rossum rewrote much of the code, made some API changes, # and cleaned up the docstrings. # # - Raymond Hettinger added a number of speedups and other # improvements. __all__ = ['BaseSet', 'Set', 'ImmutableSet'] class BaseSet(object): """Common base class for mutable and immutable sets.""" __slots__ = ['_data'] # Constructor def __init__(self): """This is an abstract class.""" # Don't call this from a concrete subclass! if self.__class__ is BaseSet: raise TypeError, ("BaseSet is an abstract class. " "Use Set or ImmutableSet.") # Standard protocols: __len__, __repr__, __str__, __iter__ def __len__(self): """Return the number of elements of a set.""" return len(self._data) def __repr__(self): """Return string representation of a set. This looks like 'Set([])'. """ return self._repr() # __str__ is the same as __repr__ __str__ = __repr__ def _repr(self, sorted=False): elements = self._data.keys() if sorted: elements.sort() return '%s(%r)' % (self.__class__.__name__, elements) def __iter__(self): """Return an iterator over the elements or a set. This is the keys iterator for the underlying dict. 
""" return self._data.iterkeys() # Equality comparisons using the underlying dicts def __eq__(self, other): self._binary_sanity_check(other) return self._data == other._data def __ne__(self, other): self._binary_sanity_check(other) return self._data != other._data # Copying operations def copy(self): """Return a shallow copy of a set.""" result = self.__class__() result._data.update(self._data) return result __copy__ = copy # For the copy module def __deepcopy__(self, memo): """Return a deep copy of a set; used by copy module.""" # This pre-creates the result and inserts it in the memo # early, in case the deep copy recurses into another reference # to this same set. A set can't be an element of itself, but # it can certainly contain an object that has a reference to # itself. from copy import deepcopy result = self.__class__() memo[id(self)] = result data = result._data value = True for elt in self: data[deepcopy(elt, memo)] = value return result # Standard set operations: union, intersection, both differences. # Each has an operator version (e.g. __or__, invoked with |) and a # method version (e.g. union). # Subtle: Each pair requires distinct code so that the outcome is # correct when the type of other isn't suitable. For example, if # we did "union = __or__" instead, then Set().union(3) would return # NotImplemented instead of raising TypeError (albeit that *why* it # raises TypeError as-is is also a bit subtle). def __or__(self, other): """Return the union of two sets as a new set. (I.e. all elements that are in either set.) """ if not isinstance(other, BaseSet): return NotImplemented result = self.__class__() result._data = self._data.copy() result._data.update(other._data) return result def union(self, other): """Return the union of two sets as a new set. (I.e. all elements that are in either set.) """ return self | other def __and__(self, other): """Return the intersection of two sets as a new set. (I.e. all elements that are in both sets.) """ if not isinstance(other, BaseSet): return NotImplemented if len(self) <= len(other): little, big = self, other else: little, big = other, self common = filter(big._data.has_key, little._data.iterkeys()) return self.__class__(common) def intersection(self, other): """Return the intersection of two sets as a new set. (I.e. all elements that are in both sets.) """ return self & other def __xor__(self, other): """Return the symmetric difference of two sets as a new set. (I.e. all elements that are in exactly one of the sets.) """ if not isinstance(other, BaseSet): return NotImplemented result = self.__class__() data = result._data value = True selfdata = self._data otherdata = other._data for elt in selfdata: if elt not in otherdata: data[elt] = value for elt in otherdata: if elt not in selfdata: data[elt] = value return result def symmetric_difference(self, other): """Return the symmetric difference of two sets as a new set. (I.e. all elements that are in exactly one of the sets.) """ return self ^ other def __sub__(self, other): """Return the difference of two sets as a new Set. (I.e. all elements that are in this set and not in the other.) """ if not isinstance(other, BaseSet): return NotImplemented result = self.__class__() data = result._data otherdata = other._data value = True for elt in self: if elt not in otherdata: data[elt] = value return result def difference(self, other): """Return the difference of two sets as a new Set. (I.e. all elements that are in this set and not in the other.) 
""" return self - other # Membership test def __contains__(self, element): """Report whether an element is a member of a set. (Called in response to the expression `element in self'.) """ try: return element in self._data except TypeError: transform = getattr(element, "_as_temporarily_immutable", None) if transform is None: raise # re-raise the TypeError exception we caught return transform() in self._data # Subset and superset test def issubset(self, other): """Report whether another set contains this set.""" self._binary_sanity_check(other) if len(self) > len(other): # Fast check for obvious cases return False otherdata = other._data for elt in self: if elt not in otherdata: return False return True def issuperset(self, other): """Report whether this set contains another set.""" self._binary_sanity_check(other) if len(self) < len(other): # Fast check for obvious cases return False selfdata = self._data for elt in other: if elt not in selfdata: return False return True # Inequality comparisons using the is-subset relation. __le__ = issubset __ge__ = issuperset def __lt__(self, other): self._binary_sanity_check(other) return len(self) < len(other) and self.issubset(other) def __gt__(self, other): self._binary_sanity_check(other) return len(self) > len(other) and self.issuperset(other) # Assorted helpers def _binary_sanity_check(self, other): # Check that the other argument to a binary operation is also # a set, raising a TypeError otherwise. if not isinstance(other, BaseSet): raise TypeError, "Binary operation only permitted between sets" def _compute_hash(self): # Calculate hash code for a set by xor'ing the hash codes of # the elements. This ensures that the hash code does not depend # on the order in which elements are added to the set. This is # not called __hash__ because a BaseSet should not be hashable; # only an ImmutableSet is hashable. result = 0 for elt in self: result ^= hash(elt) return result def _update(self, iterable): # The main loop for update() and the subclass __init__() methods. data = self._data # Use the fast update() method when a dictionary is available. if isinstance(iterable, BaseSet): data.update(iterable._data) return if isinstance(iterable, dict): data.update(iterable) return value = True it = iter(iterable) while True: try: for element in it: data[element] = value return except TypeError: transform = getattr(element, "_as_immutable", None) if transform is None: raise # re-raise the TypeError exception we caught data[transform()] = value class ImmutableSet(BaseSet): """Immutable set class.""" __slots__ = ['_hashcode'] # BaseSet + hashing def __init__(self, iterable=None): """Construct an immutable set from an optional iterable.""" self._hashcode = None self._data = {} if iterable is not None: self._update(iterable) def __hash__(self): if self._hashcode is None: self._hashcode = self._compute_hash() return self._hashcode class Set(BaseSet): """ Mutable set class.""" __slots__ = [] # BaseSet + operations requiring mutability; no hashing def __init__(self, iterable=None): """Construct a set from an optional iterable.""" self._data = {} if iterable is not None: self._update(iterable) def __hash__(self): """A Set cannot be hashed.""" # We inherit object.__hash__, so we must deny this explicitly raise TypeError, "Can't hash a Set, only an ImmutableSet." # In-place union, intersection, differences. # Subtle: The xyz_update() functions deliberately return None, # as do all mutating operations on built-in container types. 
# The __xyz__ spellings have to return self, though. def __ior__(self, other): """Update a set with the union of itself and another.""" self._binary_sanity_check(other) self._data.update(other._data) return self def union_update(self, other): """Update a set with the union of itself and another.""" self |= other def __iand__(self, other): """Update a set with the intersection of itself and another.""" self._binary_sanity_check(other) self._data = (self & other)._data return self def intersection_update(self, other): """Update a set with the intersection of itself and another.""" self &= other def __ixor__(self, other): """Update a set with the symmetric difference of itself and another.""" self._binary_sanity_check(other) data = self._data value = True for elt in other: if elt in data: del data[elt] else: data[elt] = value return self def symmetric_difference_update(self, other): """Update a set with the symmetric difference of itself and another.""" self ^= other def __isub__(self, other): """Remove all elements of another set from this set.""" self._binary_sanity_check(other) data = self._data for elt in other: if elt in data: del data[elt] return self def difference_update(self, other): """Remove all elements of another set from this set.""" self -= other # Python dict-like mass mutations: update, clear def update(self, iterable): """Add all values from an iterable (such as a list or file).""" self._update(iterable) def clear(self): """Remove all elements from this set.""" self._data.clear() # Single-element mutations: add, remove, discard def add(self, element): """Add an element to a set. This has no effect if the element is already present. """ try: self._data[element] = True except TypeError: transform = getattr(element, "_as_immutable", None) if transform is None: raise # re-raise the TypeError exception we caught self._data[transform()] = True def remove(self, element): """Remove an element from a set; it must be a member. If the element is not a member, raise a KeyError. """ try: del self._data[element] except TypeError: transform = getattr(element, "_as_temporarily_immutable", None) if transform is None: raise # re-raise the TypeError exception we caught del self._data[transform()] def discard(self, element): """Remove an element from a set if it is a member. If the element is not a member, do nothing. """ try: self.remove(element) except KeyError: pass def pop(self): """Remove and return an arbitrary set element.""" return self._data.popitem()[0] def _as_immutable(self): # Return a copy of self as an immutable set return ImmutableSet(self) def _as_temporarily_immutable(self): # Return self wrapped in a temporarily immutable set return _TemporarilyImmutableSet(self) class _TemporarilyImmutableSet(BaseSet): # Wrap a mutable set as if it was temporarily immutable. # This only supplies hashing and equality comparisons. def __init__(self, set): self._set = set self._data = set._data # Needed by ImmutableSet.__eq__() def __hash__(self): return self._set._compute_hash() Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** TestDriver.py 14 Sep 2002 00:03:51 -0000 1.4 --- TestDriver.py 22 Sep 2002 06:58:36 -0000 1.5 *************** *** 61,65 **** format = "%6.2f %" + str(ndigits) + "d" ! 
for i, n in enumerate(self.buckets): print format % (100.0 * i / self.nbuckets, n), print '*' * ((n + hunit - 1) // hunit) --- 61,66 ---- format = "%6.2f %" + str(ndigits) + "d" ! for i in range(len(self.buckets)): ! n = self.buckets[i] print format % (100.0 * i / self.nbuckets, n), print '*' * ((n + hunit - 1) // hunit) Index: cdb.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cdb.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** cdb.py 11 Sep 2002 06:21:22 -0000 1.2 --- cdb.py 22 Sep 2002 06:58:36 -0000 1.3 *************** *** 6,9 **** --- 6,12 ---- """ + + from __future__ import generators + import os import struct Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** hammie.py 20 Sep 2002 19:30:52 -0000 1.18 --- hammie.py 22 Sep 2002 06:58:36 -0000 1.19 *************** *** 28,31 **** --- 28,33 ---- """ + from __future__ import generators + import sys import os Index: mboxtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** mboxtest.py 17 Sep 2002 17:57:39 -0000 1.7 --- mboxtest.py 22 Sep 2002 06:58:36 -0000 1.8 *************** *** 19,22 **** --- 19,24 ---- """ + from __future__ import generators + import getopt import mailbox Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** timcv.py 14 Sep 2002 22:01:42 -0000 1.6 --- timcv.py 22 Sep 2002 06:58:36 -0000 1.7 *************** *** 32,35 **** --- 32,37 ---- """ + from __future__ import generators + import os import sys Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.26 retrieving revision 1.27 diff -C2 -d -r1.26 -r1.27 *** timtest.py 14 Sep 2002 00:03:51 -0000 1.26 --- timtest.py 22 Sep 2002 06:58:36 -0000 1.27 *************** *** 18,21 **** --- 18,23 ---- """ + from __future__ import generators + import os import sys Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** tokenizer.py 20 Sep 2002 06:18:24 -0000 1.29 --- tokenizer.py 22 Sep 2002 06:58:36 -0000 1.30 *************** *** 2,5 **** --- 2,7 ---- """Module to tokenize email messages for spam filtering.""" + from __future__ import generators + import email import re *************** *** 513,517 **** redundant_html = Set() for part in msg.walk(): ! if part.get_content_type() == 'multipart/alternative': # Descend this part of the tree, adding any redundant HTML text # part to redundant_html. --- 515,519 ---- redundant_html = Set() for part in msg.walk(): ! if part.get_type() == 'multipart/alternative': # Descend this part of the tree, adding any redundant HTML text # part to redundant_html. *************** *** 520,524 **** while stack: subpart = stack.pop() ! ctype = subpart.get_content_type() if ctype == 'text/plain': textpart = subpart --- 522,526 ---- while stack: subpart = stack.pop() ! 
ctype = subpart.get_type('text/plain') if ctype == 'text/plain': textpart = subpart *************** *** 535,539 **** text.add(htmlpart) ! elif part.get_content_maintype() == 'text': text.add(part) --- 537,541 ---- text.add(htmlpart) ! elif part.get_main_type('text') == 'text': text.add(part) *************** *** 544,548 **** # have redundant content, so it goes. def textparts(msg): ! return Set(filter(lambda part: part.get_content_maintype() == 'text', msg.walk())) --- 546,550 ---- # have redundant content, so it goes. def textparts(msg): ! return Set(filter(lambda part: part.get_main_type('text') == 'text', msg.walk())) *************** *** 1019,1023 **** # Remove HTML/XML tags. ! if (part.get_content_type() == "text/plain" or not options.retain_pure_html_tags): text = html_re.sub(' ', text) --- 1021,1025 ---- # Remove HTML/XML tags. ! if (part.get_type() == "text/plain" or not options.retain_pure_html_tags): text = html_re.sub(' ', text) Index: unheader.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/unheader.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** unheader.py 7 Sep 2002 05:50:42 -0000 1.1 --- unheader.py 22 Sep 2002 06:58:36 -0000 1.2 *************** *** 18,22 **** """replace first value for hdr with newval""" hdr = hdr.lower() ! for (i, (k, v)) in enumerate(self._headers): if k.lower() == hdr: self._headers[i] = (k, newval) --- 18,23 ---- """replace first value for hdr with newval""" hdr = hdr.lower() ! for i in range(len(self._headers)): ! k, v = self._headers[i] if k.lower() == hdr: self._headers[i] = (k, newval) From tim_one@users.sourceforge.net Sun Sep 22 08:45:30 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sun, 22 Sep 2002 00:45:30 -0700 Subject: [Spambayes-checkins] spambayes rebal.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7977 Modified Files: rebal.py Log Message: Removed use of 2.3 "string in string"-ism. Index: rebal.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rebal.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** rebal.py 21 Sep 2002 21:19:40 -0000 1.5 --- rebal.py 22 Sep 2002 07:45:27 -0000 1.6 *************** *** 128,133 **** # weak check against mixing ham and spam ! if ("Ham" in setpfx and "Spam" in resdir or ! "Spam" in setpfx and "Ham" in resdir): yn = raw_input("Reservoir and Set dirs appear not to match. " "Continue? (y/n) ") --- 128,133 ---- # weak check against mixing ham and spam ! if (setpfx.find("Ham") >= 0 and resdir.find("Spam") >= 0 or ! setpfx.find("Spam") >= 0 and resdir.find("Ham") >= 0): yn = raw_input("Reservoir and Set dirs appear not to match. " "Continue? (y/n) ") From anthonybaxter@users.sourceforge.net Sun Sep 22 08:48:05 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Sun, 22 Sep 2002 00:48:05 -0700 Subject: [Spambayes-checkins] website developer.ht,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv8235 Modified Files: developer.ht Log Message: 2.2.1 now supported. 
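[Note on the rebal.py change above: Python 2.2.x only accepts a single
character as the left operand of 'in' for strings, so "Ham" in setpfx
raises TypeError there; general "string in string" tests arrived in 2.3.
A portable spelling, as a sketch (the helper name is illustrative):

    def contains(big, small):
        # str.find() returns the substring's index, or -1 when it is
        # absent, on all Python 2.x versions.
        return big.find(small) >= 0
]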
Index: developer.ht =================================================================== RCS file: /cvsroot/spambayes/website/developer.ht,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** developer.ht 19 Sep 2002 08:57:58 -0000 1.2 --- developer.ht 22 Sep 2002 07:48:03 -0000 1.3 *************** *** 12,16 **** come crying <wink>.

    !

    This project uses the absolute bleeding edge of python code, available from CVS on sourceforge.

    The spambayes code itself is also available via CVS --- 12,16 ---- come crying <wink>.

    !

    This project works with either the absolute bleeding edge of python code, available from CVS on sourceforge, or with Python 2.2.1 (not 2.2, or 2.1.3).

    The spambayes code itself is also available via CVS

From tim_one@users.sourceforge.net  Sun Sep 22 09:31:50 2002
From: tim_one@users.sourceforge.net (Tim Peters)
Date: Sun, 22 Sep 2002 01:31:50 -0700
Subject: [Spambayes-checkins] spambayes TestDriver.py,1.5,1.6
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15892

Modified Files:
	TestDriver.py
Log Message:
Augmented the Hist class to compute and display mean and (sample) sdev.

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** TestDriver.py	22 Sep 2002 06:58:36 -0000	1.5
--- TestDriver.py	22 Sep 2002 08:31:48 -0000	1.6
***************
*** 36,39 ****
--- 36,42 ----
          self.buckets = [0] * nbuckets
          self.nbuckets = nbuckets
+         self.n = 0        # number of data points
+         self.sum = 0.0    # sum of their values
+         self.sumsq = 0.0  # sum of their squares

      def add(self, x):
***************
*** 44,47 ****
--- 47,55 ----
          self.buckets[i] += 1

+         self.n += 1
+         x *= 100.0
+         self.sum += x
+         self.sumsq += x*x
+
      def __iadd__(self, other):
          if self.nbuckets != other.nbuckets:
***************
*** 49,55 ****
--- 57,77 ----
          for i in range(self.nbuckets):
              self.buckets[i] += other.buckets[i]
+         self.n += other.n
+         self.sum += other.sum
+         self.sumsq += other.sumsq
          return self

      def display(self, WIDTH=60):
+         from math import sqrt
+         if self.n > 1:
+             mean = self.sum / self.n
+             # sum (x_i - mean)**2 = sum (x_i**2 - 2*x_i*mean + mean**2) =
+             # sum x_i**2 - 2*mean*sum x_i + sum mean**2 =
+             # sum x_i**2 - 2*mean*mean*n + n*mean**2 =
+             # sum x_i**2 - n*mean**2
+             samplevar = (self.sumsq - self.n * mean**2) / (self.n - 1)
+             print "%d items; mean %.2f; sample sdev %.2f" % (self.n,
+                   mean, sqrt(samplevar))
+
          biggest = max(self.buckets)
          hunit, r = divmod(biggest, WIDTH)

From montanaro@users.sourceforge.net  Mon Sep 23 04:13:33 2002
From: montanaro@users.sourceforge.net (Skip Montanaro)
Date: Sun, 22 Sep 2002 20:13:33 -0700
Subject: [Spambayes-checkins] spambayes Options.py,1.23,1.24 tokenizer.py,1.30,1.31
Message-ID: 

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1989

Modified Files:
	Options.py tokenizer.py
Log Message:
Added two new options: check_octets and octet_prefix_size.  If check_octets
is True, any application/octet-stream parts will be tokenized simply by
returning octet_prefix_size bytes of the first line of the base64-encoded
stuff.  For example, DOS/Windows executables seem to begin with the string
"TVqQA".  If enabled, the token "octet:TVqQA" would be returned for such
sections, provided they have the appropriate content type and transfer
encoding.  By default, check_octets is False, preserving preexisting
behavior.  I can't test this very well since I've pretty ruthlessly purged
viruses from my Spam corpus.

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** Options.py	21 Sep 2002 21:11:50 -0000	1.23
--- Options.py	23 Sep 2002 03:13:30 -0000	1.24
***************
*** 50,53 ****
--- 50,58 ----
  ignore_redundant_html: False

+ # If true, the first few characters of application/octet-stream sections
+ # are used, undecoded.  What 'few' means is decided by octet_prefix_size.
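+ # (For example, base64-encoded DOS/Windows executables tend to begin
+ # with "TVqQA", so with the default octet_prefix_size of 5 such an
+ # attachment contributes the single token "octet:TVqQA".)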
+ check_octets: False + octet_prefix_size: 5 + # Generate tokens just counting the number of instances of each kind of # header line, in a case-sensitive way. *************** *** 193,196 **** --- 198,203 ---- 'count_all_header_lines': boolean_cracker, 'mine_received_headers': boolean_cracker, + 'check_octets': boolean_cracker, + 'octet_prefix_size': int_cracker, 'basic_header_tokenize': boolean_cracker, 'basic_header_tokenize_only': boolean_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** tokenizer.py 22 Sep 2002 06:58:36 -0000 1.30 --- tokenizer.py 23 Sep 2002 03:13:31 -0000 1.31 *************** *** 549,552 **** --- 549,557 ---- msg.walk())) + def octetparts(msg): + return Set(filter(lambda part: + part.get_content_type() == 'application/octet-stream', + msg.walk())) + url_re = re.compile(r""" (https? | ftp) # capture the protocol *************** *** 992,996 **** --- 997,1011 ---- part is ignored. Except in special cases, it's recommended to leave that at its default of false. + + If options.check_octets is True, the first few undecoded characters + of application/octet-stream parts of the message body become tokens. """ + + if options.check_octets: + # Find, decode application/octet-stream parts of the body, + # tokenizing the first few characters of each chunk + for part in octetparts(msg): + text = part.get_payload(decode=False) + yield "octet:%s" % text[:options.octet_prefix_size] # Find, decode (base64, qp), and tokenize textual parts of the body. From tim.one@comcast.net Mon Sep 23 05:33:18 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 23 Sep 2002 00:33:18 -0400 Subject: [Spambayes-checkins] spambayes Options.py,1.23,1.24tokenizer.py,1.30,1.31 In-Reply-To: Message-ID: > + def octetparts(msg): > + return Set(filter(lambda part: > + part.get_content_type() == > 'application/octet-stream', > + msg.walk())) I think Guido got rid of all uses of get_content_type, so that this code could be used with an older email pkg. From skip@pobox.com Mon Sep 23 13:57:47 2002 From: skip@pobox.com (Skip Montanaro) Date: Mon, 23 Sep 2002 07:57:47 -0500 Subject: [Spambayes-checkins] spambayes Options.py,1.23,1.24tokenizer.py,1.30,1.31 In-Reply-To: References: Message-ID: <15759.4043.426296.579486@12-248-11-90.client.attbi.com> >>>>> "Tim" == Tim Peters writes: >> + def octetparts(msg): >> + return Set(filter(lambda part: >> + part.get_content_type() == >> 'application/octet-stream', >> + msg.walk())) Tim> I think Guido got rid of all uses of get_content_type, so that this Tim> code could be used with an older email pkg. What is the correct replacement, part.get_type()? Skip From guido@python.org Mon Sep 23 14:18:36 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 23 Sep 2002 09:18:36 -0400 Subject: [Spambayes-checkins] spambayes Options.py,1.23,1.24tokenizer.py,1.30,1.31 In-Reply-To: Your message of "Mon, 23 Sep 2002 07:57:47 CDT." 
<15759.4043.426296.579486@12-248-11-90.client.attbi.com> References: <15759.4043.426296.579486@12-248-11-90.client.attbi.com> Message-ID: <200209231318.g8NDIaQ06599@pcp02138704pcs.reston01.va.comcast.net> > >>>>> "Tim" == Tim Peters writes: > > >> + def octetparts(msg): > >> + return Set(filter(lambda part: > >> + part.get_content_type() == > >> 'application/octet-stream', > >> + msg.walk())) > > > Tim> I think Guido got rid of all uses of get_content_type, so that this > Tim> code could be used with an older email pkg. > > What is the correct replacement, part.get_type()? Since you're only comparing it with app/oct-str, yes. --Guido van Rossum (home page: http://www.python.org/~guido/) From barry@users.sourceforge.net Mon Sep 23 14:30:45 2002 From: barry@users.sourceforge.net (Barry Warsaw) Date: Mon, 23 Sep 2002 06:30:45 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.31,1.32 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25327 Modified Files: tokenizer.py Log Message: Use the email 2.3 API, get_type() and friends -> get_content_type() and friends. The latter always returns a content type string, never None. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** tokenizer.py 23 Sep 2002 03:13:31 -0000 1.31 --- tokenizer.py 23 Sep 2002 13:30:42 -0000 1.32 *************** *** 537,541 **** text.add(htmlpart) ! elif part.get_main_type('text') == 'text': text.add(part) --- 537,541 ---- text.add(htmlpart) ! elif part.get_content_maintype() == 'text': text.add(part) *************** *** 546,550 **** # have redundant content, so it goes. def textparts(msg): ! return Set(filter(lambda part: part.get_main_type('text') == 'text', msg.walk())) --- 546,550 ---- # have redundant content, so it goes. def textparts(msg): ! return Set(filter(lambda part: part.get_content_maintype() == 'text', msg.walk())) *************** *** 716,722 **** def crack_content_xyz(msg): ! x = msg.get_type() ! if x is not None: ! yield 'content-type:' + x.lower() x = msg.get_param('type') --- 716,720 ---- def crack_content_xyz(msg): ! yield 'content-type:' + msg.get_content_type() x = msg.get_param('type') *************** *** 1036,1040 **** # Remove HTML/XML tags. ! if (part.get_type() == "text/plain" or not options.retain_pure_html_tags): text = html_re.sub(' ', text) --- 1034,1038 ---- # Remove HTML/XML tags. ! if (part.get_content_type() == "text/plain" or not options.retain_pure_html_tags): text = html_re.sub(' ', text) From bkc@users.sourceforge.net Mon Sep 23 14:55:21 2002 From: bkc@users.sourceforge.net (Brad Clements) Date: Mon, 23 Sep 2002 06:55:21 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.9,1.10 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12279 Modified Files: cmp.py Log Message: added mean and sdev reporting, and delta reporting Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** cmp.py 19 Sep 2002 10:25:31 -0000 1.9 --- cmp.py 23 Sep 2002 13:55:18 -0000 1.10 *************** *** 22,35 **** def suck(f): fns = [] ! fps = [] get = f.readline while 1: line = get() ! if line.startswith('-> tested'): print line, if line.startswith('-> '): continue if line.startswith('total'): ! 
break # A line with an f-p rate and an f-n rate. p, n = map(float, line.split()) --- 22,55 ---- def suck(f): fns = [] ! fps = [] ! hamdev = [] ! spamdev = [] ! get = f.readline while 1: line = get() ! if line.startswith('-> tested'): print line, + if line.find('sample sdev') != -1: + vals = line.split(';') + mean = float(vals[1].split(' ')[-1]) + sdev = float(vals[2].split(' ')[-1]) + val = (mean,sdev) + typ = vals[0].split(' ')[2] + if line.find('for all runs') != -1: + if typ == 'Ham': + hamdevall = val + else: + spamdevall = val + elif line.find('all in this') != -1: + if typ == 'Ham': + hamdev.append(val) + else: + spamdev.append(val) + continue if line.startswith('-> '): continue if line.startswith('total'): ! break # A line with an f-p rate and an f-n rate. p, n = map(float, line.split()) *************** *** 45,53 **** fpmean = float(get().split()[-1]) fnmean = float(get().split()[-1]) ! return fps, fns, fptot, fntot, fpmean, fnmean def tag(p1, p2): if p1 == p2: ! t = "tied" else: t = p1 < p2 and "lost " or "won " --- 65,73 ---- fpmean = float(get().split()[-1]) fnmean = float(get().split()[-1]) ! return fps, fns, fptot, fntot, fpmean, fnmean, hamdev, spamdev,hamdevall,spamdevall def tag(p1, p2): if p1 == p2: ! t = "tied " else: t = p1 < p2 and "lost " or "won " *************** *** 58,62 **** t += " +(was 0)" return t ! def dump(p1s, p2s): alltags = "" --- 78,93 ---- t += " +(was 0)" return t ! ! def mtag(m1,m2): ! mean1,dev1 = m1 ! mean2,dev2 = m2 ! mp = (mean2 - mean1) * 100.0 / mean1 ! dp = (dev2 - dev1) * 100.0 / dev1 ! ! return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % ( ! mean1,mean2,mp, ! dev1,dev2,dp ! ) ! def dump(p1s, p2s): alltags = "" *************** *** 69,72 **** --- 100,107 ---- print "%-4s %2d times" % (t, alltags.count(t)) print + + def dumpdev(meandev1,meandev2): + for m1,m2 in zip(meandev1,meandev2): + print mtag(m1, m2) def windowsfy(fn): *************** *** 83,88 **** f2n = windowsfy(f2n) ! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n)) ! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n)) print --- 118,123 ---- f2n = windowsfy(f2n) ! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1,hamdev1,spamdev1,hamdevall1,spamdevall1 = suck(file(f1n)) ! 
fp2, fn2, fptot2, fntot2, fpmean2, fnmean2,hamdev2,spamdev2,hamdevall2,spamdevall2 = suck(file(f2n)) print *************** *** 97,98 **** --- 132,151 ---- print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2) print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2) + + print + print "ham mean ham sdev" + dumpdev(hamdev1,hamdev2) + print + print "ham mean and sdev for all runs" + dumpdev([hamdevall1],[hamdevall2]) + + print + print "spam mean spam sdev" + dumpdev(spamdev1,spamdev2) + print + print "spam mean and sdev for all runs" + dumpdev([spamdevall1],[spamdevall2]) + print + diff1 = spamdevall1[0] - hamdevall1[0] + diff2 = spamdevall2[0] - hamdevall2[0] + print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,diff2,(diff2-diff1)) From bkc@users.sourceforge.net Mon Sep 23 14:56:16 2002 From: bkc@users.sourceforge.net (Brad Clements) Date: Mon, 23 Sep 2002 06:56:16 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13218 Modified Files: TestDriver.py Log Message: changed mean and sdev output, added -> prefix for capture by rates.py and cmp.py Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** TestDriver.py 22 Sep 2002 08:31:48 -0000 1.6 --- TestDriver.py 23 Sep 2002 13:56:14 -0000 1.7 *************** *** 90,98 **** def printhist(tag, ham, spam): print ! print "Ham distribution for", tag ham.display() print ! print "Spam distribution for", tag spam.display() --- 90,98 ---- def printhist(tag, ham, spam): print ! print "-> Ham distribution for", tag, ham.display() print ! print "-> Spam distribution for", tag, spam.display() From montanaro@users.sourceforge.net Mon Sep 23 15:38:44 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Mon, 23 Sep 2002 07:38:44 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.32,1.33 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv14069 Modified Files: tokenizer.py Log Message: replace get_content_type() with get_type() to allow running under 2.2.x Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** tokenizer.py 23 Sep 2002 13:30:42 -0000 1.32 --- tokenizer.py 23 Sep 2002 14:38:41 -0000 1.33 *************** *** 551,555 **** def octetparts(msg): return Set(filter(lambda part: ! part.get_content_type() == 'application/octet-stream', msg.walk())) --- 551,555 ---- def octetparts(msg): return Set(filter(lambda part: ! part.get_type() == 'application/octet-stream', msg.walk())) From richiehindle@users.sourceforge.net Mon Sep 23 20:41:01 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Mon, 23 Sep 2002 12:41:01 -0700 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv4313 Modified Files: pop3proxy.py Log Message: Fixed a bug whereby your email client would see no traffic for ages, and hence potentially time out, when huge emails were proxy'd. It now reads for 30 seconds, and if the message is still arriving it classifies it based on what it's seen so far and starts returning it to the email client. 
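The shape of the fix, as a sketch (illustrative code, not the checked-in
version; the helper name and the bare 30-second constant are assumptions):

    import time

    def read_with_deadline(readline, deadline=30):
        """Collect response lines until the terminator or the deadline.

        Returns (lines, timed_out).  On a timeout the caller classifies
        what it has so far, sends that to the email client, and then
        proxies the rest of the message one line at a time.
        """
        start = time.time()
        lines = []
        while True:
            line = readline()
            lines.append(line)
            if not line or line == '.\r\n':     # closed socket, or terminator
                return lines, False
            if time.time() - start > deadline:  # message still arriving
                return lines, True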
Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** pop3proxy.py 18 Sep 2002 22:01:39 -0000 1.2 --- pop3proxy.py 23 Sep 2002 19:40:58 -0000 1.3 *************** *** 26,30 **** """ ! import sys, re, operator, errno, getopt, cPickle, socket, asyncore, asynchat import classifier, tokenizer, hammie --- 26,39 ---- """ ! # This module is part of the spambayes project, which is Copyright 2002 ! # The Python Software Foundation and is covered by the Python Software ! # Foundation license. ! ! __author__ = "Richie Hindle " ! __credits__ = "Tim Peters, Neale Pickett, all the spambayes contributors." ! ! ! import sys, re, operator, errno, getopt, cPickle, time ! import socket, asyncore, asynchat import classifier, tokenizer, hammie *************** *** 76,80 **** asynchat.async_chat.__init__(self, clientSocket) self.request = '' - self.isClosing = False self.set_terminator('\r\n') serverSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) --- 85,88 ---- *************** *** 110,117 **** def readResponse(self, command, args): ! """Reads the POP3 server's response. Also sets self.isClosing ! to True if the server closes the socket, which tells ! found_terminator() to close when the response has been sent. """ isMulti = self.isMultiline(command, args) responseLines = [] --- 118,131 ---- def readResponse(self, command, args): ! """Reads the POP3 server's response and returns a tuple of ! (response, isClosing, timedOut). isClosing is True if the ! server closes the socket, which tells found_terminator() to ! close when the response has been sent. timedOut is set if the ! request was still arriving after 30 seconds, and tells ! found_terminator() to proxy the remainder of the response. """ + isClosing = False + timedOut = False + startTime = time.time() isMulti = self.isMultiline(command, args) responseLines = [] *************** *** 121,125 **** if not line: # The socket's been closed by the server, probably by QUIT. ! self.isClosing = True break elif not isMulti or (isFirstLine and line.startswith('-ERR')): --- 135,139 ---- if not line: # The socket's been closed by the server, probably by QUIT. ! isClosing = True break elif not isMulti or (isFirstLine and line.startswith('-ERR')): *************** *** 135,141 **** responseLines.append(line) isFirstLine = False ! return ''.join(responseLines) def collect_incoming_data(self, data): --- 149,161 ---- responseLines.append(line) + # Time out after 30 seconds - found_terminator() knows how + # to deal with this. + if time.time() > startTime + 30: + timedOut = True + break + isFirstLine = False ! return ''.join(responseLines), isClosing, timedOut def collect_incoming_data(self, data): *************** *** 146,155 **** """Asynchat override.""" # Send the request to the server and read the reply. - # XXX When the response is huge, the email client can time out. - # It should read as much as it can from the server, then if the - # response is still coming after say 30 seconds, it should - # classify the message based on that and send back the headers - # and the body so far. Then it should become a simple - # one-packet-at-a-time proxy for the rest of the response. if self.request.strip().upper() == 'KILL': self.serverFile.write('QUIT\r\n') --- 166,169 ---- *************** *** 168,172 **** command = splitCommand[0].upper() args = splitCommand[1:] ! 
rawResponse = self.readResponse(command, args) # Pass the request and the raw response to the subclass and --- 182,186 ---- command = splitCommand[0].upper() args = splitCommand[1:] ! rawResponse, isClosing, timedOut = self.readResponse(command, args) # Pass the request and the raw response to the subclass and *************** *** 176,184 **** self.request = '' ! # If readResponse() decided that the server had closed its ! # socket, close this one when the response has been sent. ! if self.isClosing: ! self.close_when_done() def handle_error(self): """Let SystemExit cause an exit.""" --- 190,216 ---- self.request = '' ! # If readResponse() timed out, we still need to read and proxy ! # the rest of the message. ! if timedOut: ! while True: ! line = self.serverFile.readline() ! if not line: ! # The socket's been closed by the server. ! isClosing = True ! break ! elif line == '.\r\n': ! # The termination line. ! self.push(line) ! break ! else: ! # A normal line. ! self.push(line) + # If readResponse() or the loop above decided that the server + # has closed its socket, close this one when the response has + # been sent. + if isClosing: + self.close_when_done() + def handle_error(self): """Let SystemExit cause an exit.""" *************** *** 492,496 **** def runProxy(): ! bayes = hammie.createbayes() BayesProxyListener('localhost', 8110, 8111, bayes) bayes.learn(tokenizer.tokenize(spam1), True) --- 524,529 ---- def runProxy(): ! # Name the database in case it ever gets auto-flushed to disk. ! bayes = hammie.createbayes('_pop3proxy.db') BayesProxyListener('localhost', 8110, 8111, bayes) bayes.learn(tokenizer.tokenize(spam1), True) From tim_one@users.sourceforge.net Mon Sep 23 21:03:09 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 23 Sep 2002 13:03:09 -0700 Subject: [Spambayes-checkins] spambayes msgs.py,NONE,1.1 README.txt,1.22,1.23 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12319 Modified Files: README.txt Added Files: msgs.py Log Message: Preparing to refactor my test drivers. --- NEW FILE: msgs.py --- import os import random HAMKEEP = None SPAMKEEP = None SEED = random.randrange(2000000000) class Msg(object): __slots__ = 'tag', 'guts' def __init__(self, dir, name): path = dir + "/" + name self.tag = path f = open(path, 'rb') self.guts = f.read() f.close() def __iter__(self): return tokenize(self.guts) # Compare msgs by their paths; this is appropriate for sets of msgs. def __hash__(self): return hash(self.tag) def __eq__(self, other): return self.tag == other.tag def __str__(self): return self.guts # The iterator yields a stream of Msg objects, taken from a list of directories. class MsgStream(object): __slots__ = 'tag', 'directories', 'keep' def __init__(self, tag, directories, keep=None): self.tag = tag self.directories = directories self.keep = keep def __str__(self): return self.tag def produce(self): if self.keep is None: for directory in self.directories: for fname in os.listdir(directory): yield Msg(directory, fname) return # We only want part of the msgs. Shuffle each directory list, but # in such a way that we'll get the same result each time this is # called on the same directory list. for directory in self.directories: all = os.listdir(directory) random.seed(hash(max(all)) ^ SEED) # reproducible across calls random.shuffle(all) del all[self.keep:] all.sort() # seems to speed access on Win98! 
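            # (The shuffle above is seeded with hash(max(all)) ^ SEED, making
            # the selection a pure function of the directory's contents and
            # the run's seed: repeated runs pick the same random sample.)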
for fname in all: yield Msg(directory, fname) def __iter__(self): return self.produce() class HamStream(MsgStream): def __init__(self, tag, directories): MsgStream.__init__(self, tag, directories, HAMKEEP) class SpamStream(MsgStream): def __init__(self, tag, directories): MsgStream.__init__(self, tag, directories, SPAMKEEP) def setparms(hamkeep, spamkeep, seed=None): """Set HAMKEEP and SPAMKEEP. If seed is not None, also set SEED.""" global HAMKEEP, SPAMKEEP, SEED HAMKEEP, SPAMKEEP = hamkeep, spamkeep if seed is not None: SEED = seed Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** README.txt 22 Sep 2002 04:59:54 -0000 1.22 --- README.txt 23 Sep 2002 20:03:06 -0000 1.23 *************** *** 60,63 **** --- 60,67 ---- cmp.py below. + msgs.py + Some simple classes to wrap raw msgs, and to produce streams of + msgs. The test drivers use these. + Apps From tim_one@users.sourceforge.net Mon Sep 23 21:18:36 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 23 Sep 2002 13:18:36 -0700 Subject: [Spambayes-checkins] spambayes msgs.py,1.1,1.2 timtest.py,1.27,1.28 timcv.py,1.7,1.8 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16552 Modified Files: msgs.py timtest.py timcv.py Log Message: Refactored my c-v and grid test drivers to cut code duplication. Of course this created more duplication too . The particular reason for upgrading the grid driver is that the c-v driver really can't be used with Gary Robinson's central-limit approach: incrementally updating a classifier given the three-pass training procedure needed looks *hard*. The grid driver doesn't try to incrementally change the classifiers it builds. Index: msgs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/msgs.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** msgs.py 23 Sep 2002 20:03:05 -0000 1.1 --- msgs.py 23 Sep 2002 20:18:34 -0000 1.2 *************** *** 2,5 **** --- 2,7 ---- import random + from tokenizer import tokenize + HAMKEEP = None SPAMKEEP = None *************** *** 29,33 **** return self.guts ! # The iterator yields a stream of Msg objects, taken from a list of directories. class MsgStream(object): __slots__ = 'tag', 'directories', 'keep' --- 31,36 ---- return self.guts ! # The iterator yields a stream of Msg objects, taken from a list of ! # directories. class MsgStream(object): __slots__ = 'tag', 'directories', 'keep' Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** timtest.py 22 Sep 2002 06:58:36 -0000 1.27 --- timtest.py 23 Sep 2002 20:18:34 -0000 1.28 *************** *** 1,9 **** #! /usr/bin/env python - # At the moment, this requires Python 2.3 from CVS (heapq, Set, enumerate). # A test driver using "the standard" test directory structure. See also ! # rates.py and cmp.py for summarizing results. ! """Usage: %(program)s [-h] -n nsets Where: --- 1,9 ---- #! /usr/bin/env python # A test driver using "the standard" test directory structure. See also ! # rates.py and cmp.py for summarizing results. This runs an NxN test grid, ! # skipping the diagonal. ! 
"""Usage: %(program)s [options] -n nsets Where: *************** *** 14,17 **** --- 14,32 ---- This is required. + If you only want to use some of the messages in each set, + + --ham-keep int + The maximum number of msgs to use from each Ham set. The msgs are + chosen randomly. See also the -s option. + + --spam-keep int + The maximum number of msgs to use from each Spam set. The msgs are + chosen randomly. See also the -s option. + + -s int + A seed for the random number generator. Has no effect unless + at least on of {--ham-keep, --spam-keep} is specified. If -s + isn't specifed, the seed is taken from current time. + In addition, an attempt is made to merge bayescustomize.ini into the options. If that exists, it can be used to change the settings in Options.options. *************** *** 20,29 **** from __future__ import generators - import os import sys from Options import options ! from tokenizer import tokenize ! from TestDriver import Driver program = sys.argv[0] --- 35,43 ---- from __future__ import generators import sys from Options import options ! import TestDriver ! import msgs program = sys.argv[0] *************** *** 37,85 **** sys.exit(code) - class Msg(object): - def __init__(self, dir, name): - path = dir + "/" + name - self.tag = path - f = open(path, 'rb') - guts = f.read() - f.close() - self.guts = guts - - def __iter__(self): - return tokenize(self.guts) - - def __hash__(self): - return hash(self.tag) - - def __eq__(self, other): - return self.tag == other.tag - - def __str__(self): - return self.guts - - class MsgStream(object): - def __init__(self, directory): - self.directory = directory - - def __str__(self): - return self.directory - - def produce(self): - directory = self.directory - for fname in os.listdir(directory): - yield Msg(directory, fname) - - def xproduce(self): - import random - directory = self.directory - all = os.listdir(directory) - random.seed(hash(directory)) - random.shuffle(all) - for fname in all[-1500:-1300:]: - yield Msg(directory, fname) - - def __iter__(self): - return self.produce() - def drive(nsets): print options.display() --- 51,54 ---- *************** *** 89,112 **** spamhamdirs = zip(spamdirs, hamdirs) ! d = Driver() for spamdir, hamdir in spamhamdirs: d.new_classifier() ! d.train(MsgStream(hamdir), MsgStream(spamdir)) for sd2, hd2 in spamhamdirs: if (sd2, hd2) == (spamdir, hamdir): continue ! d.test(MsgStream(hd2), MsgStream(sd2)) d.finishtest() d.alldone() ! if __name__ == "__main__": import getopt try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:') except getopt.error, msg: usage(1, msg) ! nsets = None for opt, arg in opts: if opt == '-h': --- 58,84 ---- spamhamdirs = zip(spamdirs, hamdirs) ! d = TestDriver.Driver() for spamdir, hamdir in spamhamdirs: d.new_classifier() ! d.train(msgs.HamStream(hamdir, [hamdir]), ! msgs.SpamStream(spamdir, [spamdir])) for sd2, hd2 in spamhamdirs: if (sd2, hd2) == (spamdir, hamdir): continue ! d.test(msgs.HamStream(hd2, [hd2]), ! msgs.SpamStream(sd2, [sd2])) d.finishtest() d.alldone() ! def main(): import getopt try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:s:', ! ['ham-keep=', 'spam-keep=']) except getopt.error, msg: usage(1, msg) ! 
nsets = seed = hamkeep = spamkeep = None for opt, arg in opts: if opt == '-h': *************** *** 114,117 **** --- 86,95 ---- elif opt == '-n': nsets = int(arg) + elif opt == '-s': + seed = int(arg) + elif opt == '--ham-keep': + hamkeep = int(arg) + elif opt == '--spam-keep': + spamkeep = int(arg) if args: *************** *** 120,122 **** --- 98,104 ---- usage(1, "-n is required") + msgs.setparms(hamkeep, spamkeep, seed) drive(nsets) + + if __name__ == "__main__": + main() Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** timcv.py 22 Sep 2002 06:58:36 -0000 1.7 --- timcv.py 23 Sep 2002 20:18:34 -0000 1.8 *************** *** 1,4 **** #! /usr/bin/env python - # At the moment, this requires Python 2.3 from CVS (heapq, Set, enumerate). # A driver for N-fold cross validation. --- 1,3 ---- *************** *** 34,48 **** from __future__ import generators - import os import sys - import random from Options import options - from tokenizer import tokenize import TestDriver ! ! HAMKEEP = None ! SPAMKEEP = None ! SEED = random.randrange(2000000000) program = sys.argv[0] --- 33,41 ---- from __future__ import generators import sys from Options import options import TestDriver ! import msgs program = sys.argv[0] *************** *** 56,122 **** sys.exit(code) - class Msg(object): - __slots__ = 'tag', 'guts' - - def __init__(self, dir, name): - path = dir + "/" + name - self.tag = path - f = open(path, 'rb') - self.guts = f.read() - f.close() - - def __iter__(self): - return tokenize(self.guts) - - # Compare msgs by their paths; this is appropriate for sets of msgs. - def __hash__(self): - return hash(self.tag) - - def __eq__(self, other): - return self.tag == other.tag - - def __str__(self): - return self.guts - - class MsgStream(object): - __slots__ = 'tag', 'directories', 'keep' - - def __init__(self, tag, directories, keep=None): - self.tag = tag - self.directories = directories - self.keep = keep - - def __str__(self): - return self.tag - - def produce(self): - if self.keep is None: - for directory in self.directories: - for fname in os.listdir(directory): - yield Msg(directory, fname) - return - # We only want part of the msgs. Shuffle each directory list, but - # in such a way that we'll get the same result each time this is - # called on the same directory list. - for directory in self.directories: - all = os.listdir(directory) - random.seed(hash(max(all)) ^ SEED) # reproducible across calls - random.shuffle(all) - del all[self.keep:] - all.sort() # seems to speed access on Win98! - for fname in all: - yield Msg(directory, fname) - - def __iter__(self): - return self.produce() - - class HamStream(MsgStream): - def __init__(self, tag, directories): - MsgStream.__init__(self, tag, directories, HAMKEEP) - - class SpamStream(MsgStream): - def __init__(self, tag, directories): - MsgStream.__init__(self, tag, directories, SPAMKEEP) - def drive(nsets): print options.display() --- 49,52 ---- *************** *** 127,132 **** d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]), ! SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:])) # Now run nsets times, predicting pair i against all except pair i. --- 57,62 ---- d = TestDriver.Driver() # Train it on all sets except the first. ! d.train(msgs.HamStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]), ! 
msgs.SpamStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:])) # Now run nsets times, predicting pair i against all except pair i. *************** *** 134,139 **** h = hamdirs[i] s = spamdirs[i] ! hamstream = HamStream(h, [h]) ! spamstream = SpamStream(s, [s]) if i > 0: --- 64,69 ---- h = hamdirs[i] s = spamdirs[i] ! hamstream = msgs.HamStream(h, [h]) ! spamstream = msgs.SpamStream(s, [s]) if i > 0: *************** *** 152,156 **** def main(): - global SEED, HAMKEEP, SPAMKEEP import getopt --- 82,85 ---- *************** *** 161,165 **** usage(1, msg) ! nsets = seed = None for opt, arg in opts: if opt == '-h': --- 90,94 ---- usage(1, msg) ! nsets = seed = hamkeep = spamkeep = None for opt, arg in opts: if opt == '-h': *************** *** 170,176 **** seed = int(arg) elif opt == '--ham-keep': ! HAMKEEP = int(arg) elif opt == '--spam-keep': ! SPAMKEEP = int(arg) if args: --- 99,105 ---- seed = int(arg) elif opt == '--ham-keep': ! hamkeep = int(arg) elif opt == '--spam-keep': ! spamkeep = int(arg) if args: *************** *** 178,184 **** if nsets is None: usage(1, "-n is required") - if seed is not None: - SEED = seed drive(nsets) --- 107,112 ---- if nsets is None: usage(1, "-n is required") + msgs.setparms(hamkeep, spamkeep, seed) drive(nsets) From tim_one@users.sourceforge.net Mon Sep 23 22:19:10 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 23 Sep 2002 14:19:10 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.24,1.25 TestDriver.py,1.7,1.8classifier.py,1.17,1.18 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9358 Modified Files: Options.py TestDriver.py classifier.py Log Message: New option Classifier/use_central_limit. Read the comments in Options. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** Options.py 23 Sep 2002 03:13:30 -0000 1.24 --- Options.py 23 Sep 2002 21:19:08 -0000 1.25 *************** *** 185,188 **** --- 185,200 ---- # want a higher spam_cutoff. robinson_minimum_prob_strength: 0.0 + + ########################################################################### + # More speculative options for Gary Robinson's central-limit. These may go + # away, or a bunch of incompatible stuff above may go away. + + # Use a central-limit approach for scoring. + # The number of extremes to use is given by max_discriminators (above). + # spam_cutoff should almost certainly be exactly 0.5 when using this approach. + # DO NOT run cross-validation tests when this is enabled! They'll deliver + # nonense, or, if you're lucky, will blow up with division by 0 or negative + # square roots. An NxN test grid should work fine. + use_central_limit: False """ *************** *** 230,233 **** --- 242,247 ---- 'use_robinson_ranking': boolean_cracker, 'robinson_minimum_prob_strength': float_cracker, + + 'use_central_limit': boolean_cracker, }, } Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** TestDriver.py 23 Sep 2002 13:56:14 -0000 1.7 --- TestDriver.py 23 Sep 2002 21:19:08 -0000 1.8 *************** *** 124,127 **** --- 124,129 ---- self.trained_spam_hist = Hist(options.nbuckets) + # CAUTION: this just doesn't work for incrememental training when + # options.use_central_limit is in effect. 
      def train(self, ham, spam):
          print "-> Training on", ham, "&", spam, "...",
***************
*** 130,134 ****
--- 132,140 ----
          self.tester.train(ham, spam)
          print c.nham - nham, "hams &", c.nspam - nspam, "spams"

+         c.compute_population_stats(ham, False)
+         c.compute_population_stats(spam, True)
+
+     # CAUTION: this just doesn't work for incremental training when
+     # options.use_central_limit is in effect.
      def untrain(self, ham, spam):
          print "-> Forgetting", ham, "&", spam, "...",

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** classifier.py	21 Sep 2002 20:25:49 -0000	1.17
--- classifier.py	23 Sep 2002 21:19:08 -0000	1.18
***************
*** 221,224 ****
--- 221,248 ----
      'nspam',     # number of spam messages learn() has seen
      'nham',      # number of non-spam messages learn() has seen
+
+     # The rest is unique to the central-limit code.
+     # n is the # of data points in the population.
+     # sum is the sum of the probabilities, and is a long scaled
+     # by 2**64.
+     # sumsq is the sum of the squares of the probabilities, and
+     # is a long scaled by 2**128.
+     # mean is the mean probability of the population, as an
+     # unscaled float.
+     # var is the variance of the population, as an unscaled float.
+     # There's one set of these for the spam population, and
+     # another for the ham population.
+     # XXX If this code survives, clean it up.
+     'spamn',
+     'spamsum',
+     'spamsumsq',
+     'spammean',
+     'spamvar',
+
+     'hamn',
+     'hamsum',
+     'hamsumsq',
+     'hammean',
+     'hamvar',
  )
***************
*** 226,229 ****
--- 250,256 ----
      self.wordinfo = {}
      self.nspam = self.nham = 0
+     self.spamn = self.hamn = 0
+     self.spamsum = self.spamsumsq = 0
+     self.hamsum = self.hamsumsq = 0

  def __getstate__(self):
***************
*** 451,454 ****
--- 478,511 ----
      del self.wordinfo[word]

+ def compute_population_stats(self, msgstream, is_spam):
+     pass
+
+ # XXX More stuff should be reworked to use this as a helper function.
+ def _getclues(self, wordstream):
+     # A priority queue to remember the MAX_DISCRIMINATORS best
+     # probabilities, where "best" means largest distance from 0.5.
+     # The tuples are (distance, prob, word, record).
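+     # (nbest is kept as a min-heap ordered on distance: heapreplace()
+     # evicts the currently weakest clue whenever a stronger one shows
+     # up, so after the loop nbest holds the strongest clues seen.)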
+ nbest = [(-1.0, None, None, None)] * options.max_discriminators + smallest_best = -1.0 + + wordinfoget = self.wordinfo.get + now = time.time() + for word in Set(wordstream): + record = wordinfoget(word) + if record is None: + prob = UNKNOWN_SPAMPROB + else: + record.atime = now + prob = record.spamprob + + distance = abs(prob - 0.5) + if distance > smallest_best: + heapreplace(nbest, (distance, prob, word, record)) + smallest_best = nbest[0][0] + + clues = [(prob, word, record) + for distance, prob, word, record in nbest + if prob is not None] + return clues #************************************************************************ *************** *** 599,603 **** self.wordinfo[word] = record - if options.use_robinson_probability: update_probabilities = robinson_update_probabilities --- 656,744 ---- self.wordinfo[word] = record if options.use_robinson_probability: update_probabilities = robinson_update_probabilities + + def central_limit_compute_population_stats(self, msgstream, is_spam): + from math import ldexp + + sum = sumsq = 0 + seen = {} + for msg in msgstream: + for prob, word, record in self._getclues(msg): + if word in seen: + continue + seen[word] = 1 + prob = long(ldexp(prob, 64)) + sum += prob + sumsq += prob * prob + n = len(seen) + + if is_spam: + self.spamn, self.spamsum, self.spamsumsq = n, sum, sumsq + spamsum = self.spamsum + self.spammean = ldexp(spamsum, -64) / self.spamn + spamvar = self.spamsumsq * self.spamn - spamsum**2 + self.spamvar = ldexp(spamvar, -128) / (self.spamn ** 2) + print 'spammean', self.spammean, 'spamvar', self.spamvar + else: + self.hamn, self.hamsum, self.hamsumsq = n, sum, sumsq + hamsum = self.hamsum + self.hammean = ldexp(hamsum, -64) / self.hamn + hamvar = self.hamsumsq * self.hamn - hamsum**2 + self.hamvar = ldexp(hamvar, -128) / (self.hamn ** 2) + print 'hammean', self.hammean, 'hamvar', self.hamvar + + if options.use_central_limit: + compute_population_stats = central_limit_compute_population_stats + + def central_limit_spamprob(self, wordstream, evidence=False): + """Return best-guess probability that wordstream is spam. + + wordstream is an iterable object producing words. + The return value is a float in [0.0, 1.0]. + + If optional arg evidence is True, the return value is a pair + probability, evidence + where evidence is a list of (word, probability) pairs. + """ + + from math import sqrt + + clues = self._getclues(wordstream) + sum = 0.0 + for prob, word, record in clues: + sum += prob + if record is not None: + record.killcount += 1 + n = len(clues) + if n == 0: + return 0.5 + mean = sum / n + + # If this sample is drawn from the spam population, its mean is + # distributed around spammean with variance spamvar/n. Likewise + # for if it's drawn from the ham population. Compute a normalized + # z-score (how many stddevs is it away from the population mean?) + # against both populations, and then it's ham or spam depending + # on which population it matches better. + zham = (mean - self.hammean) / sqrt(self.hamvar / n) + zspam = (mean - self.spammean) / sqrt(self.spamvar / n) + stat = abs(zham) - abs(zspam) # > 0 for spam, < 0 for ham + + # Normalize into [0, 1]. I'm arbitrarily clipping it to fit in + # [-20, 20] first. 20 is a massive z-score difference. 
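+         # (The arithmetic below is linear in stat: -20 maps to 0.0,
+         # 0 maps to 0.5, and +20 maps to 1.0.)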
+ if stat < -20.0: + stat = -20.0 + elif stat > 20.0: + stat = 20.0 + stat = 0.5 + stat / 40.0 + + if evidence: + clues = [(word, prob) for prob, word, record in clues] + clues.sort(lambda a, b: cmp(a[1], b[1])) + return stat, clues + else: + return stat + + if options.use_central_limit: + spamprob = central_limit_spamprob From tim_one@users.sourceforge.net Mon Sep 23 22:20:13 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 23 Sep 2002 14:20:13 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.25,1.26 cdb.py,1.3,1.4 cmp.py,1.10,1.11 hammie.py,1.19,1.20 hammiesrv.py,1.1,1.2 loosecksum.py,1.2,1.3 mboxtest.py,1.8,1.9 msgs.py,1.2,1.3 pop3proxy.py,1.3,1.4 setup.py,1.3,1.4 splitndirs.py,1.3,1.4 unheader.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9970 Modified Files: Options.py cdb.py cmp.py hammie.py hammiesrv.py loosecksum.py mboxtest.py msgs.py pop3proxy.py setup.py splitndirs.py unheader.py Log Message: Whitespace normalization. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.25 retrieving revision 1.26 diff -C2 -d -r1.25 -r1.26 *** Options.py 23 Sep 2002 21:19:08 -0000 1.25 --- Options.py 23 Sep 2002 21:20:10 -0000 1.26 *************** *** 301,303 **** else: options.mergefiles(['bayescustomize.ini']) - --- 301,302 ---- Index: cdb.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cdb.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** cdb.py 22 Sep 2002 06:58:36 -0000 1.3 --- cdb.py 23 Sep 2002 21:20:10 -0000 1.4 *************** *** 19,23 **** def uint32_pack(n): return struct.pack(' 0: driver.untrain(hams, spams) ! driver.test(hams, spams) driver.finishtest() --- 161,165 ---- if i > 0: driver.untrain(hams, spams) ! driver.test(hams, spams) driver.finishtest() *************** *** 167,171 **** if i < NSETS - 1: driver.train(hams, spams) ! i += 1 driver.alldone() --- 167,171 ---- if i < NSETS - 1: driver.train(hams, spams) ! i += 1 driver.alldone() Index: msgs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/msgs.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** msgs.py 23 Sep 2002 20:18:34 -0000 1.2 --- msgs.py 23 Sep 2002 21:20:10 -0000 1.3 *************** *** 79,81 **** HAMKEEP, SPAMKEEP = hamkeep, spamkeep if seed is not None: ! SEED = seed \ No newline at end of file --- 79,81 ---- HAMKEEP, SPAMKEEP = hamkeep, spamkeep if seed is not None: ! SEED = seed Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** pop3proxy.py 23 Sep 2002 19:40:58 -0000 1.3 --- pop3proxy.py 23 Sep 2002 21:20:10 -0000 1.4 *************** *** 12,16 **** defaults to 110. ! options (the same as hammie): -p FILE : use the named data file -d : the file is a DBM file rather than a pickle --- 12,16 ---- defaults to 110. ! options (the same as hammie): -p FILE : use the named data file -d : the file is a DBM file rather than a pickle *************** *** 46,50 **** dispatchers created by a factory callable. """ ! def __init__(self, port, factory, factoryArgs=(), socketMap=asyncore.socket_map): --- 46,50 ---- dispatchers created by a factory callable. """ ! 
def __init__(self, port, factory, factoryArgs=(), socketMap=asyncore.socket_map): *************** *** 81,85 **** server). """ ! def __init__(self, clientSocket, serverName, serverPort): asynchat.async_chat.__init__(self, clientSocket) --- 81,85 ---- server). """ ! def __init__(self, clientSocket, serverName, serverPort): asynchat.async_chat.__init__(self, clientSocket) *************** *** 90,98 **** self.serverFile = serverSocket.makefile() self.push(self.serverFile.readline()) ! def handle_connect(self): """Suppress the asyncore "unhandled connect event" warning.""" pass ! def onTransaction(self, command, args, response): """Overide this. Takes the raw request and the response, and --- 90,98 ---- self.serverFile = serverSocket.makefile() self.push(self.serverFile.readline()) ! def handle_connect(self): """Suppress the asyncore "unhandled connect event" warning.""" pass ! def onTransaction(self, command, args, response): """Overide this. Takes the raw request and the response, and *************** *** 101,105 **** """ raise NotImplementedError ! def isMultiline(self, command, args): """Returns True if the given request should get a multiline --- 101,105 ---- """ raise NotImplementedError ! def isMultiline(self, command, args): """Returns True if the given request should get a multiline *************** *** 116,120 **** # Assume that unknown commands will get an error response. return False ! def readResponse(self, command, args): """Reads the POP3 server's response and returns a tuple of --- 116,120 ---- # Assume that unknown commands will get an error response. return False ! def readResponse(self, command, args): """Reads the POP3 server's response and returns a tuple of *************** *** 148,152 **** # A normal line - append it to the response and carry on. responseLines.append(line) ! # Time out after 30 seconds - found_terminator() knows how # to deal with this. --- 148,152 ---- # A normal line - append it to the response and carry on. responseLines.append(line) ! # Time out after 30 seconds - found_terminator() knows how # to deal with this. *************** *** 154,166 **** timedOut = True break ! isFirstLine = False ! return ''.join(responseLines), isClosing, timedOut ! def collect_incoming_data(self, data): """Asynchat override.""" self.request = self.request + data ! def found_terminator(self): """Asynchat override.""" --- 154,166 ---- timedOut = True break ! isFirstLine = False ! return ''.join(responseLines), isClosing, timedOut ! def collect_incoming_data(self, data): """Asynchat override.""" self.request = self.request + data ! def found_terminator(self): """Asynchat override.""" *************** *** 183,187 **** args = splitCommand[1:] rawResponse, isClosing, timedOut = self.readResponse(command, args) ! # Pass the request and the raw response to the subclass and # send back the cooked response. --- 183,187 ---- args = splitCommand[1:] rawResponse, isClosing, timedOut = self.readResponse(command, args) ! # Pass the request and the raw response to the subclass and # send back the cooked response. *************** *** 189,193 **** self.push(cookedResponse) self.request = '' ! # If readResponse() timed out, we still need to read and proxy # the rest of the message. --- 189,193 ---- self.push(cookedResponse) self.request = '' ! # If readResponse() timed out, we still need to read and proxy # the rest of the message. *************** *** 206,210 **** # A normal line. self.push(line) ! 
# If readResponse() or the loop above decided that the server # has closed its socket, close this one when the response has --- 206,210 ---- # A normal line. self.push(line) ! # If readResponse() or the loop above decided that the server # has closed its socket, close this one when the response has *************** *** 212,216 **** if isClosing: self.close_when_done() ! def handle_error(self): """Let SystemExit cause an exit.""" --- 212,216 ---- if isClosing: self.close_when_done() ! def handle_error(self): """Let SystemExit cause an exit.""" *************** *** 220,224 **** else: asynchat.async_chat.handle_error(self) ! class BayesProxyListener(Listener): --- 220,224 ---- else: asynchat.async_chat.handle_error(self) ! class BayesProxyListener(Listener): *************** *** 226,230 **** BayesProxy objects to serve them. """ ! def __init__(self, serverName, serverPort, proxyPort, bayes): proxyArgs = (serverName, serverPort, bayes) --- 226,230 ---- BayesProxy objects to serve them. """ ! def __init__(self, serverName, serverPort, proxyPort, bayes): proxyArgs = (serverName, serverPort, bayes) *************** *** 235,243 **** """Proxies between an email client and a POP3 server, inserting judgement headers. It acts on the following POP3 commands: ! o STAT: o Adds the size of all the judgement headers to the maildrop size. ! o LIST: o With no message number: adds the size of an judgement header --- 235,243 ---- """Proxies between an email client and a POP3 server, inserting judgement headers. It acts on the following POP3 commands: ! o STAT: o Adds the size of all the judgement headers to the maildrop size. ! o LIST: o With no message number: adds the size of an judgement header *************** *** 245,253 **** o With a message number: adds the size of an judgement header to the message size. ! o RETR: o Adds the judgement header based on the raw headers and body of the message. ! o TOP: o Adds the judgement header based on the raw headers and as --- 245,253 ---- o With a message number: adds the size of an judgement header to the message size. ! o RETR: o Adds the judgement header based on the raw headers and body of the message. ! o TOP: o Adds the judgement header based on the raw headers and as *************** *** 268,272 **** self.handlers = {'STAT': self.onStat, 'LIST': self.onList, 'RETR': self.onRetr, 'TOP': self.onTop} ! def send(self, data): """Logs the data to the log file.""" --- 268,272 ---- self.handlers = {'STAT': self.onStat, 'LIST': self.onList, 'RETR': self.onRetr, 'TOP': self.onTop} ! def send(self, data): """Logs the data to the log file.""" *************** *** 274,278 **** self.logFile.flush() return POP3ProxyBase.send(self, data) ! def recv(self, size): """Logs the data to the log file.""" --- 274,278 ---- self.logFile.flush() return POP3ProxyBase.send(self, data) ! def recv(self, size): """Logs the data to the log file.""" *************** *** 281,285 **** self.logFile.flush() return data ! def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the --- 281,285 ---- self.logFile.flush() return data ! def onTransaction(self, command, args, response): """Takes the raw request and response, and returns the *************** *** 299,303 **** else: return response ! def onList(self, command, args, response): """Adds the size of an judgement header to the message --- 299,303 ---- else: return response ! 
def onList(self, command, args, response): """Adds the size of an judgement header to the message *************** *** 323,327 **** else: return response ! def onRetr(self, command, args, response): """Adds the judgement header based on the raw headers and body --- 323,327 ---- else: return response ! def onRetr(self, command, args, response): """Adds the judgement header based on the raw headers and body *************** *** 332,336 **** # Break off the first line, which will be '+OK'. ok, message = response.split('\n', 1) ! # Now find the spam disposition and add the header. The # trailing space in "No " ensures consistent lengths - this --- 332,336 ---- # Break off the first line, which will be '+OK'. ok, message = response.split('\n', 1) ! # Now find the spam disposition and add the header. The # trailing space in "No " ensures consistent lengths - this *************** *** 412,416 **** """Listener for TestPOP3Server. Works on port 8110, to co-exist with real POP3 servers.""" ! def __init__(self, socketMap=asyncore.socket_map): Listener.__init__(self, 8110, TestPOP3Server, socketMap=socketMap) --- 412,416 ---- """Listener for TestPOP3Server. Works on port 8110, to co-exist with real POP3 servers.""" ! def __init__(self, socketMap=asyncore.socket_map): Listener.__init__(self, 8110, TestPOP3Server, socketMap=socketMap) *************** *** 423,427 **** kill it. The mail content is the example messages above. """ ! def __init__(self, clientSocket, socketMap=asyncore.socket_map): # Grumble: asynchat.__init__ doesn't take a 'map' argument, --- 423,427 ---- kill it. The mail content is the example messages above. """ ! def __init__(self, clientSocket, socketMap=asyncore.socket_map): # Grumble: asynchat.__init__ doesn't take a 'map' argument, *************** *** 438,450 **** self.push("+OK ready\r\n") self.request = '' ! def handle_connect(self): """Suppress the asyncore "unhandled connect event" warning.""" pass ! def collect_incoming_data(self, data): """Asynchat override.""" self.request = self.request + data ! def found_terminator(self): """Asynchat override.""" --- 438,450 ---- self.push("+OK ready\r\n") self.request = '' ! def handle_connect(self): """Suppress the asyncore "unhandled connect event" warning.""" pass ! def collect_incoming_data(self, data): """Asynchat override.""" self.request = self.request + data ! def found_terminator(self): """Asynchat override.""" *************** *** 464,468 **** self.push(handler(command, args)) self.request = '' ! def handle_error(self): """Let SystemExit cause an exit.""" --- 464,468 ---- self.push(handler(command, args)) self.request = '' ! def handle_error(self): """Let SystemExit cause an exit.""" *************** *** 472,476 **** else: asynchat.async_chat.handle_error(self) ! def onStat(self, command, args): """POP3 STAT command.""" --- 472,476 ---- else: asynchat.async_chat.handle_error(self) ! def onStat(self, command, args): """POP3 STAT command.""" *************** *** 478,482 **** maildropSize += len(self.maildrop) * len(HEADER_EXAMPLE) return "+OK %d %d\r\n" % (len(self.maildrop), maildropSize) ! def onList(self, command, args): """POP3 LIST command, with optional message number argument.""" --- 478,482 ---- maildropSize += len(self.maildrop) * len(HEADER_EXAMPLE) return "+OK %d %d\r\n" % (len(self.maildrop), maildropSize) ! def onList(self, command, args): """POP3 LIST command, with optional message number argument.""" *************** *** 494,498 **** returnLines.append(".") return '\r\n'.join(returnLines) + '\r\n' ! 
def onRetr(self, command, args): """POP3 RETR command.""" --- 494,498 ---- returnLines.append(".") return '\r\n'.join(returnLines) + '\r\n' ! def onRetr(self, command, args): """POP3 RETR command.""" *************** *** 522,526 **** testServerReady.set() asyncore.loop(map=testSocketMap) ! def runProxy(): # Name the database in case it ever gets auto-flushed to disk. --- 522,526 ---- testServerReady.set() asyncore.loop(map=testSocketMap) ! def runProxy(): # Name the database in case it ever gets auto-flushed to disk. *************** *** 534,543 **** testServerReady.wait() threading.Thread(target=runProxy).start() ! # Connect to the proxy. proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM) proxy.connect(('localhost', 8111)) assert proxy.recv(100) == "+OK ready\r\n" ! # Stat the mailbox to get the number of messages. proxy.send("stat\r\n") --- 534,543 ---- testServerReady.wait() threading.Thread(target=runProxy).start() ! # Connect to the proxy. proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM) proxy.connect(('localhost', 8111)) assert proxy.recv(100) == "+OK ready\r\n" ! # Stat the mailbox to get the number of messages. proxy.send("stat\r\n") *************** *** 546,550 **** print "%d messages in test mailbox" % count assert count == 2 ! # Loop through the messages ensuring that they have judgement # headers. --- 546,550 ---- print "%d messages in test mailbox" % count assert count == 2 ! # Loop through the messages ensuring that they have judgement # headers. *************** *** 559,563 **** header = response[headerOffset:headerEnd].strip() print "Message %d: %s" % (i, header) ! # Kill the proxy and the test server. proxy.sendall("kill\r\n") --- 559,563 ---- header = response[headerOffset:headerEnd].strip() print "Message %d: %s" % (i, header) ! # Kill the proxy and the test server. proxy.sendall("kill\r\n") *************** *** 592,596 **** elif opt == '-p': pickleName = arg ! # Do whatever we've been asked to do... if not opts and not args: --- 592,596 ---- elif opt == '-p': pickleName = arg ! # Do whatever we've been asked to do... if not opts and not args: *************** *** 598,615 **** test() print "Self-test passed." # ...else it would have asserted. ! elif runTestServer: print "Running a test POP3 server on port 8110..." TestListener() asyncore.loop() ! elif len(args) == 1: # Named POP3 server, default port. main(args[0], 110, 110, pickleName, useDB) ! elif len(args) == 2: # Named POP3 server, named port. main(args[0], int(args[1]), 110, pickleName, useDB) ! else: print >>sys.stderr, __doc__ --- 598,615 ---- test() print "Self-test passed." # ...else it would have asserted. ! elif runTestServer: print "Running a test POP3 server on port 8110..." TestListener() asyncore.loop() ! elif len(args) == 1: # Named POP3 server, default port. main(args[0], 110, 110, pickleName, useDB) ! elif len(args) == 2: # Named POP3 server, named port. main(args[0], int(args[1]), 110, pickleName, useDB) ! else: print >>sys.stderr, __doc__ Index: setup.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/setup.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** setup.py 7 Sep 2002 16:15:45 -0000 1.3 --- setup.py 23 Sep 2002 21:20:10 -0000 1.4 *************** *** 2,8 **** setup( ! name='spambayes', scripts=['unheader.py', 'hammie.py'], py_modules=['classifier', 'tokenizer'] ) - --- 2,7 ---- setup( ! 
name='spambayes', scripts=['unheader.py', 'hammie.py'], py_modules=['classifier', 'tokenizer'] ) Index: splitndirs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** splitndirs.py 20 Sep 2002 20:00:45 -0000 1.3 --- splitndirs.py 23 Sep 2002 21:20:10 -0000 1.4 *************** *** 115,117 **** if __name__ == '__main__': main() - --- 115,116 ---- Index: unheader.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/unheader.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** unheader.py 22 Sep 2002 06:58:36 -0000 1.2 --- unheader.py 23 Sep 2002 21:20:10 -0000 1.3 *************** *** 29,56 **** def deSA(msg): if msg['X-Spam-Status']: ! if msg['X-Spam-Status'].startswith('Yes'): ! pct = msg['X-Spam-Prev-Content-Type'] ! if pct: ! msg['Content-Type'] = pct ! pcte = msg['X-Spam-Prev-Content-Transfer-Encoding'] ! if pcte: ! msg['Content-Transfer-Encoding'] = pcte ! subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', msg['Subject']) if subj != msg["Subject"]: msg.replace_header("Subject", subj) ! body = msg.get_payload() ! newbody = [] ! at_start = 1 ! for line in body.splitlines(): ! if at_start and line.startswith('SPAM: '): ! continue ! elif at_start: ! at_start = 0 ! else: ! newbody.append(line) ! msg.set_payload("\n".join(newbody)) unheader(msg, "X-Spam-") --- 29,56 ---- def deSA(msg): if msg['X-Spam-Status']: ! if msg['X-Spam-Status'].startswith('Yes'): ! pct = msg['X-Spam-Prev-Content-Type'] ! if pct: ! msg['Content-Type'] = pct ! pcte = msg['X-Spam-Prev-Content-Transfer-Encoding'] ! if pcte: ! msg['Content-Transfer-Encoding'] = pcte ! subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', msg['Subject']) if subj != msg["Subject"]: msg.replace_header("Subject", subj) ! body = msg.get_payload() ! newbody = [] ! at_start = 1 ! for line in body.splitlines(): ! if at_start and line.startswith('SPAM: '): ! continue ! elif at_start: ! at_start = 0 ! else: ! newbody.append(line) ! msg.set_payload("\n".join(newbody)) unheader(msg, "X-Spam-") From gvanrossum@users.sourceforge.net Mon Sep 23 22:46:37 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Mon, 23 Sep 2002 14:46:37 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18710 Modified Files: cmp.py Log Message: Changed CRLF to LF. (Some, but not all line endings were CRLF since bkc's checkin.) There's also a bug here: I ran this with rates.py output from a previous version and it said UnboundLocalError: local variable 'hamdevall' referenced before assignment But I don't know what value to initialize it (and spamdevall) to. Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** cmp.py 23 Sep 2002 21:20:10 -0000 1.11 --- cmp.py 23 Sep 2002 21:46:34 -0000 1.12 *************** *** 22,55 **** def suck(f): fns = [] ! fps = [] ! hamdev = [] ! spamdev = [] get = f.readline while 1: line = get() ! if line.startswith('-> tested'): print line, ! if line.find('sample sdev') != -1: ! vals = line.split(';') ! mean = float(vals[1].split(' ')[-1]) ! sdev = float(vals[2].split(' ')[-1]) ! val = (mean,sdev) ! typ = vals[0].split(' ')[2] ! 
if line.find('for all runs') != -1: ! if typ == 'Ham': ! hamdevall = val ! else: ! spamdevall = val ! elif line.find('all in this') != -1: ! if typ == 'Ham': ! hamdev.append(val) ! else: ! spamdev.append(val) continue if line.startswith('-> '): continue if line.startswith('total'): ! break # A line with an f-p rate and an f-n rate. p, n = map(float, line.split()) --- 22,55 ---- def suck(f): fns = [] ! fps = [] ! hamdev = [] ! spamdev = [] get = f.readline while 1: line = get() ! if line.startswith('-> tested'): print line, ! if line.find('sample sdev') != -1: ! vals = line.split(';') ! mean = float(vals[1].split(' ')[-1]) ! sdev = float(vals[2].split(' ')[-1]) ! val = (mean,sdev) ! typ = vals[0].split(' ')[2] ! if line.find('for all runs') != -1: ! if typ == 'Ham': ! hamdevall = val ! else: ! spamdevall = val ! elif line.find('all in this') != -1: ! if typ == 'Ham': ! hamdev.append(val) ! else: ! spamdev.append(val) continue if line.startswith('-> '): continue if line.startswith('total'): ! break # A line with an f-p rate and an f-n rate. p, n = map(float, line.split()) *************** *** 78,92 **** t += " +(was 0)" return t ! ! def mtag(m1,m2): ! mean1,dev1 = m1 ! mean2,dev2 = m2 ! mp = (mean2 - mean1) * 100.0 / mean1 ! dp = (dev2 - dev1) * 100.0 / dev1 ! ! return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % ( ! mean1,mean2,mp, ! dev1,dev2,dp ! ) def dump(p1s, p2s): --- 78,92 ---- t += " +(was 0)" return t ! ! def mtag(m1,m2): ! mean1,dev1 = m1 ! mean2,dev2 = m2 ! mp = (mean2 - mean1) * 100.0 / mean1 ! dp = (dev2 - dev1) * 100.0 / dev1 ! ! return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % ( ! mean1,mean2,mp, ! dev1,dev2,dp ! ) def dump(p1s, p2s): *************** *** 100,105 **** print "%-4s %2d times" % (t, alltags.count(t)) print ! ! def dumpdev(meandev1,meandev2): for m1,m2 in zip(meandev1,meandev2): print mtag(m1, m2) --- 100,105 ---- print "%-4s %2d times" % (t, alltags.count(t)) print ! ! def dumpdev(meandev1,meandev2): for m1,m2 in zip(meandev1,meandev2): print mtag(m1, m2) *************** *** 132,151 **** print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2) print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2) ! ! print ! print "ham mean ham sdev" ! dumpdev(hamdev1,hamdev2) ! print ! print "ham mean and sdev for all runs" ! dumpdev([hamdevall1],[hamdevall2]) ! ! print ! print "spam mean spam sdev" ! dumpdev(spamdev1,spamdev2) ! print ! print "spam mean and sdev for all runs" ! dumpdev([spamdevall1],[spamdevall2]) ! print ! diff1 = spamdevall1[0] - hamdevall1[0] ! diff2 = spamdevall2[0] - hamdevall2[0] ! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,diff2,(diff2-diff1)) --- 132,151 ---- print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2) print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2) ! ! print ! print "ham mean ham sdev" ! dumpdev(hamdev1,hamdev2) ! print ! print "ham mean and sdev for all runs" ! dumpdev([hamdevall1],[hamdevall2]) ! ! print ! print "spam mean spam sdev" ! dumpdev(spamdev1,spamdev2) ! print ! print "spam mean and sdev for all runs" ! dumpdev([spamdevall1],[spamdevall2]) ! print ! diff1 = spamdevall1[0] - hamdevall1[0] ! diff2 = spamdevall2[0] - hamdevall2[0] ! 
print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,diff2,(diff2-diff1)) From bkc@users.sourceforge.net Mon Sep 23 23:41:06 2002 From: bkc@users.sourceforge.net (Brad Clements) Date: Mon, 23 Sep 2002 15:41:06 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.8,1.9 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv3177 Modified Files: TestDriver.py Log Message: allow global ham and spam histogram to be saved to a binary pickle Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** TestDriver.py 23 Sep 2002 21:19:08 -0000 1.8 --- TestDriver.py 23 Sep 2002 22:41:04 -0000 1.9 *************** *** 165,168 **** --- 165,176 ---- if options.show_histograms: printhist("all runs:", self.global_ham_hist, self.global_spam_hist) + + if options.save_histogram_pickles: + for f, h in (('ham', self.global_ham_hist), ('spam', self.global_spam_hist)): + fname = "%s_%shist.pik" % (options.pickle_basename, f) + print " saving %s histogram pickle to %s" %(f, fname) + fp = file(fname, 'wb') + pickle.dump(h, fp, 1) + fp.close() def test(self, ham, spam): From bkc@users.sourceforge.net Mon Sep 23 23:41:55 2002 From: bkc@users.sourceforge.net (Brad Clements) Date: Mon, 23 Sep 2002 15:41:55 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.26,1.27 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv3385 Modified Files: Options.py Log Message: Add option to save global spam and ham history to pickles Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.26 retrieving revision 1.27 diff -C2 -d -r1.26 -r1.27 *** Options.py 23 Sep 2002 21:20:10 -0000 1.26 --- Options.py 23 Sep 2002 22:41:52 -0000 1.27 *************** *** 141,147 **** # name already exists, it's overwritten. pickle_basename is ignored when # save_trained_pickles is false. save_trained_pickles: False ! pickle_basename: class [Classifier] --- 141,153 ---- # name already exists, it's overwritten. pickle_basename is ignored when # save_trained_pickles is false. + + # if save_histogram_pickles is true, Driver.train() saves a binary + # pickle of the spam and ham histogram for "all test runs". The file + # basename is given by pickle_basename, the suffix _spamhist.pik + # or _hamhist.pik is appended to the basename. save_trained_pickles: False ! pickle_basename: class ! save_histogram_pickles: False [Classifier] *************** *** 226,229 **** --- 232,236 ---- 'show_best_discriminators': int_cracker, 'save_trained_pickles': boolean_cracker, + 'save_histogram_pickles': boolean_cracker, 'pickle_basename': string_cracker, 'show_charlimit': int_cracker, From bkc@users.sourceforge.net Tue Sep 24 00:30:09 2002 From: bkc@users.sourceforge.net (Brad Clements) Date: Mon, 23 Sep 2002 16:30:09 -0700 Subject: [Spambayes-checkins] spambayes HistToGNU.py,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16649 Added Files: HistToGNU.py Log Message: Initial version, convert hist pickles to gnuplot input --- NEW FILE: HistToGNU.py --- #! /usr/bin/env python """HistToGNU.py Convert saved binary pickle of histograms to gnu plot output """ """Usage: %(program)s [options] [histogrampicklefile ...] 
reads pickle filename from options if not specified writes to stdout """ globalOptions = """ set grid set xtics 5 set xrange [0.0:100.0] """ dataSetOptions="smooth unique" from Options import options from TestDriver import Hist import sys import cPickle as pickle program = sys.argv[0] def usage(code, msg=''): """Print usage message and sys.exit(code).""" if msg: print >> sys.stderr, msg print >> sys.stderr print >> sys.stderr, __doc__ % globals() sys.exit(code) def loadHist(path): """Load the histogram pickle object""" return pickle.load(file(path)) def outputHist(hist,f=sys.stdout): """Output the Hist object to file f""" for i in range(len(hist.buckets)): n = hist.buckets[i] if n: f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n)) def plot(files): """given a list of files, create gnu-plot file""" import cStringIO, os cmd = cStringIO.StringIO() cmd.write(globalOptions) args = [] for file in files: args.append("""'-' %s title "%s" """ % (dataSetOptions,file)) cmd.write('plot %s\n' % ",".join(args)) for file in files: outputHist(loadHist(file),cmd) cmd.write('e\n') cmd.write('pause 100\n') print cmd.getvalue() def main(): import getopt try: opts, args = getopt.getopt(sys.argv[1:], '', []) except getopt.error, msg: usage(1, msg) if not args and options.save_histogram_pickles: args = [] for f in ('ham', 'spam'): fname = "%s_%shist.pik" % (options.pickle_basename, f) args.append(fname) if args: plot(args) else: print "could not locate any files to plot" if __name__ == "__main__": main() From tim_one@users.sourceforge.net Tue Sep 24 01:37:34 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 23 Sep 2002 17:37:34 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.23,1.24 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv1923 Modified Files: README.txt Log Message: Updated the blurb about requiring 2.3. Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** README.txt 23 Sep 2002 20:03:06 -0000 1.23 --- README.txt 24 Sep 2002 00:37:32 -0000 1.24 *************** *** 24,29 **** too small to measure reliably across that much training data. ! The code here depends in various ways on the latest Python from CVS ! (a.k.a. Python 2.3a0 :-). --- 24,28 ---- too small to measure reliably across that much training data. ! The code in this project requires Python 2.2.1 (or later). From tim_one@users.sourceforge.net Tue Sep 24 01:38:39 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 23 Sep 2002 17:38:39 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.20,1.21 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2032 Modified Files: hammie.py Log Message: Removed an obsolete 2.3 comment -- or maybe it isn't obsolete? If hammie.py really requires 2.3, somebody put the comment back in . Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** hammie.py 23 Sep 2002 21:20:10 -0000 1.20 --- hammie.py 24 Sep 2002 00:38:37 -0000 1.21 *************** *** 1,4 **** #! 
/usr/bin/env python - # At the moment, this requires Python 2.3 from CVS # A driver for the classifier module and Tim's tokenizer that you can --- 1,3 ---- From tim_one@users.sourceforge.net Tue Sep 24 01:39:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 23 Sep 2002 17:39:08 -0700 Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv2336 Modified Files: HistToGNU.py Log Message: Whitespace normalization. Index: HistToGNU.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** HistToGNU.py 23 Sep 2002 23:30:07 -0000 1.1 --- HistToGNU.py 24 Sep 2002 00:39:06 -0000 1.2 *************** *** 88,90 **** if __name__ == "__main__": main() - --- 88,89 ---- From guido@python.org Tue Sep 24 01:58:08 2002 From: guido@python.org (Guido van Rossum) Date: Mon, 23 Sep 2002 20:58:08 -0400 Subject: [Spambayes-checkins] spambayes hammie.py,1.20,1.21 In-Reply-To: Your message of "Mon, 23 Sep 2002 17:38:39 PDT." References: Message-ID: <200209240058.g8O0w8o20276@pcp02138704pcs.reston01.va.comcast.net> > Removed an obsolete 2.3 comment -- or maybe it isn't obsolete? If > hammie.py really requires 2.3, somebody put the comment back in . No, I tested it successfully with 2.2.1 last night. --Guido van Rossum (home page: http://www.python.org/~guido/) From tim_one@users.sourceforge.net Tue Sep 24 04:29:51 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Mon, 23 Sep 2002 20:29:51 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.27,1.28 classifier.py,1.18,1.19 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9255 Modified Files: Options.py classifier.py Log Message: New option use_central_limit2 is Gary Robinson's logarithmic variation of the central-limit code. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** Options.py 23 Sep 2002 22:41:52 -0000 1.27 --- Options.py 24 Sep 2002 03:29:48 -0000 1.28 *************** *** 203,206 **** --- 203,210 ---- # square roots. An NxN test grid should work fine. use_central_limit: False + + # Same as use_central_limit, except takes logarithms of probabilities and + # probability complements (p and 1-p) instead.
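To make the logarithmic variation concrete, here's what it measures per message -- an illustrative sketch only (the helper name is made up; the real code keeps this inline and feeds the means into the same z-score test):

    from math import log

    def log_means(clue_probs):
        # use_central_limit2 scores against mean(log(p)) for the spam
        # view and mean(log(1 - p)) for the ham view, instead of the
        # raw mean(p) that use_central_limit uses.
        ssum = hsum = 0.0
        for p in clue_probs:
            ssum += log(p)
            hsum += log(1.0 - p)
        n = len(clue_probs)
        return hsum / n, ssum / n   # ham-side mean, spam-side mean

    # E.g. spammy clues [0.99, 0.97, 0.9] give a spam-side mean near 0
    # (about -0.05) and a strongly negative ham-side mean (about -3.5).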
+ use_central_limit2: False """ *************** *** 251,254 **** --- 255,259 ---- 'use_central_limit': boolean_cracker, + 'use_central_limit2': boolean_cracker, }, } Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** classifier.py 23 Sep 2002 21:19:08 -0000 1.18 --- classifier.py 24 Sep 2002 03:29:48 -0000 1.19 *************** *** 743,744 **** --- 743,838 ---- if options.use_central_limit: spamprob = central_limit_spamprob + + + + + def central_limit_compute_population_stats2(self, msgstream, is_spam): + from math import ldexp, log + + sum = sumsq = 0 + seen = {} + for msg in msgstream: + for prob, word, record in self._getclues(msg): + if word in seen: + continue + seen[word] = 1 + if is_spam: + prob = log(prob) + else: + prob = log(1.0 - prob) + prob = long(ldexp(prob, 64)) + sum += prob + sumsq += prob * prob + n = len(seen) + + if is_spam: + self.spamn, self.spamsum, self.spamsumsq = n, sum, sumsq + spamsum = self.spamsum + self.spammean = ldexp(spamsum, -64) / self.spamn + spamvar = self.spamsumsq * self.spamn - spamsum**2 + self.spamvar = ldexp(spamvar, -128) / (self.spamn ** 2) + print 'spammean', self.spammean, 'spamvar', self.spamvar + else: + self.hamn, self.hamsum, self.hamsumsq = n, sum, sumsq + hamsum = self.hamsum + self.hammean = ldexp(hamsum, -64) / self.hamn + hamvar = self.hamsumsq * self.hamn - hamsum**2 + self.hamvar = ldexp(hamvar, -128) / (self.hamn ** 2) + print 'hammean', self.hammean, 'hamvar', self.hamvar + + if options.use_central_limit2: + compute_population_stats = central_limit_compute_population_stats2 + + def central_limit_spamprob2(self, wordstream, evidence=False): + """Return best-guess probability that wordstream is spam. + + wordstream is an iterable object producing words. + The return value is a float in [0.0, 1.0]. + + If optional arg evidence is True, the return value is a pair + probability, evidence + where evidence is a list of (word, probability) pairs. + """ + + from math import sqrt, log + + clues = self._getclues(wordstream) + hsum = ssum = 0.0 + for prob, word, record in clues: + ssum += log(prob) + hsum += log(1.0 - prob) + if record is not None: + record.killcount += 1 + n = len(clues) + if n == 0: + return 0.5 + hmean = hsum / n + smean = ssum / n + + # If this sample is drawn from the spam population, its mean is + # distributed around spammean with variance spamvar/n. Likewise + # for if it's drawn from the ham population. Compute a normalized + # z-score (how many stddevs is it away from the population mean?) + # against both populations, and then it's ham or spam depending + # on which population it matches better. + zham = (hmean - self.hammean) / sqrt(self.hamvar / n) + zspam = (smean - self.spammean) / sqrt(self.spamvar / n) + stat = abs(zham) - abs(zspam) # > 0 for spam, < 0 for ham + + # Normalize into [0, 1]. I'm arbitrarily clipping it to fit in + # [-20, 20] first. 20 is a massive z-score difference. 
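A quick worked example of the normalization that follows: with, say, zham = +8.0 and zspam = +0.5 (illustrative numbers, not from a real run), stat is 7.5, which maps to 0.5 + 7.5/40 = 0.6875 -- mildly spammy. Only a z-score gap of 20 or more saturates the score at exactly 0.0 or 1.0.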
+ if stat < -20.0: + stat = -20.0 + elif stat > 20.0: + stat = 20.0 + stat = 0.5 + stat / 40.0 + + if evidence: + clues = [(word, prob) for prob, word, record in clues] + clues.sort(lambda a, b: cmp(a[1], b[1])) + return stat, clues + else: + return stat + + if options.use_central_limit2: + spamprob = central_limit_spamprob2 From anthonybaxter@users.sourceforge.net Tue Sep 24 06:37:14 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Mon, 23 Sep 2002 22:37:14 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.28,1.29 timcv.py,1.8,1.9 timtest.py,1.28,1.29 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5802 Modified Files: Options.py timcv.py timtest.py Log Message: Made the Data/Ham/SetN and Data/Spam/SetN things options that can be over-ridden. Don't see why the rest of us should things this way just because Tim thinks it's the correct way to do things More importantly, means you can do test runs with different corpuses (corpuscles? corpi? corpen?) at the same time. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** Options.py 24 Sep 2002 03:29:48 -0000 1.28 --- Options.py 24 Sep 2002 05:37:11 -0000 1.29 *************** *** 151,154 **** --- 151,160 ---- save_histogram_pickles: False + # default locations for timcv and timtest - these get the set number + # appended. + spam_directories: Data/Spam/Set%d + ham_directories: Data/Ham/Set%d + + [Classifier] # Fiddling these can have extreme effects. See classifier.py for comments. *************** *** 240,243 **** --- 246,251 ---- 'show_charlimit': int_cracker, 'spam_cutoff': float_cracker, + 'spam_directories': string_cracker, + 'ham_directories': string_cracker, }, 'Classifier': {'hambias': float_cracker, Index: timcv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timcv.py,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** timcv.py 23 Sep 2002 20:18:34 -0000 1.8 --- timcv.py 24 Sep 2002 05:37:11 -0000 1.9 *************** *** 52,57 **** print options.display() ! hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets+1)] ! spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)] d = TestDriver.Driver() --- 52,57 ---- print options.display() ! hamdirs = [options.ham_directories % i for i in range(1, nsets+1)] ! spamdirs = [options.spam_directories % i for i in range(1, nsets+1)] d = TestDriver.Driver() Index: timtest.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/timtest.py,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** timtest.py 23 Sep 2002 20:18:34 -0000 1.28 --- timtest.py 24 Sep 2002 05:37:11 -0000 1.29 *************** *** 54,59 **** print options.display() ! spamdirs = ["Data/Spam/Set%d" % i for i in range(1, nsets+1)] ! hamdirs = ["Data/Ham/Set%d" % i for i in range(1, nsets+1)] spamhamdirs = zip(spamdirs, hamdirs) --- 54,59 ---- print options.display() ! spamdirs = [options.spam_directories % i for i in range(1, nsets+1)] ! 
hamdirs = [options.ham_directories % i for i in range(1, nsets+1)] spamhamdirs = zip(spamdirs, hamdirs) From anthonybaxter@users.sourceforge.net Tue Sep 24 06:37:56 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Mon, 23 Sep 2002 22:37:56 -0700 Subject: [Spambayes-checkins] spambayes/email .cvsignore,NONE,1.1 Message-ID: Update of /cvsroot/spambayes/spambayes/email In directory usw-pr-cvs1:/tmp/cvs-serv6305 Added Files: .cvsignore Log Message: silence mr. cvs --- NEW FILE: .cvsignore --- *.pyc *.pyo From anthonybaxter@users.sourceforge.net Tue Sep 24 07:13:32 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Mon, 23 Sep 2002 23:13:32 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.29,1.30 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13054 Modified Files: Options.py Log Message: corrected comment. fixed line endings (mixed dos and unix, ick) Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** Options.py 24 Sep 2002 05:37:11 -0000 1.29 --- Options.py 24 Sep 2002 06:13:29 -0000 1.30 *************** *** 141,159 **** # name already exists, it's overwritten. pickle_basename is ignored when # save_trained_pickles is false. ! ! # if save_histogram_pickles is true, Driver.train() saves a binary ! # pickle of the spam and ham histogram for "all test runs". The file ! # basename is given by pickle_basename, the suffix _spamhist.pik ! # or _hamhist.pik is appended to the basename. save_trained_pickles: False ! pickle_basename: class save_histogram_pickles: False # default locations for timcv and timtest - these get the set number ! # appended. spam_directories: Data/Spam/Set%d ham_directories: Data/Ham/Set%d - [Classifier] --- 141,158 ---- # name already exists, it's overwritten. pickle_basename is ignored when # save_trained_pickles is false. ! ! # if save_histogram_pickles is true, Driver.train() saves a binary ! # pickle of the spam and ham histogram for "all test runs". The file ! # basename is given by pickle_basename, the suffix _spamhist.pik ! # or _hamhist.pik is appended to the basename. save_trained_pickles: False ! pickle_basename: class save_histogram_pickles: False # default locations for timcv and timtest - these get the set number ! # interpolated. spam_directories: Data/Spam/Set%d ham_directories: Data/Ham/Set%d [Classifier] From anthonybaxter@users.sourceforge.net Tue Sep 24 09:16:26 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Tue, 24 Sep 2002 01:16:26 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.24,1.25 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15673 Modified Files: README.txt Log Message: note about unheader.py Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** README.txt 24 Sep 2002 00:37:32 -0000 1.24 --- README.txt 24 Sep 2002 08:16:24 -0000 1.25 *************** *** 129,132 **** --- 129,134 ---- A script to remove unwanted headers from an mbox file. This is mostly useful to delete headers which incorrectly might bias the results. + In default mode, this is similar to 'spamassassin -d', but much, much + faster. 
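For illustration, the header-stripping half of that job is tiny -- a sketch only (the real unheader.py also undoes SpamAssassin's Subject and body markup, and works over whole mboxes):

    import email

    def strip_spam_headers(text):
        # Parse one message and drop every header whose name starts
        # with "X-Spam-"; roughly what unheader() does per message.
        msg = email.message_from_string(text)
        for name in msg.keys():
            if name.lower().startswith("x-spam-"):
                del msg[name]
        return str(msg)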
loosecksum.py From sjoerd@users.sourceforge.net Tue Sep 24 12:43:09 2002 From: sjoerd@users.sourceforge.net (Sjoerd Mullender) Date: Tue, 24 Sep 2002 04:43:09 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11535 Modified Files: cmp.py Log Message: Protect against a mean of 0. Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** cmp.py 23 Sep 2002 21:46:34 -0000 1.12 --- cmp.py 24 Sep 2002 11:43:06 -0000 1.13 *************** *** 82,92 **** mean1,dev1 = m1 mean2,dev2 = m2 ! mp = (mean2 - mean1) * 100.0 / mean1 ! dp = (dev2 - dev1) * 100.0 / dev1 ! ! return "%2.2f %2.2f (%+2.2f%%) %2.2f %2.2f (%+2.2f%%)" % ( ! mean1,mean2,mp, ! dev1,dev2,dp ! ) def dump(p1s, p2s): --- 82,98 ---- mean1,dev1 = m1 mean2,dev2 = m2 ! t = "%7.2f %7.2f " % (mean1, mean2) ! if mean1: ! mp = (mean2 - mean1) * 100.0 / mean1 ! t += "%+7.2f%%" % mp ! else: ! t += "+(was 0)" ! t += " %7.2f %7.2f " % (dev1, dev2) ! if dev1: ! dp = (dev2 - dev1) * 100.0 / dev1 ! t += "%+7.2f%%" % dp ! else: ! t += "+(was 0)" ! return t def dump(p1s, p2s): *************** *** 134,138 **** print ! print "ham mean ham sdev" dumpdev(hamdev1,hamdev2) print --- 140,144 ---- print ! print "ham mean ham sdev" dumpdev(hamdev1,hamdev2) print *************** *** 141,145 **** print ! print "spam mean spam sdev" dumpdev(spamdev1,spamdev2) print --- 147,151 ---- print ! print "spam mean spam sdev" dumpdev(spamdev1,spamdev2) print From bkc@users.sourceforge.net Tue Sep 24 15:38:14 2002 From: bkc@users.sourceforge.net (Brad Clements) Date: Tue, 24 Sep 2002 07:38:14 -0700 Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv6641 Modified Files: HistToGNU.py Log Message: Fix wrong __doc__ for usage Index: HistToGNU.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** HistToGNU.py 24 Sep 2002 00:39:06 -0000 1.2 --- HistToGNU.py 24 Sep 2002 14:38:10 -0000 1.3 *************** *** 5,11 **** Convert saved binary pickle of histograms to gnu plot output ! """ ! ! """Usage: %(program)s [options] [histogrampicklefile ...] reads pickle filename from options if not specified --- 5,9 ---- Convert saved binary pickle of histograms to gnu plot output ! Usage: %(program)s [options] [histogrampicklefile ...] reads pickle filename from options if not specified *************** *** 57,64 **** args = [] for file in files: ! args.append("""'-' %s title "%s" """ % (dataSetOptions,file)) cmd.write('plot %s\n' % ",".join(args)) for file in files: ! outputHist(loadHist(file),cmd) cmd.write('e\n') --- 55,62 ---- args = [] for file in files: ! args.append("""'-' %s title "%s" """ % (dataSetOptions, file)) cmd.write('plot %s\n' % ",".join(args)) for file in files: ! outputHist(loadHist(file), cmd) cmd.write('e\n') From tim.one@comcast.net Tue Sep 24 18:00:01 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 24 Sep 2002 13:00:01 -0400 Subject: [Spambayes-checkins] spambayes Options.py,1.28,1.29 timcv.py,1.8,1.9 timtest.py,1.28,1.29 In-Reply-To: Message-ID: [Anthony Baxter] > ... > Log Message: > Made the Data/Ham/SetN and Data/Spam/SetN things options that can be > over-ridden. 
Don't see why the rest of us should things this way > just because Tim thinks it's the correct way to do things > > More importantly, means you can do test runs with different corpuses > (corpuscles? corpi? corpen?) at the same time. It's a good change -- thanks. Before this, I simply renamed my directories. Don't think that I haven't noticed you're complaining elsewhere that you can't run even one test at a time . From montanaro@users.sourceforge.net Tue Sep 24 19:00:00 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 24 Sep 2002 11:00:00 -0700 Subject: [Spambayes-checkins] spambayes unheader.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25992 Modified Files: unheader.py Log Message: guarantee at least an empty string for the subject Index: unheader.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/unheader.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** unheader.py 23 Sep 2002 21:20:10 -0000 1.3 --- unheader.py 24 Sep 2002 17:59:58 -0000 1.4 *************** *** 38,42 **** msg['Content-Transfer-Encoding'] = pcte ! subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', msg['Subject']) if subj != msg["Subject"]: msg.replace_header("Subject", subj) --- 38,43 ---- msg['Content-Transfer-Encoding'] = pcte ! subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', ! msg['Subject'] or "") if subj != msg["Subject"]: msg.replace_header("Subject", subj) From montanaro@users.sourceforge.net Tue Sep 24 19:07:19 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 24 Sep 2002 11:07:19 -0700 Subject: [Spambayes-checkins] spambayes setup.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv29473 Modified Files: setup.py Log Message: add a bunch more modules and scripts Index: setup.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/setup.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** setup.py 23 Sep 2002 21:20:10 -0000 1.4 --- setup.py 24 Sep 2002 18:07:17 -0000 1.5 *************** *** 2,7 **** setup( ! name='spambayes', ! scripts=['unheader.py', 'hammie.py'], ! py_modules=['classifier', 'tokenizer'] ) --- 2,21 ---- setup( ! name='spambayes', ! scripts=['unheader.py', ! 'hammie.py', ! 'loosecksum.py', ! 'timtest.py', ! 'timcv.py', ! 'splitndirs.py', ! 'runtest.sh', ! 'rebal.py', ! 'cmp.py', ! 'rates.py'], ! py_modules=['classifier', ! 'tokenizer', ! 'Options', ! 'Tester', ! 'TestDriver', ! 'mboxutils'] ) From gvanrossum@users.sourceforge.net Tue Sep 24 19:26:13 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Tue, 24 Sep 2002 11:26:13 -0700 Subject: [Spambayes-checkins] spambayes splitndirs.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv5034 Modified Files: splitndirs.py Log Message: Add -g option to glob each input path. This is handy on Windows. Patch contributed by Alexander Leidinger. Index: splitndirs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/splitndirs.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** splitndirs.py 23 Sep 2002 21:20:10 -0000 1.4 --- splitndirs.py 24 Sep 2002 18:26:11 -0000 1.5 *************** *** 3,7 **** """Split an mbox into N random directories of files. ! 
Usage: %(program)s [-h] [-s seed] [-v] -n N sourcembox ... outdirbase Options: --- 3,7 ---- """Split an mbox into N random directories of files. ! Usage: %(program)s [-h] [-g] [-s seed] [-v] -n N sourcembox ... outdirbase Options: *************** *** 9,12 **** --- 9,17 ---- Print this help message and exit + -g + Do globbing on each sourcepath. This is helpful on Windows, where + the native shells don't glob, or when you have more mboxes than + your shell allows you to specify on the commandline. + -s seed Seed the random number generator with seed (an integer). *************** *** 22,26 **** Arguments: sourcembox ! The mbox to split. outdirbase --- 27,31 ---- Arguments: sourcembox ! The mbox or path to an mbox to split. outdirbase *************** *** 46,49 **** --- 51,55 ---- import email import getopt + import glob import mboxutils *************** *** 65,72 **** def main(): try: ! opts, args = getopt.getopt(sys.argv[1:], 'hn:s:v', ['help']) except getopt.error, msg: usage(1, msg) n = None verbose = False --- 71,79 ---- def main(): try: ! opts, args = getopt.getopt(sys.argv[1:], 'hgn:s:v', ['help']) except getopt.error, msg: usage(1, msg) + doglob = False n = None verbose = False *************** *** 74,77 **** --- 81,86 ---- if opt in ('-h', '--help'): usage(0) + elif opt == '-g': + doglob = True elif opt == '-s': random.seed(int(arg)) *************** *** 95,111 **** counter = 0 for inputpath in inputpaths: ! mbox = mboxutils.getmbox(inputpath) ! for msg in mbox: ! i = random.randrange(n) ! astext = str(msg) ! #assert astext.endswith('\n') ! counter += 1 ! msgfile = open('%s/%d' % (outdirs[i], counter), 'wb') ! msgfile.write(astext) ! msgfile.close() ! if verbose: ! if counter % 100 == 0: ! sys.stdout.write('.') ! sys.stdout.flush() if verbose: --- 104,126 ---- counter = 0 for inputpath in inputpaths: ! if doglob: ! inpaths = glob.glob(inputpath) ! else: ! inpaths = [inputpath] ! ! for inpath in inpaths: ! mbox = mboxutils.getmbox(inpath) ! for msg in mbox: ! i = random.randrange(n) ! astext = str(msg) ! #assert astext.endswith('\n') ! counter += 1 ! msgfile = open('%s/%d' % (outdirs[i], counter), 'wb') ! msgfile.write(astext) ! msgfile.close() ! if verbose: ! if counter % 100 == 0: ! sys.stdout.write('.') ! sys.stdout.flush() if verbose: From anthony@interlink.com.au Tue Sep 24 23:00:27 2002 From: anthony@interlink.com.au (Anthony Baxter) Date: Wed, 25 Sep 2002 08:00:27 +1000 Subject: [Spambayes-checkins] spambayes Options.py,1.28,1.29 timcv.py,1.8,1.9 timtest.py,1.28,1.29 In-Reply-To: Message-ID: <200209242200.g8OM0RV19871@localhost.localdomain> >>> Tim Peters wrote > It's a good change -- thanks. Before this, I simply renamed my directories. > Don't think that I haven't noticed you're complaining elsewhere that you > can't run even one test at a time . Ha! Since when has consistency been an issue? I'm actually doing tests with my smaller corpus of my personal spam+ham, trying out the different sized spam:ham ratios. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From tim_one@users.sourceforge.net Tue Sep 24 23:13:21 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 24 Sep 2002 15:13:21 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.9,1.10 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15036 Modified Files: TestDriver.py Log Message: Changed the first histogram line so it fits in 79 columns. 
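The heart of the -g change above is just a pre-expansion pass over the command-line arguments before the existing mbox loop runs. As a stand-alone sketch (function name invented; the patch does the same thing inline, one argument at a time):

    import glob

    def expand_paths(paths, doglob):
        # With -g, each source argument may be a wildcard pattern;
        # expand it in Python since Windows shells won't.
        inpaths = []
        for p in paths:
            if doglob:
                inpaths.extend(glob.glob(p))
            else:
                inpaths.append(p)
        return inpaths

    # expand_paths(['Data/*.mbox'], True) -> list of matching paths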
Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** TestDriver.py 23 Sep 2002 22:41:04 -0000 1.9 --- TestDriver.py 24 Sep 2002 22:13:19 -0000 1.10 *************** *** 90,98 **** def printhist(tag, ham, spam): print ! print "-> Ham distribution for", tag, ham.display() print ! print "-> Spam distribution for", tag, spam.display() --- 90,98 ---- def printhist(tag, ham, spam): print ! print "-> Ham scores for", tag, ham.display() print ! print "-> Spam scores for", tag, spam.display() From tim_one@users.sourceforge.net Tue Sep 24 23:14:04 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 24 Sep 2002 15:14:04 -0700 Subject: [Spambayes-checkins] spambayes classifier.py,1.19,1.20 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15524 Modified Files: classifier.py Log Message: central_limit_compute_population_stats2(): Squashed code duplication. Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.19 retrieving revision 1.20 diff -C2 -d -r1.19 -r1.20 *** classifier.py 24 Sep 2002 03:29:48 -0000 1.19 --- classifier.py 24 Sep 2002 22:14:01 -0000 1.20 *************** *** 483,486 **** --- 483,488 ---- # XXX More stuff should be reworked to use this as a helper function. def _getclues(self, wordstream): + mindist = options.robinson_minimum_prob_strength + # A priority queue to remember the MAX_DISCRIMINATORS best # probabilities, where "best" means largest distance from 0.5. *************** *** 500,504 **** distance = abs(prob - 0.5) ! if distance > smallest_best: heapreplace(nbest, (distance, prob, word, record)) smallest_best = nbest[0][0] --- 502,506 ---- distance = abs(prob - 0.5) ! if distance >= mindist and distance > smallest_best: heapreplace(nbest, (distance, prob, word, record)) smallest_best = nbest[0][0] *************** *** 764,782 **** sum += prob sumsq += prob * prob n = len(seen) if is_spam: self.spamn, self.spamsum, self.spamsumsq = n, sum, sumsq ! spamsum = self.spamsum ! self.spammean = ldexp(spamsum, -64) / self.spamn ! spamvar = self.spamsumsq * self.spamn - spamsum**2 ! self.spamvar = ldexp(spamvar, -128) / (self.spamn ** 2) print 'spammean', self.spammean, 'spamvar', self.spamvar else: self.hamn, self.hamsum, self.hamsumsq = n, sum, sumsq ! hamsum = self.hamsum ! self.hammean = ldexp(hamsum, -64) / self.hamn ! hamvar = self.hamsumsq * self.hamn - hamsum**2 ! self.hamvar = ldexp(hamvar, -128) / (self.hamn ** 2) print 'hammean', self.hammean, 'hamvar', self.hamvar --- 766,782 ---- sum += prob sumsq += prob * prob + n = len(seen) + mean = ldexp(sum, -64) / n + var = sumsq * n - sum**2 + var = ldexp(var, -128) / n**2 if is_spam: self.spamn, self.spamsum, self.spamsumsq = n, sum, sumsq ! self.spammean, self.spamvar = mean, var print 'spammean', self.spammean, 'spamvar', self.spamvar else: self.hamn, self.hamsum, self.hamsumsq = n, sum, sumsq ! 
self.hammean, self.hamvar = mean, var print 'hammean', self.hammean, 'hamvar', self.hamvar From gvanrossum@users.sourceforge.net Wed Sep 25 02:01:51 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Tue, 24 Sep 2002 18:01:51 -0700 Subject: [Spambayes-checkins] spambayes fpfn.py,NONE,1.1 README.txt,1.25,1.26 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30699 Modified Files: README.txt Added Files: fpfn.py Log Message: Add a tiny utility to extract the filenames of false positives/negatives from the full test run output. (Tested with timcv.py output only.) --- NEW FILE: fpfn.py --- #! /usr/bin/env python """Extract false positive and false negative filenames from timcv.py output.""" import sys import re def cmpf(a, b): # Sort function that sorts by numerical value ma = re.search(r'(\d+)/(\d+)$', a) mb = re.search(r'(\d+)/(\d+)$', b) if ma and mb: xa, ya = map(int, ma.groups()) xb, yb = map(int, mb.groups()) return cmp((xa, ya), (xb, yb)) else: return cmp(a, b) def main(): for name in sys.argv[1:]: try: f = open(name + ".txt") except IOError: f = open(name) print "===", name, "===" fp = [] fn = [] for line in f: if line.startswith(' new fp: '): fp.extend(eval(line[12:])) elif line.startswith(' new fn: '): fn.extend(eval(line[12:])) fp.sort(cmpf) fn.sort(cmpf) print "--- fp ---" for x in fp: print x print "--- fn ---" for x in fn: print x if __name__ == '__main__': main() Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.25 retrieving revision 1.26 diff -C2 -d -r1.25 -r1.26 *** README.txt 24 Sep 2002 08:16:24 -0000 1.25 --- README.txt 25 Sep 2002 01:01:49 -0000 1.26 *************** *** 119,122 **** --- 119,126 ---- and the change in average f-p and f-n rates. + fpfn.py + Given one or more TestDriver output files, prints list of false + positive and false negative filenames, one per line. + Test Data Utilities From tim.one@comcast.net Wed Sep 25 02:21:07 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 24 Sep 2002 21:21:07 -0400 Subject: [Spambayes-checkins] spambayes fpfn.py,NONE,1.1 README.txt,1.25,1.26 In-Reply-To: Message-ID: [Guido] > Modified Files: > README.txt > Added Files: > fpfn.py > Log Message: > Add a tiny utility to extract the filenames of false positives/negatives > from the full test run output. (Tested with timcv.py output only.) The good news is that timcv doesn't print anything, except to dump out all the options in effect at the start. All the printing is done by the TestDriver module, and all the test drivers (timcv, timtest, mboxtest) use that. So you've solved this problem for all of them! There's much method behind all the seeming madness here . From gward@users.sourceforge.net Wed Sep 25 03:02:43 2002 From: gward@users.sourceforge.net (Greg Ward) Date: Tue, 24 Sep 2002 19:02:43 -0700 Subject: [Spambayes-checkins] spambayes unheader.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv19599 Modified Files: unheader.py Log Message: Fix deSA() so it doesn't discard the first line of the body. Change process_mailbox() to use email.Generator directly, in order to disable header-wrapping and preserve headers as much as possible. 
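The Generator detail is the crux here: "print msg" goes through str(msg), which re-wraps long header lines, while a Generator constructed with maxheaderlen=0 writes headers out as parsed. A minimal sketch of the pattern, using the 2002-era email package's callable Generator exactly as the checkin below does (echo_mbox is just an illustrative name, not part of the checkin):

    import sys
    import mailbox
    import email.Parser
    import email.Generator

    def echo_mbox(f):
        # Re-emit each message in an mbox on stdout, preserving the
        # headers as much as possible: maxheaderlen=0 disables wrapping.
        gen = email.Generator.Generator(sys.stdout, maxheaderlen=0)
        for msg in mailbox.PortableUnixMailbox(f, email.Parser.Parser().parse):
            gen(msg, unixfrom=1)    # include the 'From ' envelope line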
Index: unheader.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/unheader.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** unheader.py 24 Sep 2002 17:59:58 -0000 1.4 --- unheader.py 25 Sep 2002 02:02:41 -0000 1.5 *************** *** 6,9 **** --- 6,10 ---- import email.Parser import email.Message + import email.Generator import getopt *************** *** 51,60 **** elif at_start: at_start = 0 ! else: ! newbody.append(line) msg.set_payload("\n".join(newbody)) unheader(msg, "X-Spam-") def process_mailbox(f, dosa=1, pats=None): for msg in mailbox.PortableUnixMailbox(f, Parser().parse): if pats is not None: --- 52,61 ---- elif at_start: at_start = 0 ! newbody.append(line) msg.set_payload("\n".join(newbody)) unheader(msg, "X-Spam-") def process_mailbox(f, dosa=1, pats=None): + gen = email.Generator.Generator(sys.stdout, maxheaderlen=0) for msg in mailbox.PortableUnixMailbox(f, Parser().parse): if pats is not None: *************** *** 62,66 **** if dosa: deSA(msg) ! print msg def usage(): --- 63,67 ---- if dosa: deSA(msg) ! gen(msg, unixfrom=1) def usage(): From anthonybaxter@users.sourceforge.net Wed Sep 25 03:06:54 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Tue, 24 Sep 2002 19:06:54 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.26,1.27 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv20718 Modified Files: README.txt Log Message: document BAYESCUSTOMIZE Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.26 retrieving revision 1.27 diff -C2 -d -r1.26 -r1.27 *** README.txt 25 Sep 2002 01:01:49 -0000 1.26 --- README.txt 25 Sep 2002 02:06:52 -0000 1.27 *************** *** 41,44 **** --- 41,49 ---- near the start, and consult attributes of options. + As an alternative to bayescustomize.ini, you can set the environment + variable BAYESCUSTOMIZE to a list of one or more .ini files, these will + be read in, in order, and applied to the options. This allows you to + tweak individual runs by combining fragments of .ini files. + classifier.py An implementation of a Graham-like classifier. From gward@users.sourceforge.net Wed Sep 25 03:09:00 2002 From: gward@users.sourceforge.net (Greg Ward) Date: Tue, 24 Sep 2002 19:09:00 -0700 Subject: [Spambayes-checkins] spambayes unheader.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv21343 Modified Files: unheader.py Log Message: Make Parser a HeaderParser subclass, so get_payload() returns the raw message body. Necessary because deSA() assumes get_payload() always returns a string, which isn't so if the message has MIME structure. Index: unheader.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/unheader.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** unheader.py 25 Sep 2002 02:02:41 -0000 1.5 --- unheader.py 25 Sep 2002 02:08:58 -0000 1.6 *************** *** 24,28 **** self._headers[i] = (k, newval) ! class Parser(email.Parser.Parser): def __init__(self): email.Parser.Parser.__init__(self, Message) --- 24,28 ---- self._headers[i] = (k, newval) ! 
class Parser(email.Parser.HeaderParser): def __init__(self): email.Parser.Parser.__init__(self, Message) From gvanrossum@users.sourceforge.net Wed Sep 25 03:09:54 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Tue, 24 Sep 2002 19:09:54 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.27,1.28 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv21573 Modified Files: README.txt Log Message: Clarify how to make BAYESCUSTOMIZE into a list (the delimiter is whitespace). Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.27 retrieving revision 1.28 diff -C2 -d -r1.27 -r1.28 *** README.txt 25 Sep 2002 02:06:52 -0000 1.27 --- README.txt 25 Sep 2002 02:09:52 -0000 1.28 *************** *** 41,48 **** near the start, and consult attributes of options. ! As an alternative to bayescustomize.ini, you can set the environment ! variable BAYESCUSTOMIZE to a list of one or more .ini files, these will ! be read in, in order, and applied to the options. This allows you to ! tweak individual runs by combining fragments of .ini files. classifier.py --- 41,49 ---- near the start, and consult attributes of options. ! As an alternative to bayescustomize.ini, you can set the ! environment variable BAYESCUSTOMIZE to a whitespace-separated list ! of one or more .ini files, these will be read in, in order, and ! applied to the options. This allows you to tweak individual runs ! by combining fragments of .ini files. classifier.py From gvanrossum@users.sourceforge.net Wed Sep 25 03:22:18 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Tue, 24 Sep 2002 19:22:18 -0700 Subject: [Spambayes-checkins] spambayes rates.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv25639 Modified Files: rates.py Log Message: If basename ends in .txt, strip it off. I kept creating files named foo.txts.txt because Unix filename completion adds the .txt part... Index: rates.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/rates.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** rates.py 22 Sep 2002 04:19:08 -0000 1.6 --- rates.py 25 Sep 2002 02:22:15 -0000 1.7 *************** *** 35,38 **** --- 35,40 ---- def doit(basename): + if basename.endswith('.txt'): + basename = basename[:-4] try: ifile = file(basename + '.txt') From montanaro@users.sourceforge.net Wed Sep 25 03:45:33 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Tue, 24 Sep 2002 19:45:33 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.30,1.31 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv32737 Modified Files: Options.py Log Message: change one quoted string from "-quotes to '-quotes to keep emacs-mode happy. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** Options.py 24 Sep 2002 06:13:29 -0000 1.30 --- Options.py 25 Sep 2002 02:45:31 -0000 1.31 *************** *** 122,127 **** show_false_negatives: False ! # Near the end of Driver.test(), you can get a listing of the "best ! # discriminators" in the words from the training sets. 
These are the # words whose WordInfo.killcount values are highest, meaning they most # often were among the most extreme clues spamprob() found. The number --- 122,127 ---- show_false_negatives: False ! # Near the end of Driver.test(), you can get a listing of the 'best ! # discriminators' in the words from the training sets. These are the # words whose WordInfo.killcount values are highest, meaning they most # often were among the most extreme clues spamprob() found. The number From tim_one@users.sourceforge.net Wed Sep 25 04:13:12 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 24 Sep 2002 20:13:12 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.10,1.11 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8305 Modified Files: TestDriver.py Log Message: Compute population sdev instead of sample sdev for histogram displays; it doesn't really matter for the purposes of histograms, and using pop sdev makes it more consistent with the speculative central-limit code. Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** TestDriver.py 24 Sep 2002 22:13:19 -0000 1.10 --- TestDriver.py 25 Sep 2002 03:13:09 -0000 1.11 *************** *** 64,76 **** def display(self, WIDTH=60): from math import sqrt ! if self.n > 1: mean = self.sum / self.n ! # sum (x_i - mean)**2 = sum (x_i**2 - 2*x_i*mean + mean**2) = ! # sum x_i**2 - 2*mean*sum x_i + sum mean**2 = ! # sum x_i**2 - 2*mean*mean*n + n*mean**2 = ! # sum x_i**2 - n*mean**2 ! samplevar = (self.sumsq - self.n * mean**2) / (self.n - 1) ! print "%d items; mean %.2f; sample sdev %.2f" % (self.n, ! mean, sqrt(samplevar)) biggest = max(self.buckets) --- 64,71 ---- def display(self, WIDTH=60): from math import sqrt ! if self.n > 0: mean = self.sum / self.n ! var = self.sumsq / self.n - mean**2 ! print "%d items; mean %.2f; sdev %.2f" % (self.n, mean, sqrt(var)) biggest = max(self.buckets) From tim_one@users.sourceforge.net Wed Sep 25 04:16:52 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 24 Sep 2002 20:16:52 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.13,1.14 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9307 Modified Files: cmp.py Log Message: Dang. Changing the histogram output broke pattern-matching code here. Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** cmp.py 24 Sep 2002 11:43:06 -0000 1.13 --- cmp.py 25 Sep 2002 03:16:50 -0000 1.14 *************** *** 18,22 **** # total f-n, # average f-p rate, ! # average f-n rate) # from summary file f. def suck(f): --- 18,22 ---- # total f-n, # average f-p rate, ! # average f-n rate, # from summary file f. def suck(f): *************** *** 31,35 **** if line.startswith('-> tested'): print line, ! if line.find('sample sdev') != -1: vals = line.split(';') mean = float(vals[1].split(' ')[-1]) --- 31,35 ---- if line.startswith('-> tested'): print line, ! if line.find('; sdev ') != -1: vals = line.split(';') mean = float(vals[1].split(' ')[-1]) *************** *** 65,69 **** fpmean = float(get().split()[-1]) fnmean = float(get().split()[-1]) ! 
return fps, fns, fptot, fntot, fpmean, fnmean, hamdev, spamdev,hamdevall,spamdevall def tag(p1, p2): --- 65,70 ---- fpmean = float(get().split()[-1]) fnmean = float(get().split()[-1]) ! return (fps, fns, fptot, fntot, fpmean, fnmean, ! hamdev, spamdev, hamdevall, spamdevall) def tag(p1, p2): From tim_one@users.sourceforge.net Wed Sep 25 04:26:43 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 24 Sep 2002 20:26:43 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.14,1.15 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11854 Modified Files: cmp.py Log Message: Repaired more consequences of the pattern-matching stuff. Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** cmp.py 25 Sep 2002 03:16:50 -0000 1.14 --- cmp.py 25 Sep 2002 03:26:40 -0000 1.15 *************** *** 19,22 **** --- 19,27 ---- # average f-p rate, # average f-n rate, + # list of all ham score deviations, + # list of all spam score deviations, + # ham score deviation for all runs, + # spam score deviations for all runs, + # ) # from summary file f. def suck(f): *************** *** 31,40 **** if line.startswith('-> tested'): print line, ! if line.find('; sdev ') != -1: vals = line.split(';') ! mean = float(vals[1].split(' ')[-1]) ! sdev = float(vals[2].split(' ')[-1]) ! val = (mean,sdev) ! typ = vals[0].split(' ')[2] if line.find('for all runs') != -1: if typ == 'Ham': --- 36,47 ---- if line.startswith('-> tested'): print line, ! if line.find(' items; mean ') != -1: ! "-> Ham distribution for this pair: 1000 items; mean 0.05; sample sdev 0.68" ! # and later "sample " went away vals = line.split(';') ! mean = float(vals[1].split()[-1]) ! sdev = float(vals[2].split()[-1]) ! val = (mean, sdev) ! typ = vals[0].split()[2] if line.find('for all runs') != -1: if typ == 'Ham': From tim_one@users.sourceforge.net Wed Sep 25 04:29:03 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 24 Sep 2002 20:29:03 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.15,1.16 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12543 Modified Files: cmp.py Log Message: Split long lines, added commas. Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** cmp.py 25 Sep 2002 03:26:40 -0000 1.15 --- cmp.py 25 Sep 2002 03:29:01 -0000 1.16 *************** *** 37,41 **** print line, if line.find(' items; mean ') != -1: ! "-> Ham distribution for this pair: 1000 items; mean 0.05; sample sdev 0.68" # and later "sample " went away vals = line.split(';') --- 37,41 ---- print line, if line.find(' items; mean ') != -1: ! # -> Ham distribution for this pair: 1000 items; mean 0.05; sample sdev 0.68 # and later "sample " went away vals = line.split(';') *************** *** 132,137 **** f2n = windowsfy(f2n) ! fp1, fn1, fptot1, fntot1, fpmean1, fnmean1,hamdev1,spamdev1,hamdevall1,spamdevall1 = suck(file(f1n)) ! fp2, fn2, fptot2, fntot2, fpmean2, fnmean2,hamdev2,spamdev2,hamdevall2,spamdevall2 = suck(file(f2n)) print --- 132,140 ---- f2n = windowsfy(f2n) ! (fp1, fn1, fptot1, fntot1, fpmean1, fnmean1, ! hamdev1, spamdev1, hamdevall1, spamdevall1) = suck(file(f1n)) ! ! (fp2, fn2, fptot2, fntot2, fpmean2, fnmean2, ! 
hamdev2, spamdev2, hamdevall2, spamdevall2) = suck(file(f2n)) print *************** *** 163,165 **** diff1 = spamdevall1[0] - hamdevall1[0] diff2 = spamdevall2[0] - hamdevall2[0] ! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1,diff2,(diff2-diff1)) --- 166,170 ---- diff1 = spamdevall1[0] - hamdevall1[0] diff2 = spamdevall2[0] - hamdevall2[0] ! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1, ! diff2, ! diff2 - diff1) From tim_one@users.sourceforge.net Wed Sep 25 06:22:49 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Tue, 24 Sep 2002 22:22:49 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.31,1.32 TestDriver.py,1.11,1.12 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12473 Modified Files: Options.py TestDriver.py Log Message: New option compute_best_cutoffs_from_histograms, enabled by default. This automates analyzing histograms to find "the best" spam_cutoff. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** Options.py 25 Sep 2002 02:45:31 -0000 1.31 --- Options.py 25 Sep 2002 05:22:47 -0000 1.32 *************** *** 111,114 **** --- 111,120 ---- show_histograms: True + # When compute_best_cutoffs_from_histograms is enabled, after the display + # of a ham+spam histogram pair, a listing is given of all the cutoff scores + # (coinciding with a histogram boundary) that minimize the total number of + # misclassified messages (false positives + false negatives). + compute_best_cutoffs_from_histograms: True + # Display spam when # show_spam_lo <= spamprob <= show_spam_hi *************** *** 151,155 **** save_histogram_pickles: False ! # default locations for timcv and timtest - these get the set number # interpolated. spam_directories: Data/Spam/Set%d --- 157,161 ---- save_histogram_pickles: False ! # default locations for timcv and timtest - these get the set number # interpolated. spam_directories: Data/Spam/Set%d *************** *** 247,250 **** --- 253,257 ---- 'spam_directories': string_cracker, 'ham_directories': string_cracker, + 'compute_best_cutoffs_from_histograms': boolean_cracker, }, 'Classifier': {'hambias': float_cracker, Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** TestDriver.py 25 Sep 2002 03:13:09 -0000 1.11 --- TestDriver.py 25 Sep 2002 05:22:47 -0000 1.12 *************** *** 92,95 **** --- 92,130 ---- spam.display() + if not options.compute_best_cutoffs_from_histograms: + return + + # Figure out "the best" spam cutoff point, meaning the one that minimizes + # the total number of misclassified msgs (other definitions are + # certainly possible!). + + # At cutoff 0, everything is called spam, so there are no false negatives, + # and every ham is a false positive. + assert ham.nbuckets == spam.nbuckets + fp = ham.n + fn = 0 + best_total = fp + bests = [(0, fp, fn)] + for i in range(ham.nbuckets): + # When moving the cutoff beyond bucket i, the ham in bucket i + # are redeemed, and the spam in bucket i become false negatives. 
+ fp -= ham.buckets[i] + fn += spam.buckets[i] + if fp + fn <= best_total: + if fp + fn < best_total: + best_total = fp + fn + bests = [] + bests.append((i+1, fp, fn)) + assert fp == 0 + assert fn == spam.n + + i, fp, fn = bests.pop(0) + print '-> best cutoff for', tag, float(i) / ham.nbuckets + print '-> with', fp, 'fp', '+', fn, 'fn =', best_total, 'mistakes' + for i, fp, fn in bests: + print '-> matched at %g (%d fp + %d fn)' % ( + float(i) / ham.nbuckets, fp, fn) + + def printmsg(msg, prob, clues): print msg.tag From gvanrossum@users.sourceforge.net Wed Sep 25 17:24:29 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Wed, 25 Sep 2002 09:24:29 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.33,1.34 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv26958 Modified Files: tokenizer.py Log Message: get_charsets() can return a charset that is a triple of the form (encoding, language, data). Extract the data, assuming the encoding is an ASCII superset and the data (a charset name) is in fact just ascii characters. (The only occurrence in real life of this I've seen uses an encoding name "ansi-x3-4-1968", which is an obscure name for ASCII that Python's codecs collection doesn't seem to support.) Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** tokenizer.py 23 Sep 2002 14:38:41 -0000 1.33 --- tokenizer.py 25 Sep 2002 16:24:26 -0000 1.34 *************** *** 724,727 **** --- 724,730 ---- for x in msg.get_charsets(None): if x is not None: + if isinstance(x, tuple): + assert len(x) == 3 + x = x[2] yield 'charset:' + x.lower() From gward@users.sourceforge.net Wed Sep 25 18:56:12 2002 From: gward@users.sourceforge.net (Greg Ward) Date: Wed, 25 Sep 2002 10:56:12 -0700 Subject: [Spambayes-checkins] spambayes unheader.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv24962a Modified Files: unheader.py Log Message: Add Maildir support: * add -d option * rearrange main() accordingly (NB. I removed the ability to read an mbox folder from stdin, since it didn't actually work and made main() more complicated) * add process_maildir() * factor process_message() out of process_mailbox() Index: unheader.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/unheader.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** unheader.py 25 Sep 2002 02:08:58 -0000 1.6 --- unheader.py 25 Sep 2002 17:56:09 -0000 1.7 *************** *** 3,6 **** --- 3,8 ---- import re import sys + import os + import glob import mailbox import email.Parser *************** *** 56,79 **** unheader(msg, "X-Spam-") def process_mailbox(f, dosa=1, pats=None): gen = email.Generator.Generator(sys.stdout, maxheaderlen=0) for msg in mailbox.PortableUnixMailbox(f, Parser().parse): ! if pats is not None: ! unheader(msg, pats) ! if dosa: ! deSA(msg) gen(msg, unixfrom=1) def usage(): ! print >> sys.stderr, "usage: unheader.py [ -p pat ... ] [ -s ]" print >> sys.stderr, "-p pat gives a regex pattern used to eliminate unwanted headers" print >> sys.stderr, "'-p pat' may be given multiple times" print >> sys.stderr, "-s tells not to remove SpamAssassin headers" def main(args): headerpats = [] dosa = 1 try: ! 
opts, args = getopt.getopt(args, "p:sh") except getopt.GetoptError: usage() --- 58,101 ---- unheader(msg, "X-Spam-") + def process_message(msg, dosa, pats): + if pats is not None: + unheader(msg, pats) + if dosa: + deSA(msg) + def process_mailbox(f, dosa=1, pats=None): gen = email.Generator.Generator(sys.stdout, maxheaderlen=0) for msg in mailbox.PortableUnixMailbox(f, Parser().parse): ! process_message(msg, dosa, pats) gen(msg, unixfrom=1) + def process_maildir(d, dosa=1, pats=None): + parser = Parser() + for fn in glob.glob(os.path.join(d, "cur", "*")): + print ("reading from %s..." % fn), + file = open(fn) + msg = parser.parse(file) + process_message(msg, dosa, pats) + + tmpfn = os.path.join(d, "tmp", os.path.basename(fn)) + tmpfile = open(tmpfn, "w") + print "writing to %s" % tmpfn + email.Generator.Generator(tmpfile, maxheaderlen=0)(msg, unixfrom=0) + + os.rename(tmpfn, fn) + def usage(): ! print >> sys.stderr, "usage: unheader.py [ -p pat ... ] [ -s ] folder" print >> sys.stderr, "-p pat gives a regex pattern used to eliminate unwanted headers" print >> sys.stderr, "'-p pat' may be given multiple times" print >> sys.stderr, "-s tells not to remove SpamAssassin headers" + print >> sys.stderr, "-d means treat folder as a Maildir" def main(args): headerpats = [] dosa = 1 + ismbox = 1 try: ! opts, args = getopt.getopt(args, "p:shd") except getopt.GetoptError: usage() *************** *** 88,100 **** elif opt == "-s": dosa = 0 pats = headerpats and "|".join(headerpats) or None ! if not args: ! f = sys.stdin ! elif len(args) == 1: ! f = file(args[0]) ! else: usage() sys.exit(1) ! process_mailbox(f, dosa, pats) if __name__ == "__main__": --- 110,126 ---- elif opt == "-s": dosa = 0 + elif opt == "-d": + ismbox = 0 pats = headerpats and "|".join(headerpats) or None ! ! if len(args) != 1: usage() sys.exit(1) ! ! if ismbox: ! f = file(args[0]) ! process_mailbox(f, dosa, pats) ! else: ! process_maildir(args[0], dosa, pats) if __name__ == "__main__": From tim_one@users.sourceforge.net Wed Sep 25 19:39:22 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Wed, 25 Sep 2002 11:39:22 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.32,1.33 TestDriver.py,1.12,1.13 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv15075 Modified Files: Options.py TestDriver.py Log Message: New option best_cutoff_fp_weight. The histogram analysis code now finds the buckets that minimize best_cutoff_fp_weight * (# false positives) + (# false negatives) By default it's 1 (minimize total # of misclassified msgs). If, e.g., you're happy to endure 100 false negatives to save 1 false positive, set to 100. Don't be surprised if your f-n rate zooms, though! Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** Options.py 25 Sep 2002 05:22:47 -0000 1.32 --- Options.py 25 Sep 2002 18:39:17 -0000 1.33 *************** *** 102,108 **** # well as 0.90 on Tim's large c.l.py data). # For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked ! # best in all reports so far. Note that you can easily deduce the effect ! # of setting spam_cutoff to any particular value by studying the score ! # histograms -- there's no need to run a test again to see what would happen. spam_cutoff: 0.90 --- 102,106 ---- # well as 0.90 on Tim's large c.l.py data). 
# For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked ! # best in all reports so far. spam_cutoff: 0.90 *************** *** 111,119 **** show_histograms: True ! # When compute_best_cutoffs_from_histograms is enabled, after the display ! # of a ham+spam histogram pair, a listing is given of all the cutoff scores ! # (coinciding with a histogram boundary) that minimize the total number of ! # misclassified messages (false positives + false negatives). compute_best_cutoffs_from_histograms: True # Display spam when --- 109,127 ---- show_histograms: True ! # After the display of a ham+spam histogram pair, you can get a listing of ! # all the cutoff values (coinciding histogram bucket boundaries) that ! # minimize ! # ! # best_cutoff_fp_weight * (# false positives) + (# false negatives) ! # ! # By default, best_cutoff_fp_weight is 1, and so the cutoffs that miminize ! # the total number of misclassified messages (fp+fn) are shown. If you hate ! # fp more than fn, set the weight to something larger than 1. For example, ! # if you're willing to endure 100 false negatives to save 1 false positive, ! # set it to 100. ! # Note: You may wish to increase nbuckets, to give this scheme more cutoff ! # values to analyze. compute_best_cutoffs_from_histograms: True + best_cutoff_fp_weight: 1 # Display spam when *************** *** 254,257 **** --- 262,266 ---- 'ham_directories': string_cracker, 'compute_best_cutoffs_from_histograms': boolean_cracker, + 'best_cutoff_fp_weight': float_cracker, }, 'Classifier': {'hambias': float_cracker, Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** TestDriver.py 25 Sep 2002 05:22:47 -0000 1.12 --- TestDriver.py 25 Sep 2002 18:39:17 -0000 1.13 *************** *** 102,108 **** # and every ham is a false positive. assert ham.nbuckets == spam.nbuckets fp = ham.n fn = 0 ! best_total = fp bests = [(0, fp, fn)] for i in range(ham.nbuckets): --- 102,109 ---- # and every ham is a false positive. assert ham.nbuckets == spam.nbuckets + fpw = options.best_cutoff_fp_weight fp = ham.n fn = 0 ! best_total = fpw * fp + fn bests = [(0, fp, fn)] for i in range(ham.nbuckets): *************** *** 111,117 **** fp -= ham.buckets[i] fn += spam.buckets[i] ! if fp + fn <= best_total: ! if fp + fn < best_total: ! best_total = fp + fn bests = [] bests.append((i+1, fp, fn)) --- 112,119 ---- fp -= ham.buckets[i] fn += spam.buckets[i] ! total = fpw * fp + fn ! if total <= best_total: ! if total < best_total: ! best_total = total bests = [] bests.append((i+1, fp, fn)) *************** *** 121,128 **** i, fp, fn = bests.pop(0) print '-> best cutoff for', tag, float(i) / ham.nbuckets ! print '-> with', fp, 'fp', '+', fn, 'fn =', best_total, 'mistakes' for i, fp, fn in bests: ! print '-> matched at %g (%d fp + %d fn)' % ( ! float(i) / ham.nbuckets, fp, fn) --- 123,135 ---- i, fp, fn = bests.pop(0) print '-> best cutoff for', tag, float(i) / ham.nbuckets ! print '-> with weighted total %g*%d fp + %d fn = %g' % ( ! fpw, fp, fn, best_total) ! print '-> fp rate %.3g%% fn rate %.3g%%' % ( ! fp * 1e2 / ham.n, fn * 1e2 / spam.n) for i, fp, fn in bests: ! print ('-> matched at %g with %d fp & %d fn; ' ! 'fp rate %.3g%%; fn rate %.3g%%' % ( ! float(i) / ham.nbuckets, fp, fn, ! 
fp * 1e2 / ham.n, fn * 1e2 / spam.n)) From gward@users.sourceforge.net Wed Sep 25 21:07:10 2002 From: gward@users.sourceforge.net (Greg Ward) Date: Wed, 25 Sep 2002 13:07:10 -0700 Subject: [Spambayes-checkins] spambayes msgs.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7639 Modified Files: msgs.py Log Message: Python 2.2 compat. Index: msgs.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/msgs.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** msgs.py 23 Sep 2002 21:20:10 -0000 1.3 --- msgs.py 25 Sep 2002 20:07:06 -0000 1.4 *************** *** 1,2 **** --- 1,4 ---- + from __future__ import generators + import os import random From tim_one@users.sourceforge.net Thu Sep 26 02:10:32 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Wed, 25 Sep 2002 18:10:32 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.13,1.14 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv10786 Modified Files: TestDriver.py Log Message: The numerically naive way of computing the sdev for the histogram display finally went negative on me. This isn't worth fixing right -- just call it 0 when this happens here. Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** TestDriver.py 25 Sep 2002 18:39:17 -0000 1.13 --- TestDriver.py 26 Sep 2002 01:10:29 -0000 1.14 *************** *** 67,70 **** --- 67,75 ---- mean = self.sum / self.n var = self.sumsq / self.n - mean**2 + # The vagaries of f.p. rounding can make var come out negative. + # There are ways to fix that, but they're too painful for this + # part of the code to endure. + if var < 0.0: + var = 0.0 print "%d items; mean %.2f; sdev %.2f" % (self.n, mean, sqrt(var)) From tim_one@users.sourceforge.net Thu Sep 26 04:20:53 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Wed, 25 Sep 2002 20:20:53 -0700 Subject: [Spambayes-checkins] spambayes cmp.py,1.16,1.17 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7354 Modified Files: cmp.py Log Message: Restored ability to analyze older result files (from before the time ham & spam mean & sdevs were displayed). Added more commas. Index: cmp.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/cmp.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** cmp.py 25 Sep 2002 03:29:01 -0000 1.16 --- cmp.py 26 Sep 2002 03:20:51 -0000 1.17 *************** *** 30,33 **** --- 30,34 ---- hamdev = [] spamdev = [] + hamdevall = spamdevall = (0.0, 0.0) get = f.readline *************** *** 87,93 **** return t ! def mtag(m1,m2): ! mean1,dev1 = m1 ! mean2,dev2 = m2 t = "%7.2f %7.2f " % (mean1, mean2) if mean1: --- 88,94 ---- return t ! def mtag(m1, m2): ! mean1, dev1 = m1 ! mean2, dev2 = m2 t = "%7.2f %7.2f " % (mean1, mean2) if mean1: *************** *** 115,120 **** print ! def dumpdev(meandev1,meandev2): ! for m1,m2 in zip(meandev1,meandev2): print mtag(m1, m2) --- 116,121 ---- print ! def dumpdev(meandev1, meandev2): ! for m1, m2 in zip(meandev1, meandev2): print mtag(m1, m2) *************** *** 151,170 **** print ! print "ham mean ham sdev" ! dumpdev(hamdev1,hamdev2) ! print ! print "ham mean and sdev for all runs" ! 
dumpdev([hamdevall1],[hamdevall2]) ! print ! print "spam mean spam sdev" ! dumpdev(spamdev1,spamdev2) ! print ! print "spam mean and sdev for all runs" ! dumpdev([spamdevall1],[spamdevall2]) ! print ! diff1 = spamdevall1[0] - hamdevall1[0] ! diff2 = spamdevall2[0] - hamdevall2[0] ! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1, ! diff2, ! diff2 - diff1) --- 152,176 ---- print ! if len(hamdev1) == len(hamdev2) and len(spamdev1) == len(spamdev2): ! print "ham mean ham sdev" ! dumpdev(hamdev1, hamdev2) ! print ! print "ham mean and sdev for all runs" ! dumpdev([hamdevall1], [hamdevall2]) ! ! print ! print "spam mean spam sdev" ! dumpdev(spamdev1, spamdev2) ! print ! print "spam mean and sdev for all runs" ! dumpdev([spamdevall1], [spamdevall2]) ! ! print ! diff1 = spamdevall1[0] - hamdevall1[0] ! diff2 = spamdevall2[0] - hamdevall2[0] ! print "ham/spam mean difference: %2.2f %2.2f %+2.2f" % (diff1, ! diff2, ! diff2 - diff1) ! else: ! print "[info about ham & spam means & sdevs not available in both files]" From barry@users.sourceforge.net Thu Sep 26 04:22:58 2002 From: barry@users.sourceforge.net (Barry Warsaw) Date: Wed, 25 Sep 2002 20:22:58 -0700 Subject: [Spambayes-checkins] spambayes/email __init__.py,1.1.1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes/email In directory usw-pr-cvs1:/tmp/cvs-serv8023 Modified Files: __init__.py Log Message: On Guido's request, backporting mimelib change: Move the imports of Parser and Message inside the message_from_string() and message_from_file() functions. This way just "import email" won't suck in most of the submodules of the package. Note: this will break code that relied on "import email" giving you a bunch of the submodules, but that was never documented and should not have been relied on. However, I'm setting __version__ to 2.4a0 since 2.4 has not yet been released (waiting for closure on a few outstanding issues). Index: __init__.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/email/__init__.py,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** __init__.py 23 Sep 2002 13:18:55 -0000 1.1.1.1 --- __init__.py 26 Sep 2002 03:22:56 -0000 1.2 *************** *** 5,9 **** """ ! __version__ = '2.3.1' __all__ = ['Charset', --- 5,9 ---- """ ! __version__ = '2.4a0' __all__ = ['Charset', *************** *** 29,39 **** ! # Some convenience routines ! from email.Parser import Parser as _Parser ! from email.Message import Message as _Message ! ! def message_from_string(s, _class=_Message, strict=0): ! return _Parser(_class, strict=strict).parsestr(s) ! def message_from_file(fp, _class=_Message, strict=0): ! return _Parser(_class, strict=strict).parse(fp) --- 29,46 ---- ! # Some convenience routines. Don't import Parser and Message as side-effects ! # of importing email since those cascadingly import most of the rest of the ! # email package. ! def message_from_string(s, _class=None, strict=0): ! from email.Parser import Parser ! if _class is None: ! from email.Message import Message ! _class = Message ! return Parser(_class, strict=strict).parsestr(s) ! def message_from_file(fp, _class=None, strict=0): ! from email.Parser import Parser ! if _class is None: ! from email.Message import Message ! _class = Message ! 
return Parser(_class, strict=strict).parse(fp) From anthonybaxter@users.sourceforge.net Thu Sep 26 09:24:33 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 26 Sep 2002 01:24:33 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.34,1.35 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11559 Modified Files: tokenizer.py Log Message: a number of my remaining false positives were caused by HTML email with inline stylesheets. These were punished because things like COLOR: #ffffff are now only seen in badly formatted spams with HTML in non-HTML MIME parts. Strip out the stylesheets when we strip out HTML. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** tokenizer.py 25 Sep 2002 16:24:26 -0000 1.34 --- tokenizer.py 26 Sep 2002 08:24:30 -0000 1.35 *************** *** 575,578 **** --- 575,582 ---- """, re.VERBOSE) + # An equally cheap-ass gimmick to strip style sheets + stylesheet_re = re.compile(r"<style.*?</style>", + re.IGNORECASE|re.DOTALL) + received_host_re = re.compile(r'from (\S+)\s') received_ip_re = re.compile(r'\s[[(]((\d{1,3}\.?){4})[\])]') *************** *** 1040,1043 **** --- 1044,1048 ---- not options.retain_pure_html_tags): text = html_re.sub(' ', text) + text = stylesheet_re.sub(' ', text) # Tokenize everything in the body. From anthonybaxter@users.sourceforge.net Thu Sep 26 09:35:08 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Thu, 26 Sep 2002 01:35:08 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.35,1.36 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv14947 Modified Files: tokenizer.py Log Message: *sigh* do them in the right order. This is why we run the full test before we do Mr. Checkin, isn't it, Anthony? Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** tokenizer.py 26 Sep 2002 08:24:30 -0000 1.35 --- tokenizer.py 26 Sep 2002 08:35:06 -0000 1.36 *************** *** 1043,1048 **** if (part.get_content_type() == "text/plain" or not options.retain_pure_html_tags): - text = html_re.sub(' ', text) text = stylesheet_re.sub(' ', text) # Tokenize everything in the body. --- 1043,1048 ---- if (part.get_content_type() == "text/plain" or not options.retain_pure_html_tags): text = stylesheet_re.sub(' ', text) + text = html_re.sub(' ', text) # Tokenize everything in the body. From sjoerd@users.sourceforge.net Thu Sep 26 09:40:22 2002 From: sjoerd@users.sourceforge.net (Sjoerd Mullender) Date: Thu, 26 Sep 2002 01:40:22 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.36,1.37 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv16737 Modified Files: tokenizer.py Log Message: Import email.Message and email.Errors explicitly.
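This ties in with the email/__init__.py change above: with the lazy __init__, "import email" no longer drags in the submodules as a side effect, so a module that uses email.Message or email.Errors must import them itself. A tiny sketch of the failure mode and the fix (illustrative only, not part of the checkin):

    import email
    # Under the lazy package __init__, the next line would die with
    # AttributeError, because email.Message was never imported:
    #     print email.Message.Message
    import email.Message            # explicit import fixes it
    import email.Errors

    print email.Message.Message     # now fine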
Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.36 retrieving revision 1.37 diff -C2 -d -r1.36 -r1.37 *** tokenizer.py 26 Sep 2002 08:35:06 -0000 1.36 --- tokenizer.py 26 Sep 2002 08:40:20 -0000 1.37 *************** *** 5,8 **** --- 5,10 ---- import email + import email.Message + import email.Errors import re from sets import Set From sjoerd@users.sourceforge.net Thu Sep 26 09:46:13 2002 From: sjoerd@users.sourceforge.net (Sjoerd Mullender) Date: Thu, 26 Sep 2002 01:46:13 -0700 Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18312 Modified Files: HistToGNU.py Log Message: Converted \r\n line endings to \n. Index: HistToGNU.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** HistToGNU.py 24 Sep 2002 14:38:10 -0000 1.3 --- HistToGNU.py 26 Sep 2002 08:46:11 -0000 1.4 *************** *** 1,30 **** ! #! /usr/bin/env python ! ! """HistToGNU.py ! ! Convert saved binary pickle of histograms to gnu plot output ! ! Usage: %(program)s [options] [histogrampicklefile ...] ! ! reads pickle filename from options if not specified ! ! writes to stdout - """ - - globalOptions = """ - set grid - set xtics 5 - set xrange [0.0:100.0] - """ - - dataSetOptions="smooth unique" - from Options import options ! from TestDriver import Hist ! ! import sys import cPickle as pickle ! program = sys.argv[0] --- 1,30 ---- ! #! /usr/bin/env python ! ! """HistToGNU.py ! ! Convert saved binary pickle of histograms to gnu plot output ! ! Usage: %(program)s [options] [histogrampicklefile ...] ! ! reads pickle filename from options if not specified ! ! writes to stdout ! ! """ ! ! globalOptions = """ ! set grid ! set xtics 5 ! set xrange [0.0:100.0] ! """ ! ! dataSetOptions="smooth unique" from Options import options ! from TestDriver import Hist ! ! import sys import cPickle as pickle ! program = sys.argv[0] *************** *** 36,67 **** print >> sys.stderr, __doc__ % globals() sys.exit(code) ! ! def loadHist(path): ! """Load the histogram pickle object""" ! return pickle.load(file(path)) ! ! def outputHist(hist,f=sys.stdout): ! """Output the Hist object to file f""" ! for i in range(len(hist.buckets)): ! n = hist.buckets[i] ! if n: ! f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n)) ! ! def plot(files): ! """given a list of files, create gnu-plot file""" ! import cStringIO, os ! cmd = cStringIO.StringIO() ! cmd.write(globalOptions) ! args = [] ! for file in files: ! args.append("""'-' %s title "%s" """ % (dataSetOptions, file)) ! cmd.write('plot %s\n' % ",".join(args)) ! for file in files: ! outputHist(loadHist(file), cmd) ! cmd.write('e\n') ! ! cmd.write('pause 100\n') ! print cmd.getvalue() ! def main(): import getopt --- 36,67 ---- print >> sys.stderr, __doc__ % globals() sys.exit(code) ! ! def loadHist(path): ! """Load the histogram pickle object""" ! return pickle.load(file(path)) ! ! def outputHist(hist,f=sys.stdout): ! """Output the Hist object to file f""" ! for i in range(len(hist.buckets)): ! n = hist.buckets[i] ! if n: ! f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n)) ! ! def plot(files): ! """given a list of files, create gnu-plot file""" ! import cStringIO, os ! cmd = cStringIO.StringIO() ! cmd.write(globalOptions) ! args = [] ! 
for file in files: ! args.append("""'-' %s title "%s" """ % (dataSetOptions, file)) ! cmd.write('plot %s\n' % ",".join(args)) ! for file in files: ! outputHist(loadHist(file), cmd) ! cmd.write('e\n') ! ! cmd.write('pause 100\n') ! print cmd.getvalue() ! def main(): import getopt *************** *** 72,87 **** except getopt.error, msg: usage(1, msg) ! ! if not args and options.save_histogram_pickles: ! args = [] ! for f in ('ham', 'spam'): ! fname = "%s_%shist.pik" % (options.pickle_basename, f) ! args.append(fname) ! ! if args: ! plot(args) ! else: ! print "could not locate any files to plot" ! ! if __name__ == "__main__": ! main() --- 72,87 ---- except getopt.error, msg: usage(1, msg) ! ! if not args and options.save_histogram_pickles: ! args = [] ! for f in ('ham', 'spam'): ! fname = "%s_%shist.pik" % (options.pickle_basename, f) ! args.append(fname) ! ! if args: ! plot(args) ! else: ! print "could not locate any files to plot" ! ! if __name__ == "__main__": ! main() From sjoerd@users.sourceforge.net Thu Sep 26 09:47:32 2002 From: sjoerd@users.sourceforge.net (Sjoerd Mullender) Date: Thu, 26 Sep 2002 01:47:32 -0700 Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv18679 Modified Files: HistToGNU.py Log Message: Output all values since if you have a large value and then many 0 values, the line would just be a gentle slope instead of dropping down sharply. Index: HistToGNU.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** HistToGNU.py 26 Sep 2002 08:46:11 -0000 1.4 --- HistToGNU.py 26 Sep 2002 08:47:29 -0000 1.5 *************** *** 45,50 **** for i in range(len(hist.buckets)): n = hist.buckets[i] ! if n: ! f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n)) def plot(files): --- 45,49 ---- for i in range(len(hist.buckets)): n = hist.buckets[i] ! f.write("%.3f %d\n" % ( (100.0 * i) / hist.nbuckets, n)) def plot(files): From barry@users.sourceforge.net Thu Sep 26 21:22:17 2002 From: barry@users.sourceforge.net (Barry Warsaw) Date: Thu, 26 Sep 2002 13:22:17 -0700 Subject: [Spambayes-checkins] spambayes/email Message.py,1.1.1.1,1.2 Message-ID: Update of /cvsroot/spambayes/spambayes/email In directory usw-pr-cvs1:/tmp/cvs-serv6467 Modified Files: Message.py Log Message: Side-porting from the email package: Fixing some RFC 2231 related issues as reported in the Spambayes project, and with assistance from Oleg Broytmann. Specifically, get_param(), get_params(): Document that these methods may return parameter values that are either strings, or 3-tuples in the case of RFC 2231 encoded parameters. The application should be prepared to deal with such return values. get_boundary(): Be prepared to deal with RFC 2231 encoded boundary parameters. It makes little sense to have boundaries that are anything but ascii, so if we get back a 3-tuple from get_param() we will decode it into ascii and let any failures percolate up. get_content_charset(): New method which treats the charset parameter just like the boundary parameter in get_boundary(). Note that "get_charset()" was already taken to return the default Charset object. get_charsets(): Rewrite to use get_content_charset(). 
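The practical upshot of the new get_param() contract: a parameter may come back as a plain string, or as a (CHARSET, LANGUAGE, VALUE) 3-tuple when it was RFC 2231 encoded. A hedged sketch of caller-side handling, modeled on what get_boundary() does in the diff below (get_ascii_param is an illustrative helper, not part of this patch):

    def get_ascii_param(msg, name, failobj=None):
        # get_param() returns either a string or, for an RFC 2231
        # encoded parameter, a (charset, language, value) 3-tuple.
        val = msg.get_param(name, failobj)
        if isinstance(val, tuple):
            # Decode per the declared charset; for parameters like
            # 'boundary' the result had better be plain ascii.
            val = unicode(val[2], val[0] or 'us-ascii').encode('us-ascii')
        return val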
Index: Message.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/email/Message.py,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** Message.py 23 Sep 2002 13:18:55 -0000 1.1.1.1 --- Message.py 26 Sep 2002 20:22:15 -0000 1.2 *************** *** 54,58 **** def _unquotevalue(value): if isinstance(value, TupleType): ! return (value[0], value[1], Utils.unquote(value[2])) else: return Utils.unquote(value) --- 54,58 ---- def _unquotevalue(value): if isinstance(value, TupleType): ! return value[0], value[1], Utils.unquote(value[2]) else: return Utils.unquote(value) *************** *** 510,515 **** split on the `=' sign. The left hand side of the `=' is the key, while the right hand side is the value. If there is no `=' sign in ! the parameter the value is the empty string. The value is always ! unquoted, unless unquote is set to a false value. Optional failobj is the object to return if there is no Content-Type: --- 510,515 ---- split on the `=' sign. The left hand side of the `=' is the key, while the right hand side is the value. If there is no `=' sign in ! the parameter the value is the empty string. The value is as ! described in the get_param() method. Optional failobj is the object to return if there is no Content-Type: *************** *** 530,538 **** Optional failobj is the object to return if there is no Content-Type: ! header. Optional header is the header to search instead of ! Content-Type: ! Parameter keys are always compared case insensitively. Values are ! always unquoted, unless unquote is set to a false value. """ if not self.has_key(header): --- 530,550 ---- Optional failobj is the object to return if there is no Content-Type: ! header, or the Content-Type header has no such parameter. Optional ! header is the header to search instead of Content-Type: ! Parameter keys are always compared case insensitively. The return ! value can either be a string, or a 3-tuple if the parameter was RFC ! 2231 encoded. When it's a 3-tuple, the elements of the value are of ! the form (CHARSET, LANGUAGE, VALUE), where LANGUAGE may be the empty ! string. Your application should be prepared to deal with these, and ! can convert the parameter to a Unicode string like so: ! ! param = msg.get_param('foo') ! if isinstance(param, tuple): ! param = unicode(param[2], param[0]) ! ! In any case, the parameter value (either the returned string, or the ! VALUE item in the 3-tuple) is always unquoted, unless unquote is set ! to a false value. """ if not self.has_key(header): *************** *** 675,678 **** --- 687,693 ---- if boundary is missing: return failobj + if isinstance(boundary, TupleType): + # RFC 2231 encoded, so decode. It better end up as ascii + return unicode(boundary[2], boundary[0]).encode('us-ascii') return _unquotevalue(boundary.strip()) *************** *** 728,731 **** --- 743,761 ---- from email._compat21 import walk + def get_content_charset(self, failobj=None): + """Return the charset parameter of the Content-Type header. + + If there is no Content-Type header, or if that header has no charset + parameter, failobj is returned. + """ + missing = [] + charset = self.get_param('charset', missing) + if charset is missing: + return failobj + if isinstance(charset, TupleType): + # RFC 2231 encoded, so decode it, and it better end up as ascii. 
+ return unicode(charset[2], charset[0]).encode('us-ascii') + return charset + def get_charsets(self, failobj=None): """Return a list containing the charset(s) used in this message. *************** *** 744,746 **** message will still return a list of length 1. """ ! return [part.get_param('charset', failobj) for part in self.walk()] --- 774,776 ---- message will still return a list of length 1. """ ! return [part.get_content_charset(failobj) for part in self.walk()] From gvanrossum@users.sourceforge.net Thu Sep 26 21:26:04 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Thu, 26 Sep 2002 13:26:04 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.37,1.38 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7821 Modified Files: tokenizer.py Log Message: Now that the email package has been updated, we don't need to deal with triples returned by get_charsets(). But we need to fix the aliases dictionary to include 'ansi_x3_4_1968'. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.37 retrieving revision 1.38 diff -C2 -d -r1.37 -r1.38 *** tokenizer.py 26 Sep 2002 08:40:20 -0000 1.37 --- tokenizer.py 26 Sep 2002 20:26:02 -0000 1.38 *************** *** 12,15 **** --- 12,21 ---- from Options import options + # Patch encodings.aliases to recognize 'ansi_x3_4_1968' + from encodings.aliases import aliases # The aliases dictionary + if not aliases.has_key('ansi_x3_4_1968'): + aliases['ansi_x3_4_1968'] = 'ascii' + del aliases # Not needed any more + ############################################################################## # To fold case or not to fold case? I didn't want to fold case, because *************** *** 730,736 **** for x in msg.get_charsets(None): if x is not None: - if isinstance(x, tuple): - assert len(x) == 3 - x = x[2] yield 'charset:' + x.lower() --- 736,739 ---- From tim_one@users.sourceforge.net Fri Sep 27 01:08:15 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 26 Sep 2002 17:08:15 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.38,1.39 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv13081 Modified Files: tokenizer.py Log Message: stylesheet_re: removed the IGNORECASE. The text is already lower()ed, and IGNORECASE makes the engine do extra work. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** tokenizer.py 26 Sep 2002 20:26:02 -0000 1.38 --- tokenizer.py 27 Sep 2002 00:08:13 -0000 1.39 *************** *** 584,589 **** # An equally cheap-ass gimmick to strip style sheets ! stylesheet_re = re.compile(r"<style.*?</style>", ! re.IGNORECASE|re.DOTALL) received_host_re = re.compile(r'from (\S+)\s') --- 584,588 ---- # An equally cheap-ass gimmick to strip style sheets ! 
stylesheet_re = re.compile(r"<style.*?</style>", re.DOTALL) received_host_re = re.compile(r'from (\S+)\s') From tim_one@users.sourceforge.net Fri Sep 27 02:28:46 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Thu, 26 Sep 2002 18:28:46 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.39,1.40 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv30138 Modified Files: tokenizer.py Log Message: Beefed up HTML stripping: Accepts more kinds of <style> tags, and strips the style sheet guts along with the tag -- stylesheet_re is folded into html_re. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.39 retrieving revision 1.40 diff -C2 -d -r1.39 -r1.40 *** tokenizer.py 27 Sep 2002 00:08:13 -0000 1.39 --- tokenizer.py 27 Sep 2002 01:28:43 -0000 1.40 *************** *** 578,588 **** html_re = re.compile(r""" < ! (?![\s<>]) # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b' ! [^>]{0,256} # search for the end '>', but don't run wild > ! """, re.VERBOSE) ! # An equally cheap-ass gimmick to strip style sheets ! stylesheet_re = re.compile(r"<style.*?</style>", re.DOTALL) received_host_re = re.compile(r'from (\S+)\s') --- 578,596 ---- html_re = re.compile(r""" < ! (?![\s<>]) # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b' ! (?: ! # style sheets can be very long ! style\b # maybe it's <style, maybe with attributes ! .{0,2048}? # absorb the style sheet, but don't run wild ! </style # stop at the closing tag ! | ! [^>]{0,256} # search for the end '>', but don't run wild ! ) > ! """, re.VERBOSE | re.DOTALL) received_host_re = re.compile(r'from (\S+)\s') *************** *** 1047,1051 **** if (part.get_content_type() == "text/plain" or not options.retain_pure_html_tags): - text = stylesheet_re.sub(' ', text) text = html_re.sub(' ', text) --- 1055,1058 ---- From nascheme@users.sourceforge.net Fri Sep 27 05:03:02 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Thu, 26 Sep 2002 21:03:02 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.33,1.34 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv4115 Modified Files: Options.py Log Message: Add mine_message_ids option. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** Options.py 25 Sep 2002 18:39:17 -0000 1.33 --- Options.py 27 Sep 2002 04:02:59 -0000 1.34 *************** *** 93,96 **** --- 93,99 ---- mine_received_headers: False + # If set, the Message-Id is broken down into, hopefully, useful evidence. + mine_message_ids: False + [TestDriver] # These control various displays in class TestDriver.Driver, and Tester.Test. *************** *** 239,242 **** --- 242,246 ---- 'count_all_header_lines': boolean_cracker, 'mine_received_headers': boolean_cracker, + 'mine_message_ids': boolean_cracker, 'check_octets': boolean_cracker, 'octet_prefix_size': int_cracker, From nascheme@users.sourceforge.net Fri Sep 27 05:06:15 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Thu, 26 Sep 2002 21:06:15 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.40,1.41 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv4807 Modified Files: tokenizer.py Log Message: Add basic message-id tokenization. Right now it just checks that it exists and conforms to the usual syntax. If it does, the host part is also returned. I tried doing more but the extra stuff was never considered a good discriminator. Stupid wins again. :-) Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** tokenizer.py 27 Sep 2002 01:28:43 -0000 1.40 --- tokenizer.py 27 Sep 2002 04:06:12 -0000 1.41 *************** *** 597,600 **** --- 597,602 ---- received_ip_re = re.compile(r'\s[[(]((\d{1,3}\.?){4})[\])]') + message_id_re = re.compile(r'\s*<[^@]+@([^>]+)>\s*') + # I'm usually just splitting on whitespace, but for subject lines I want to # break things like "Python/Perl comparison?" up.
OTOH, I don't want to *************** *** 981,984 **** --- 983,996 ---- for tok in breakdown(m.group(1).lower()): yield 'received:' + tok + + if options.mine_message_ids: + msgid = msg.get("message-id", "") + m = message_id_re.match(msgid) + if not m: + # might be weird instead of invalid but who cares? + yield 'message-id:invalid' + else: + # looks okay, return the hostname only + yield 'message-id:@%s' % m.group(1) # As suggested by Anthony Baxter, merely counting the number of From anthonybaxter@users.sourceforge.net Fri Sep 27 09:36:06 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Fri, 27 Sep 2002 01:36:06 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.14,1.15 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11761 Modified Files: TestDriver.py Log Message: more mixed line endings. Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** TestDriver.py 26 Sep 2002 01:10:29 -0000 1.14 --- TestDriver.py 27 Sep 2002 08:36:03 -0000 1.15 *************** *** 207,218 **** if options.show_histograms: printhist("all runs:", self.global_ham_hist, self.global_spam_hist) ! ! if options.save_histogram_pickles: ! for f, h in (('ham', self.global_ham_hist), ('spam', self.global_spam_hist)): ! fname = "%s_%shist.pik" % (options.pickle_basename, f) ! print " saving %s histogram pickle to %s" %(f, fname) ! fp = file(fname, 'wb') ! pickle.dump(h, fp, 1) ! fp.close() def test(self, ham, spam): --- 207,219 ---- if options.show_histograms: printhist("all runs:", self.global_ham_hist, self.global_spam_hist) ! ! if options.save_histogram_pickles: ! for f, h in (('ham', self.global_ham_hist), ! ('spam', self.global_spam_hist)): ! fname = "%s_%shist.pik" % (options.pickle_basename, f) ! print " saving %s histogram pickle to %s" %(f, fname) ! fp = file(fname, 'wb') ! pickle.dump(h, fp, 1) ! fp.close() def test(self, ham, spam): From gvanrossum@users.sourceforge.net Fri Sep 27 19:48:07 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Fri, 27 Sep 2002 11:48:07 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.21,1.22 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv22991 Modified Files: hammie.py Log Message: Patch inspired by Alexander Leiding to support multiple -g, -s, -u arguments. Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** hammie.py 24 Sep 2002 00:38:37 -0000 1.21 --- hammie.py 27 Sep 2002 18:48:05 -0000 1.22 *************** *** 11,18 **** --- 11,21 ---- -g PATH mbox or directory of known good messages (non-spam) to train on. + Can be specified more than once. -s PATH mbox or directory of known spam messages to train on. + Can be specified more than once. -u PATH mbox of unknown messages. A ham/spam decision is reported for each. + Can be specified more than once. -p FILE use file as the persistent store. loads data from this file if it *************** *** 264,268 **** pck = DEFAULTDB ! good = spam = unknown = None do_filter = usedb = False for opt, arg in opts: --- 267,273 ---- pck = DEFAULTDB ! good = [] ! spam = [] ! 
unknown = [] do_filter = usedb = False for opt, arg in opts: *************** *** 270,276 **** usage(0) elif opt == '-g': ! good = arg elif opt == '-s': ! spam = arg elif opt == '-p': pck = arg --- 275,281 ---- usage(0) elif opt == '-g': ! good.append(arg) elif opt == '-s': ! spam.append(arg) elif opt == '-p': pck = arg *************** *** 280,284 **** do_filter = True elif opt == '-u': ! unknown = arg if args: usage(2, "Positional arguments not allowed") --- 285,289 ---- do_filter = True elif opt == '-u': ! unknown.append(arg) if args: usage(2, "Positional arguments not allowed") *************** *** 289,298 **** if good: ! print "Training ham:" ! train(bayes, good, False) save = True if spam: ! print "Training spam:" ! train(bayes, spam, True) save = True --- 294,305 ---- if good: ! for g in good: ! print "Training ham (%s):" % g ! train(bayes, g, False) save = True if spam: ! for s in spam: ! print "Training spam (%s):" % s ! train(bayes, s, True) save = True *************** *** 308,312 **** if unknown: ! score(bayes, unknown) if __name__ == "__main__": --- 315,322 ---- if unknown: ! for u in unknown: ! if len(unknown) > 1: ! print "Scoring", u ! score(bayes, u) if __name__ == "__main__": From npickett@users.sourceforge.net Fri Sep 27 20:40:27 2002 From: npickett@users.sourceforge.net (Neale Pickett) Date: Fri, 27 Sep 2002 12:40:27 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.22,1.23 hammiesrv.py,1.2,1.3 runtest.sh,1.3,1.4 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7026 Modified Files: hammie.py hammiesrv.py runtest.sh Log Message: * hammie.py now has a Hammie class, which hammiesrv now uses. hammie.py could still stand some more clean-up. Don't worry, I'm on it :) * runtest now has a run1 target to generate the first data Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** hammie.py 27 Sep 2002 18:48:05 -0000 1.22 --- hammie.py 27 Sep 2002 19:40:21 -0000 1.23 *************** *** 61,65 **** class DBDict: ! """Database Dictionary This wraps an anydbm to make it look even more like a dictionary. --- 61,66 ---- class DBDict: ! ! """Database Dictionary. This wraps an anydbm to make it look even more like a dictionary. *************** *** 136,140 **** class PersistentGrahamBayes(classifier.GrahamBayes): ! """A persistent GrahamBayes classifier This is just like classifier.GrahamBayes, except that the dictionary --- 137,142 ---- class PersistentGrahamBayes(classifier.GrahamBayes): ! ! """A persistent GrahamBayes classifier. This is just like classifier.GrahamBayes, except that the dictionary *************** *** 177,181 **** ! def train(bayes, msgs, is_spam): """Train bayes with all messages from a mailbox.""" mbox = mboxutils.getmbox(msgs) --- 179,303 ---- ! class Hammie: ! ! """A spambayes mail filter""" ! ! def __init__(self, bayes): ! self.bayes = bayes ! ! def _scoremsg(self, msg, evidence=False): ! """Score a Message. ! ! msg can be a string, a file object, or a Message object. ! ! Returns the probability the message is spam. If evidence is ! true, returns a tuple: (probability, clues), where clues is a ! list of the words which contributed to the score. ! ! """ ! ! return self.bayes.spamprob(tokenize(msg), evidence) ! ! def formatclues(self, clues, sep="; "): ! """Format the clues into something readable.""" ! ! 
return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) ! ! def score(self, msg, evidence=False): ! """Score (judge) a message. ! ! msg can be a string, a file object, or a Message object. ! ! Returns the probability the message is spam. If evidence is ! true, returns a tuple: (probability, clues), where clues is a ! list of the words which contributed to the score. ! ! """ ! ! try: ! return self._scoremsg(msg, evidence) ! except: ! print msg ! import traceback ! traceback.print_exc() ! ! def filter(self, msg, header=DISPHEADER, cutoff=SPAM_THRESHOLD): ! """Score (judge) a message and add a disposition header. ! ! msg can be a string, a file object, or a Message object. ! ! Optionally, set header to the name of the header to add, and/or ! cutoff to the probability value which must be met or exceeded ! for a message to get a 'Yes' disposition. ! ! Returns the same message with a new disposition header. ! ! """ ! ! if hasattr(msg, "readlines"): ! msg = email.message_from_file(msg) ! elif not hasattr(msg, "add_header"): ! msg = email.message_from_string(msg) ! prob, clues = self._scoremsg(msg, True) ! if prob < cutoff: ! disp = "No" ! else: ! disp = "Yes" ! disp += "; %.2f" % prob ! disp += "; " + self.formatclues(clues) ! msg.add_header(header, disp) ! return msg.as_string(unixfrom=(msg.get_unixfrom() is not None)) ! ! def train(self, msg, is_spam): ! """Train bayes with a message. ! ! msg can be a string, a file object, or a Message object. ! ! is_spam should be 1 if the message is spam, 0 if not. ! ! Probabilities are not updated after this call is made; to do ! that, call update_probabilities(). ! ! """ ! ! self.bayes.learn(tokenize(msg), is_spam, False) ! ! def train_ham(self, msg): ! """Train bayes with ham. ! ! msg can be a string, a file object, or a Message object. ! ! Probabilities are not updated after this call is made; to do ! that, call update_probabilities(). ! ! """ ! ! self.train(msg, False) ! ! def train_spam(self, msg): ! """Train bayes with spam. ! ! msg can be a string, a file object, or a Message object. ! ! Probabilities are not updated after this call is made; to do ! that, call update_probabilities(). ! ! """ ! ! self.train(msg, True) ! ! def update_probabilities(self): ! """Update probability values. ! ! You would want to call this after a training session. It's ! pretty slow, so if you have a lot of messages to train, wait ! until you're all done before calling this. ! ! """ ! ! self.bayes.update_probabilities() ! ! ! def train(hammie, msgs, is_spam): """Train bayes with all messages from a mailbox.""" mbox = mboxutils.getmbox(msgs) *************** *** 187,211 **** sys.stdout.write("\r%6d" % i) sys.stdout.flush() ! bayes.learn(tokenize(msg), is_spam, False) print ! def formatclues(clues, sep="; "): ! """Format the clues into something readable.""" ! return sep.join(["%r: %.2f" % (word, prob) for word, prob in clues]) ! ! def filter(bayes, input, output): ! """Filter (judge) a message""" ! msg = email.message_from_file(input) ! prob, clues = bayes.spamprob(tokenize(msg), True) ! if prob < SPAM_THRESHOLD: ! disp = "No" ! else: ! disp = "Yes" ! disp += "; %.2f" % prob ! disp += "; " + formatclues(clues) ! msg.add_header(DISPHEADER, disp) ! output.write(msg.as_string(unixfrom=(msg.get_unixfrom() is not None))) ! ! def score(bayes, msgs): """Score (judge) all messages from a mailbox.""" # XXX The reporting needs work! --- 309,316 ---- sys.stdout.write("\r%6d" % i) sys.stdout.flush() ! hammie.train(msg, is_spam) print ! 
def score(hammie, msgs): """Score (judge) all messages from a mailbox.""" # XXX The reporting needs work! *************** *** 215,219 **** for msg in mbox: i += 1 ! prob, clues = bayes.spamprob(tokenize(msg), True) isspam = prob >= SPAM_THRESHOLD if hasattr(msg, '_mh_msgno'): --- 320,324 ---- for msg in mbox: i += 1 ! prob, clues = hammie.score(msg, True) isspam = prob >= SPAM_THRESHOLD if hasattr(msg, '_mh_msgno'): *************** *** 224,228 **** spams += 1 print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."), ! print formatclues(clues) else: hams += 1 --- 329,333 ---- spams += 1 print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."), ! print hammie.formatclues(clues) else: hams += 1 *************** *** 292,309 **** bayes = createbayes(pck, usedb) ! if good: ! for g in good: ! print "Training ham (%s):" % g ! train(bayes, g, False) save = True ! if spam: ! for s in spam: ! print "Training spam (%s):" % s ! train(bayes, s, True) save = True if save: ! bayes.update_probabilities() if not usedb and pck: fp = open(pck, 'wb') --- 397,414 ---- bayes = createbayes(pck, usedb) + h = Hammie(bayes) ! for g in good: ! print "Training ham (%s):" % g ! train(h, g, False) save = True ! ! for s in spam: ! print "Training spam (%s):" % s ! train(h, s, True) save = True if save: ! h.update_probabilities() if not usedb and pck: fp = open(pck, 'wb') *************** *** 312,316 **** if do_filter: ! filter(bayes, sys.stdin, sys.stdout) if unknown: --- 417,423 ---- if do_filter: ! msg = sys.stdin.read() ! filtered = h.filter(msg) ! sys.stdout.write(filtered) if unknown: *************** *** 318,322 **** if len(unknown) > 1: print "Scoring", u ! score(bayes, u) if __name__ == "__main__": --- 425,429 ---- if len(unknown) > 1: print "Scoring", u ! score(h, u) if __name__ == "__main__": Index: hammiesrv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** hammiesrv.py 23 Sep 2002 21:20:10 -0000 1.2 --- hammiesrv.py 27 Sep 2002 19:40:22 -0000 1.3 *************** *** 3,139 **** # A server version of hammie.py - # Server code ! import SimpleXMLRPCServer ! import email ! import hammie ! from tokenizer import tokenize ! ! # Default header to add ! DFL_HEADER = "X-Hammie-Disposition" ! ! # Default spam cutoff ! DFL_CUTOFF = 0.9 ! ! class Hammie: ! def __init__(self, bayes): ! self.bayes = bayes ! def _scoremsg(self, msg, evidence=False): ! """Score an email.Message. ! Returns the probability the message is spam. If evidence is ! true, returns a tuple: (probability, clues), where clues is a ! list of the words which contributed to the score. ! """ ! return self.bayes.spamprob(tokenize(msg), evidence) ! def score(self, msg, evidence=False): ! """Score (judge) a message. ! Pass in a message as a string. ! Returns the probability the message is spam. If evidence is ! true, returns a tuple: (probability, clues), where clues is a ! list of the words which contributed to the score. """ ! return self._scoremsg(email.message_from_string(msg), evidence) ! ! def filter(self, msg, header=DFL_HEADER, cutoff=DFL_CUTOFF): ! """Score (judge) a message and add a disposition header. ! ! Pass in a message as a string. Optionally, set header to the ! name of the header to add, and/or cutoff to the probability ! value which must be met or exceeded for a message to get a 'Yes' ! disposition. ! ! Returns the same message with a new disposition header. ! ! """ ! 
msg = email.message_from_string(msg) ! prob, clues = self._scoremsg(msg, True) ! if prob < cutoff: ! disp = "No" else: ! disp = "Yes" ! disp += "; %.2f" % prob ! disp += "; " + hammie.formatclues(clues) ! msg.add_header(header, disp) ! return msg.as_string(unixfrom=(msg.get_unixfrom() is not None)) ! ! def train(self, msg, is_spam): ! """Train bayes with a message. ! ! msg should be the message as a string, and is_spam should be 1 ! if the message is spam, 0 if not. ! ! Probabilities are not updated after this call is made; to do ! that, call update_probabilities(). ! ! """ ! ! self.bayes.learn(tokenize(msg), is_spam, False) ! ! def train_ham(self, msg): ! """Train bayes with ham. ! ! msg should be the message as a string. ! ! Probabilities are not updated after this call is made; to do ! that, call update_probabilities(). ! ! """ ! ! self.train(msg, False) ! ! def train_spam(self, msg): ! """Train bayes with spam. ! ! msg should be the message as a string. ! ! Probabilities are not updated after this call is made; to do ! that, call update_probabilities(). ! ! """ ! self.train(msg, True) ! def update_probabilities(self): ! """Update probability values. - You would want to call this after a training session. It's - pretty slow, so if you have a lot of messages to train, wait - until you're all done before calling this. ! """ ! self.bayes.update_probabilites() ! def main(): ! usedb = True ! pck = "/home/neale/lib/hammie.db" ! if usedb: ! bayes = hammie.PersistentGrahamBayes(pck) ! else: ! bayes = None ! try: ! fp = open(pck, 'rb') ! except IOError, e: ! if e.errno <> errno.ENOENT: raise ! else: ! bayes = pickle.load(fp) ! fp.close() ! if bayes is None: ! import classifier ! bayes = classifier.GrahamBayes() ! server = SimpleXMLRPCServer.SimpleXMLRPCServer(("localhost", 7732)) ! server.register_instance(Hammie(bayes)) server.serve_forever() --- 3,121 ---- # A server version of hammie.py ! """Usage: %(program)s [options] IP:PORT ! Where: ! -h ! show usage and exit ! -p FILE ! use file as the persistent store. loads data from this file if it ! exists, and saves data to this file at the end. Default: %(DEFAULTDB)s ! -d ! use the DBM store instead of cPickle. The file is larger and ! creating it is slower, but checking against it is much faster, ! especially for large word databases. ! IP ! IP address to bind (use 0.0.0.0 to listen on all IPs of this machine) ! PORT ! Port number to listen to. ! """ ! import SimpleXMLRPCServer ! import getopt ! import sys ! import traceback ! import xmlrpclib ! import hammie ! program = sys.argv[0] # For usage(); referenced by docstring above ! # Default DB path ! DEFAULTDB = hammie.DEFAULTDB ! class HammieHandler(SimpleXMLRPCServer.SimpleXMLRPCRequestHandler): ! def do_POST(self): ! """Handles the HTTP POST request. ! Attempts to interpret all HTTP POST requests as XML-RPC calls, ! which are forwarded to the _dispatch method for handling. + This one also prints out tracebacks, to help me debug :) """ ! try: ! # get arguments ! data = self.rfile.read(int(self.headers["content-length"])) ! params, method = xmlrpclib.loads(data) ! # generate response ! try: ! response = self._dispatch(method, params) ! # wrap response in a singleton tuple ! response = (response,) ! except: ! # report exception back to server ! response = xmlrpclib.dumps( ! xmlrpclib.Fault(1, "%s:%s" % (sys.exc_type, sys.exc_value)) ! ) ! else: ! response = xmlrpclib.dumps(response, methodresponse=1) ! except: ! # internal error, report as HTTP server error ! traceback.print_exc() ! print `data` ! 
self.send_response(500) ! self.end_headers() else: ! # got a valid XML RPC response ! self.send_response(200) ! self.send_header("Content-type", "text/xml") ! self.send_header("Content-length", str(len(response))) ! self.end_headers() ! self.wfile.write(response) ! # shut down the connection ! self.wfile.flush() ! self.connection.shutdown(1) ! ! def usage(code, msg=''): ! """Print usage message and sys.exit(code).""" ! if msg: ! print >> sys.stderr, msg ! print >> sys.stderr ! print >> sys.stderr, __doc__ % globals() ! sys.exit(code) ! def main(): ! """Main program; parse options and go.""" ! try: ! opts, args = getopt.getopt(sys.argv[1:], 'hdp:') ! except getopt.error, msg: ! usage(2, msg) ! pck = DEFAULTDB ! usedb = False ! for opt, arg in opts: ! if opt == '-h': ! usage(0) ! elif opt == '-p': ! pck = arg ! elif opt == "-d": ! usedb = True ! if len(args) != 1: ! usage(2, "IP:PORT not specified") ! ip, port = args[0].split(":") ! port = int(port) ! ! bayes = hammie.createbayes(pck, usedb) ! h = hammie.Hammie(bayes) ! server = SimpleXMLRPCServer.SimpleXMLRPCServer((ip, port), HammieHandler) ! server.register_instance(h) server.serve_forever() Index: runtest.sh =================================================================== RCS file: /cvsroot/spambayes/spambayes/runtest.sh,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** runtest.sh 19 Sep 2002 00:17:41 -0000 1.3 --- runtest.sh 27 Sep 2002 19:40:22 -0000 1.4 *************** *** 40,43 **** --- 40,46 ---- case "$TEST" in + run1) + python timcv.py -n $SETS > run1.txt + ;; run2|useold) python timcv.py -n $SETS > run2.txt From tim_one@users.sourceforge.net Fri Sep 27 22:04:08 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 27 Sep 2002 14:04:08 -0700 Subject: [Spambayes-checkins] spambayes HistToGNU.py,1.5,1.6 TestDriver.py,1.15,1.16 hammie.py,1.23,1.24 hammiesrv.py,1.3,1.4 setup.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv3485 Modified Files: HistToGNU.py TestDriver.py hammie.py hammiesrv.py setup.py Log Message: Whitespace normalization, prior to tagging. Index: HistToGNU.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/HistToGNU.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** HistToGNU.py 26 Sep 2002 08:47:29 -0000 1.5 --- HistToGNU.py 27 Sep 2002 21:04:05 -0000 1.6 *************** *** 62,66 **** cmd.write('pause 100\n') print cmd.getvalue() ! def main(): import getopt --- 62,66 ---- cmd.write('pause 100\n') print cmd.getvalue() ! def main(): import getopt *************** *** 77,81 **** fname = "%s_%shist.pik" % (options.pickle_basename, f) args.append(fname) ! if args: plot(args) --- 77,81 ---- fname = "%s_%shist.pik" % (options.pickle_basename, f) args.append(fname) ! if args: plot(args) Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** TestDriver.py 27 Sep 2002 08:36:03 -0000 1.15 --- TestDriver.py 27 Sep 2002 21:04:06 -0000 1.16 *************** *** 209,213 **** if options.save_histogram_pickles: ! for f, h in (('ham', self.global_ham_hist), ('spam', self.global_spam_hist)): fname = "%s_%shist.pik" % (options.pickle_basename, f) --- 209,213 ---- if options.save_histogram_pickles: ! 
for f, h in (('ham', self.global_ham_hist), ('spam', self.global_spam_hist)): fname = "%s_%shist.pik" % (options.pickle_basename, f) Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** hammie.py 27 Sep 2002 19:40:21 -0000 1.23 --- hammie.py 27 Sep 2002 21:04:06 -0000 1.24 *************** *** 182,186 **** """A spambayes mail filter""" ! def __init__(self, bayes): self.bayes = bayes --- 182,186 ---- """A spambayes mail filter""" ! def __init__(self, bayes): self.bayes = bayes *************** *** 198,202 **** return self.bayes.spamprob(tokenize(msg), evidence) ! def formatclues(self, clues, sep="; "): """Format the clues into something readable.""" --- 198,202 ---- return self.bayes.spamprob(tokenize(msg), evidence) ! def formatclues(self, clues, sep="; "): """Format the clues into something readable.""" *************** *** 230,234 **** cutoff to the probability value which must be met or exceeded for a message to get a 'Yes' disposition. ! Returns the same message with a new disposition header. --- 230,234 ---- cutoff to the probability value which must be met or exceeded for a message to get a 'Yes' disposition. ! Returns the same message with a new disposition header. *************** *** 258,264 **** Probabilities are not updated after this call is made; to do that, call update_probabilities(). ! """ ! self.bayes.learn(tokenize(msg), is_spam, False) --- 258,264 ---- Probabilities are not updated after this call is made; to do that, call update_probabilities(). ! """ ! self.bayes.learn(tokenize(msg), is_spam, False) *************** *** 295,301 **** """ ! self.bayes.update_probabilities() ! def train(hammie, msgs, is_spam): --- 295,301 ---- """ ! self.bayes.update_probabilities() ! def train(hammie, msgs, is_spam): Index: hammiesrv.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammiesrv.py,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** hammiesrv.py 27 Sep 2002 19:40:22 -0000 1.3 --- hammiesrv.py 27 Sep 2002 21:04:06 -0000 1.4 *************** *** 79,83 **** self.wfile.flush() self.connection.shutdown(1) ! def usage(code, msg=''): --- 79,83 ---- self.wfile.flush() self.connection.shutdown(1) ! def usage(code, msg=''): *************** *** 112,116 **** ip, port = args[0].split(":") port = int(port) ! bayes = hammie.createbayes(pck, usedb) h = hammie.Hammie(bayes) --- 112,116 ---- ip, port = args[0].split(":") port = int(port) ! bayes = hammie.createbayes(pck, usedb) h = hammie.Hammie(bayes) Index: setup.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/setup.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** setup.py 24 Sep 2002 18:07:17 -0000 1.5 --- setup.py 27 Sep 2002 21:04:06 -0000 1.6 *************** *** 2,6 **** setup( ! name='spambayes', scripts=['unheader.py', 'hammie.py', --- 2,6 ---- setup( ! 
name='spambayes', scripts=['unheader.py', 'hammie.py', From tim_one@users.sourceforge.net Fri Sep 27 22:18:20 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 27 Sep 2002 14:18:20 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.16,1.17 Tester.py,1.4,1.5 classifier.py,1.20,1.21 hammie.py,1.24,1.25 neiltrain.py,1.2,1.3 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8335 Modified Files: TestDriver.py Tester.py classifier.py hammie.py neiltrain.py Log Message: Renamed class GrahamBayes to Bayes. hammie.py may wish to rename its derived class similarly. Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** TestDriver.py 27 Sep 2002 21:04:06 -0000 1.16 --- TestDriver.py 27 Sep 2002 21:18:18 -0000 1.17 *************** *** 161,165 **** def new_classifier(self): ! c = self.classifier = classifier.GrahamBayes() self.tester = Tester.Test(c) self.trained_ham_hist = Hist(options.nbuckets) --- 161,165 ---- def new_classifier(self): ! c = self.classifier = classifier.Bayes() self.tester = Tester.Test(c) self.trained_ham_hist = Hist(options.nbuckets) Index: Tester.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Tester.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** Tester.py 19 Sep 2002 06:30:15 -0000 1.4 --- Tester.py 27 Sep 2002 21:18:18 -0000 1.5 *************** *** 2,6 **** class Test: ! # Pass a classifier instance (an instance of GrahamBayes). # Loop: # # Train the classifer with new ham and spam. --- 2,6 ---- class Test: ! # Pass a classifier instance (an instance of Bayes). # Loop: # # Train the classifer with new ham and spam. *************** *** 128,132 **** _easy_test = """ ! >>> from classifier import GrahamBayes >>> good1 = _Example('', ['a', 'b', 'c'] * 10) --- 128,132 ---- _easy_test = """ ! >>> from classifier import Bayes >>> good1 = _Example('', ['a', 'b', 'c'] * 10) *************** *** 134,138 **** >>> bad1 = _Example('', ['d'] * 10) ! >>> t = Test(GrahamBayes()) >>> t.train([good1, good2], [bad1]) >>> t.predict([_Example('goodham', ['a', 'b']), --- 134,138 ---- >>> bad1 = _Example('', ['d'] * 10) ! >>> t = Test(Bayes()) >>> t.train([good1, good2], [bad1]) >>> t.predict([_Example('goodham', ['a', 'b']), Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** classifier.py 24 Sep 2002 22:14:01 -0000 1.20 --- classifier.py 27 Sep 2002 21:18:18 -0000 1.21 *************** *** 217,221 **** self.spamprob) = t ! class GrahamBayes(object): __slots__ = ('wordinfo', # map word to WordInfo record 'nspam', # number of spam messages learn() has seen --- 217,221 ---- self.spamprob) = t ! class Bayes(object): __slots__ = ('wordinfo', # map word to WordInfo record 'nspam', # number of spam messages learn() has seen Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** hammie.py 27 Sep 2002 21:04:06 -0000 1.24 --- hammie.py 27 Sep 2002 21:18:18 -0000 1.25 *************** *** 136,144 **** !
"""A persistent GrahamBayes classifier. ! This is just like classifier.GrahamBayes, except that the dictionary is a database. You take less disk this way, I think, and you can pretend it's persistent. It's much slower training, but much faster --- 136,144 ---- ! class PersistentGrahamBayes(classifier.Bayes): ! """A persistent Bayes classifier. ! This is just like classifier.Bayes, except that the dictionary is a database. You take less disk this way, I think, and you can pretend it's persistent. It's much slower training, but much faster *************** *** 161,165 **** def __init__(self, dbname): ! classifier.GrahamBayes.__init__(self) self.statekey = "saved state" self.wordinfo = DBDict(dbname, (self.statekey,)) --- 161,165 ---- def __init__(self, dbname): ! classifier.Bayes.__init__(self) self.statekey = "saved state" self.wordinfo = DBDict(dbname, (self.statekey,)) *************** *** 335,339 **** def createbayes(pck=DEFAULTDB, usedb=False): ! """Create a GrahamBayes instance for the given pickle (which doesn't have to exist). Create a PersistentGrahamBayes if usedb is True.""" --- 335,339 ---- def createbayes(pck=DEFAULTDB, usedb=False): ! """Create a Bayes instance for the given pickle (which doesn't have to exist). Create a PersistentGrahamBayes if usedb is True.""" *************** *** 350,354 **** fp.close() if bayes is None: ! bayes = classifier.GrahamBayes() return bayes --- 350,354 ---- fp.close() if bayes is None: ! bayes = classifier.Bayes() return bayes Index: neiltrain.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/neiltrain.py,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** neiltrain.py 20 Sep 2002 19:32:26 -0000 1.2 --- neiltrain.py 27 Sep 2002 21:18:18 -0000 1.3 *************** *** 39,43 **** ham_name = sys.argv[2] db_name = sys.argv[3] ! bayes = classifier.GrahamBayes() print 'Training with spam...' train(bayes, spam_name, True) --- 39,43 ---- ham_name = sys.argv[2] db_name = sys.argv[3] ! bayes = classifier.Bayes() print 'Training with spam...' train(bayes, spam_name, True) From tim_one@users.sourceforge.net Fri Sep 27 23:29:58 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 27 Sep 2002 15:29:58 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.34,1.35 classifier.py,1.21,1.22 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv29156 Modified Files: Options.py classifier.py Log Message: Gary's "f(w)" scheme is now the default, and code unique to the Graham scheme has gone away (but was tagged with Last-Graham). These options have vanished: hambias spambias min_spamprob max_spamprob unknown_word_spamprob use_robinson_combining use_robinson_probability use_robinson_ranking These options have changed default value: robinson_probability_a: 0.225 (was 1.0) robinson_minimum_prob_strength: 0.1 (was 0.0) max_discriminators: 150 (was 16) spam_cutoff: 0.570 (was 0.90) # THIS IS CORPUS-DEPENDENT! In addition, I did a little long-overdue refactoring of the classifier internals. The visible interface hasn't changed. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** Options.py 27 Sep 2002 04:02:59 -0000 1.34 --- Options.py 27 Sep 2002 22:29:56 -0000 1.35 *************** *** 100,110 **** # A message is considered spam iff it scores greater than spam_cutoff. ! 
# If using Graham's combining scheme, 0.90 seems to work best for "small" ! # training sets. As the size of the training sets increase, there's not ! # yet any bound in sight for how low this can go (0.075 would work as ! # well as 0.90 on Tim's large c.l.py data). ! # For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked ! # best in all reports so far. ! spam_cutoff: 0.90 # Number of buckets in histograms. --- 100,106 ---- # A message is considered spam iff it scores greater than spam_cutoff. ! # This is corpus-dependent, and values into the .600's have been known ! # to work best on some data. ! spam_cutoff: 0.570 # Number of buckets in histograms. *************** *** 174,219 **** [Classifier] ! # Fiddling these can have extreme effects. See classifier.py for comments. ! hambias: 2.0 ! spambias: 1.0 ! ! min_spamprob: 0.01 ! max_spamprob: 0.99 ! unknown_spamprob: 0.5 ! ! max_discriminators: 16 ! ! ########################################################################### ! # Speculative options for Gary Robinson's ideas. These may go away, or ! # a bunch of incompatible stuff above may go away. ! ! # Use Gary's scheme for combining probabilities. ! use_robinson_combining: False ! # Use Gary's scheme for computing probabilities, along with its "a" and ! # "x" parameters. ! use_robinson_probability: False ! robinson_probability_a: 1.0 robinson_probability_x: 0.5 - # Use Gary's scheme for ranking probabilities. - use_robinson_ranking: False - # When scoring a message, ignore all words with # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength. ! # By default (0.0), nothing is ignored. ! # Tim got a pretty clear improvement in f-n rate on his hasn't-improved-in- ! # a-long-time large c.l.py test by using 0.1. No other values have been ! # tried yet. ! # Neil Schemenauer also reported good results from 0.1, making the all- ! # Robinson scheme match the all-default Graham-like scheme on a smaller ! # and different corpus. ! # NOTE: Changing this may change the best spam_cutoff value for your ! # corpus. Since one effect is to separate the means more, you'll probably ! # want a higher spam_cutoff. ! robinson_minimum_prob_strength: 0.0 ########################################################################### ! # More speculative options for Gary Robinson's central-limit. These may go # away, or a bunch of incompatible stuff above may go away. --- 170,204 ---- [Classifier] ! # The maximum number of extreme words to look at in a msg, where "extreme" ! # means with spamprob farthest away from 0.5. 150 appears to work well ! # across all corpora tested. ! max_discriminators: 150 ! # These two control the prior assumption about word probabilities. ! # "x" is essentially the probability given to a word that's never been ! # seen before. Nobody has reported an improvement via moving it away ! # from 1/2. ! # "a" adjusts how much weight to give the prior assumption relative to ! # the probabilities estimated by counting. At a=0, the counting estimates ! # are believed 100%, even to the extent of assigning certainty (0 or 1) ! # to a word that's appeared in only ham or only spam. This is a disaster. ! # As "a" tends toward infintity, all probabilities tend toward "x". All ! # reports were that a value near 0.2 worked best, so this doesn't seem to ! # be corpus-dependent. ! # XXX Gary Robinson has since renamed "a" to "s", and redone his formulas ! # XXX to make it a measure of belief strength rather than "a number" from ! # XXX 0 to infinity. We haven't caught up to that yet. ! 
robinson_probability_a: 0.225 robinson_probability_x: 0.5 # When scoring a message, ignore all words with # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength. ! # This may be a hack, but it has proved to reduce error rates in many ! # tests over Robinson's base scheme. 0.1 appeared to work well across ! # all corpora. ! robinson_minimum_prob_strength: 0.1 ########################################################################### ! # Speculative options for Gary Robinson's central-limit ideas. These may go # away, or a bunch of incompatible stuff above may go away. *************** *** 268,282 **** 'best_cutoff_fp_weight': float_cracker, }, ! 'Classifier': {'hambias': float_cracker, ! 'spambias': float_cracker, ! 'min_spamprob': float_cracker, ! 'max_spamprob': float_cracker, ! 'unknown_spamprob': float_cracker, ! 'max_discriminators': int_cracker, ! 'use_robinson_combining': boolean_cracker, ! 'use_robinson_probability': boolean_cracker, 'robinson_probability_a': float_cracker, 'robinson_probability_x': float_cracker, - 'use_robinson_ranking': boolean_cracker, 'robinson_minimum_prob_strength': float_cracker, --- 253,259 ---- 'best_cutoff_fp_weight': float_cracker, }, ! 'Classifier': {'max_discriminators': int_cracker, 'robinson_probability_a': float_cracker, 'robinson_probability_x': float_cracker, 'robinson_minimum_prob_strength': float_cracker, Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** classifier.py 27 Sep 2002 21:18:18 -0000 1.21 --- classifier.py 27 Sep 2002 22:29:56 -0000 1.22 *************** *** 1,178 **** ! # This is an implementation of the Bayes-like spam classifier sketched ! # by Paul Graham at <http://www.paulgraham.com/spam.html>. We say ! # "Bayes-like" because there are many ad hoc deviations from a ! # "normal" Bayesian classifier. ! # ! # This implementation is due to Tim Peters et alia. ! ! import time ! from heapq import heapreplace ! from sets import Set ! ! from Options import options ! ! # The count of each word in ham is artificially boosted by a factor of ! # HAMBIAS, and similarly for SPAMBIAS. Graham uses 2.0 and 1.0. Final ! # results are very sensitive to the HAMBIAS value. On my 5x5 c.l.py ! # test grid with 20,000 hams and 13,750 spams split into 5 pairs, then ! # across all 20 test runs (for each pair, training on that pair then scoring ! # against the other 4 pairs), and counting up all the unique msgs ever ! # identified as false negative or positive, then compared to HAMBIAS 2.0, ! # ! # At HAMBIAS 1.0 ! # total unique false positives goes up by a factor of 7.6 ( 23 -> 174) ! # total unique false negatives goes down by a factor of 2 (337 -> 166) ! # ! # At HAMBIAS 3.0 ! # total unique false positives goes down by a factor of 4.6 ( 23 -> 5) ! # total unique false negatives goes up by a factor of 2.1 (337 -> 702) ! ! HAMBIAS = options.hambias # 2.0 ! SPAMBIAS = options.spambias # 1.0 ! ! # "And then there is the question of what probability to assign to words ! # that occur in one corpus but not the other. Again by trial and error I ! # chose .01 and .99.". However, the code snippet clamps *all* probabilities ! # into this range. That's good in principle (IMO), because no finite amount ! # of training data is good enough to justify probabilities of 0 or 1. It ! # may justify probabilities outside this range, though. ! MIN_SPAMPROB = options.min_spamprob # 0.01 ! MAX_SPAMPROB = options.max_spamprob # 0.99 ! !
# The spam probability assigned to words never seen before. Graham used ! # 0.2 here. Neil Schemenauer reported that 0.5 seemed to work better. In ! # Tim's content-only tests (no headers), boosting to 0.5 cut the false ! # negative rate by over 1/3. The f-p rate increased, but there were so few ! # f-ps that the increase wasn't statistically significant. It also caught ! # 13 more spams erroneously classified as ham. By eyeball (and common ! # sense ), this has most effect on very short messages, where there ! # simply aren't many high-value words. A word with prob 0.5 is (in effect) ! # completely ignored by spamprob(), in favor of *any* word with *any* prob ! # differing from 0.5. At 0.2, an unknown word favors ham at the expense ! # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious ! # on the face of it. ! UNKNOWN_SPAMPROB = options.unknown_spamprob # 0.5 ! ! # "I only consider words that occur more than five times in total". ! # But the code snippet considers words that appear at least five times. ! # This implementation follows the code rather than the explanation. ! # (In addition, the count compared is after multiplying it with the ! # appropriate bias factor.) ! # ! # Twist: Graham used MINCOUNT=5.0 here. I got rid of it: in effect, ! # given HAMBIAS=2.0, it meant we ignored a possibly perfectly good piece ! # of spam evidence unless it appeared at least 5 times, and ditto for ! # ham evidence unless it appeared at least 3 times. That certainly does ! # bias in favor of ham, but multiple distortions in favor of ham are ! # multiple ways to get confused and trip up. Here are the test results ! # before and after, MINCOUNT=5.0 on the left, no MINCOUNT on the right; ! # ham sets had 4000 msgs (so 0.025% is one msg), and spam sets 2750: ! # ! # false positive percentages ! # 0.000 0.000 tied ! # 0.000 0.000 tied ! # 0.100 0.050 won -50.00% ! # 0.000 0.025 lost +(was 0) ! # 0.025 0.075 lost +200.00% ! # 0.025 0.000 won -100.00% ! # 0.100 0.100 tied ! # 0.025 0.050 lost +100.00% ! # 0.025 0.025 tied ! # 0.050 0.025 won -50.00% ! # 0.100 0.050 won -50.00% ! # 0.025 0.050 lost +100.00% ! # 0.025 0.050 lost +100.00% ! # 0.025 0.000 won -100.00% ! # 0.025 0.000 won -100.00% ! # 0.025 0.075 lost +200.00% ! # 0.025 0.025 tied ! # 0.000 0.000 tied ! # 0.025 0.025 tied ! # 0.100 0.050 won -50.00% # ! # won 7 times ! # tied 7 times ! # lost 6 times # ! # total unique fp went from 9 to 13 # ! # false negative percentages ! # 0.364 0.327 won -10.16% ! # 0.400 0.400 tied ! # 0.400 0.327 won -18.25% ! # 0.909 0.691 won -23.98% ! # 0.836 0.545 won -34.81% ! # 0.618 0.291 won -52.91% ! # 0.291 0.218 won -25.09% ! # 1.018 0.654 won -35.76% ! # 0.982 0.364 won -62.93% ! # 0.727 0.291 won -59.97% ! # 0.800 0.327 won -59.13% ! # 1.163 0.691 won -40.58% ! # 0.764 0.582 won -23.82% ! # 0.473 0.291 won -38.48% ! # 0.473 0.364 won -23.04% ! # 0.727 0.436 won -40.03% ! # 0.655 0.436 won -33.44% ! # 0.509 0.218 won -57.17% ! # 0.545 0.291 won -46.61% ! # 0.509 0.254 won -50.10% # ! # won 19 times ! # tied 1 times ! # lost 0 times # ! # total unique fn went from 168 to 106 # ! # So dropping MINCOUNT was a huge win for the f-n rate, and a mixed bag ! # for the f-p rate (but the f-p rate was so low compared to 4000 msgs that ! # even the losses were barely significant). In addition, dropping MINCOUNT ! # had a larger good effect when using random training subsets of size 500; ! # this makes intuitive sense, as with less training data it was harder to ! # exceed the MINCOUNT threshold. # ! 
# Still, MINCOUNT seemed to be a gross approximation to *something* valuable: ! # a strong clue appearing in 1,000 training msgs is certainly more trustworthy ! # than an equally strong clue appearing in only 1 msg. I'm almost certain it ! # would pay to develop a way to take that into account when scoring. In ! # particular, there was a very specific new class of false positives ! # introduced by dropping MINCOUNT: some c.l.py msgs consisting mostly of ! # Spanish or French. The "high probability" spam clues were innocuous ! # words like "puedo" and "como", that appeared in very rare Spanish and ! # French spam too. There has to be a more principled way to address this ! # than the MINCOUNT hammer, and the test results clearly showed that MINCOUNT ! # did more harm than good overall. ! # The maximum number of words spamprob() pays attention to. Graham had 15 ! # here. If there are 8 indicators with spam probabilities near 1, and 7 ! # near 0, the math is such that the combined result is near 1. Making this ! # even gets away from that oddity (8 of each allows for graceful ties, ! # which favor ham). ! # ! # XXX That should be revisited. Stripping HTML tags from plain text msgs ! # XXX later addressed some of the same problem cases. The best value for ! # XXX MAX_DISCRIMINATORS remains unknown, but increasing it a lot is known ! # XXX to hurt. ! # XXX Later: tests after cutting this back to 15 showed no effect on the ! # XXX f-p rate, and a tiny shift in the f-n rate (won 3 times, tied 8 times, ! # XXX lost 9 times). There isn't a significant difference, so leaving it ! # XXX at 16. ! # ! # A twist: When staring at failures, it wasn't unusual to see the top ! # discriminators *all* have values of MIN_SPAMPROB and MAX_SPAMPROB. The ! # math is such that one MIN_SPAMPROB exactly cancels out one MAX_SPAMPROB, ! # yielding no info at all. Then whichever flavor of clue happened to reach ! # MAX_DISCRIMINATORS//2 + 1 occurrences first determined the final outcome, ! # based on almost no real evidence. ! # ! # So spamprob() was changed to save lists of *all* MIN_SPAMPROB and ! # MAX_SPAMPROB clues. If the number of those are equal, they're all ignored. ! # Else the flavor with the smaller number of instances "cancels out" the ! # same number of instances of the other flavor, and the remaining instances ! # of the other flavor are fed into the probability computation. This change ! # was a pure win, lowering the false negative rate consistently, and it even ! # managed to tickle a couple rare false positives into "not spam" terrority. ! MAX_DISCRIMINATORS = options.max_discriminators # 16 PICKLE_VERSION = 1 --- 1,36 ---- ! # An implementation of a Bayes-like spam classifier. # ! # Paul Graham's original description: # ! # http://www.paulgraham.com/spam.html # ! # A highly fiddled version of that can be retrieved from our CVS repository, ! # via tag Last-Graham. This made many demonstrated improvements in error ! # rates over Paul's original description. # ! # This code implements Gary Robinson's suggestions, which are well explained ! # on his webpage: # ! # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html # ! # This is theoretically cleaner, and in testing has performed at least as ! # well as our highly tuned Graham scheme did, often slightly better, and ! # sometimes much better. It also has "a middle ground", which people like: ! # the scores under Paul's scheme were almost always very near 0 or very near ! # 1, whether or not the classification was correct. 
The false positives ! # and false negatives under Gary's scheme generally score in a narrow range ! # around the corpus's best spam_cutoff value # ! # This implementation is due to Tim Peters et alia. + import time + from heapq import heapreplace + from sets import Set ! from Options import options ! ! # The maximum number of extreme words to look at in a msg, where "extreme" ! # means with spamprob farthest away from 0.5. ! MAX_DISCRIMINATORS = options.max_discriminators # 150 PICKLE_VERSION = 1 *************** *** 273,359 **** """ ! # A priority queue to remember the MAX_DISCRIMINATORS best ! # probabilities, where "best" means largest distance from 0.5. ! # The tuples are (distance, prob, word, wordinfo[word]). ! nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS ! smallest_best = -1.0 ! ! wordinfoget = self.wordinfo.get ! now = time.time() ! mins = [] # all words w/ prob MIN_SPAMPROB ! maxs = [] # all words w/ prob MAX_SPAMPROB ! # Counting a unique word multiple times hurts, although counting one ! # at most two times had some benefit whan UNKNOWN_SPAMPROB was 0.2. ! # When that got boosted to 0.5, counting more than once became ! # counterproductive. ! for word in Set(wordstream): ! record = wordinfoget(word) ! if record is None: ! prob = UNKNOWN_SPAMPROB ! else: ! record.atime = now ! prob = record.spamprob ! ! distance = abs(prob - 0.5) ! if prob == MIN_SPAMPROB: ! mins.append((distance, prob, word, record)) ! elif prob == MAX_SPAMPROB: ! maxs.append((distance, prob, word, record)) ! elif distance > smallest_best: ! # Subtle: we didn't use ">" instead of ">=" just to save ! # calls to heapreplace(). The real intent is that if ! # there are many equally strong indicators throughout the ! # message, we want to favor the ones that appear earliest: ! # it's expected that spam headers will often have smoking ! # guns, and, even when not, spam has to grab your attention ! # early (& note that when spammers generate large blocks of ! # random gibberish to throw off exact-match filters, it's ! # always at the end of the msg -- if they put it at the ! # start, *nobody* would read the msg). ! heapreplace(nbest, (distance, prob, word, record)) ! smallest_best = nbest[0][0] ! ! # Compute the probability. Note: This is what Graham's code did, ! # but it's dubious for reasons explained in great detail on Python- ! # Dev: it's missing P(spam) and P(not-spam) adjustments that ! # straightforward Bayesian analysis says should be here. It's ! # unclear how much it matters, though, as the omissions here seem ! # to tend in part to cancel out distortions introduced earlier by ! # HAMBIAS. Experiments will decide the issue. ! clues = [] ! # First cancel out competing extreme clues (see comment block at ! # MAX_DISCRIMINATORS declaration -- this is a twist on Graham). ! if mins or maxs: ! if len(mins) < len(maxs): ! shorter, longer = mins, maxs ! else: ! shorter, longer = maxs, mins ! tokeep = min(len(longer) - len(shorter), MAX_DISCRIMINATORS) ! # They're all good clues, but we're only going to feed the tokeep ! # initial clues from the longer list into the probability ! # computation. ! for dist, prob, word, record in shorter + longer[tokeep:]: ! record.killcount += 1 ! if evidence: ! clues.append((word, prob)) ! for x in longer[:tokeep]: ! heapreplace(nbest, x) ! prob_product = inverse_prob_product = 1.0 ! for distance, prob, word, record in nbest: ! if prob is None: # it's one of the dummies nbest started with ! 
continue if record is not None: # else wordinfo doesn't know about it record.killcount += 1 ! if evidence: ! clues.append((word, prob)) ! prob_product *= prob ! inverse_prob_product *= 1.0 - prob ! prob = prob_product / (prob_product + inverse_prob_product) if evidence: ! clues.sort(lambda a, b: cmp(a[1], b[1])) return prob, clues else: --- 131,184 ---- """ ! from math import frexp ! # This combination method is due to Gary Robinson; see ! # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html ! # The real P = this P times 2**Pexp. Likewise for Q. We're ! # simulating unbounded dynamic float range by hand. If this pans ! # out, *maybe* we should store logarithms in the database instead ! # and just add them here. But I like keeping raw counts in the ! # database (they're easy to understand, manipulate and combine), ! # and there's no evidence that this simulation is a significant ! # expense. ! P = Q = 1.0 ! Pexp = Qexp = 0 ! clues = self._getclues(wordstream) ! for prob, word, record in clues: if record is not None: # else wordinfo doesn't know about it record.killcount += 1 ! P *= 1.0 - prob ! Q *= prob ! if P < 1e-200: # move back into range ! P, e = frexp(P) ! Pexp += e ! if Q < 1e-200: # move back into range ! Q, e = frexp(Q) ! Qexp += e ! P, e = frexp(P) ! Pexp += e ! Q, e = frexp(Q) ! Qexp += e ! ! num_clues = len(clues) ! if num_clues: ! #P = 1.0 - P**(1./num_clues) ! #Q = 1.0 - Q**(1./num_clues) ! # ! # (x*2**e)**n = x**n * 2**(e*n) ! n = 1.0 / num_clues ! P = 1.0 - P**n * 2.0**(Pexp * n) ! Q = 1.0 - Q**n * 2.0**(Qexp * n) ! ! prob = (P-Q)/(P+Q) # in -1 .. 1 ! prob = 0.5 + prob/2 # shift to 0 .. 1 ! else: ! prob = 0.5 if evidence: ! clues.sort() ! clues = [(w, p) for p, w, r in clues] return prob, clues else: *************** *** 403,418 **** nham = float(self.nham or 1) nspam = float(self.nspam or 1) ! for word,record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). ! hamcount = min(HAMBIAS * record.hamcount, nham) ! spamcount = min(SPAMBIAS * record.spamcount, nspam) hamratio = hamcount / nham spamratio = spamcount / nspam prob = spamratio / (hamratio + spamratio) ! if prob < MIN_SPAMPROB: ! prob = MIN_SPAMPROB ! elif prob > MAX_SPAMPROB: ! prob = MAX_SPAMPROB if record.spamprob != prob: --- 228,257 ---- nham = float(self.nham or 1) nspam = float(self.nspam or 1) ! A = options.robinson_probability_a ! X = options.robinson_probability_x ! AoverX = A/X ! for word, record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). ! # This is the Graham calculation, but stripped of biases, and ! # stripped of clamping into 0.01 thru 0.99. The Bayesian ! # adjustment following keeps them in a sane range, and one ! # that naturally grows the more evidence there is to back up ! # a probability. ! hamcount = min(record.hamcount, nham) hamratio = hamcount / nham + + spamcount = min(record.spamcount, nspam) spamratio = spamcount / nspam prob = spamratio / (hamratio + spamratio) ! ! # Now do Robinson's Bayesian adjustment. ! # ! # a + (n * p(w)) ! # f(w) = --------------- ! # (a / x) + n ! ! n = hamcount + spamcount ! prob = (A + n * prob) / (AoverX + n) if record.spamprob != prob: *************** *** 481,487 **** pass - # XXX More stuff should be reworked to use this as a helper function. 
def _getclues(self, wordstream): mindist = options.robinson_minimum_prob_strength # A priority queue to remember the MAX_DISCRIMINATORS best --- 320,326 ---- pass def _getclues(self, wordstream): mindist = options.robinson_minimum_prob_strength + unknown = options.robinson_probability_x # A priority queue to remember the MAX_DISCRIMINATORS best *************** *** 496,504 **** record = wordinfoget(word) if record is None: ! prob = UNKNOWN_SPAMPROB else: record.atime = now prob = record.spamprob - distance = abs(prob - 0.5) if distance >= mindist and distance > smallest_best: --- 335,342 ---- record = wordinfoget(word) if record is None: ! prob = unknown else: record.atime = now prob = record.spamprob distance = abs(prob - 0.5) if distance >= mindist and distance > smallest_best: *************** *** 506,513 **** smallest_best = nbest[0][0] ! clues = [(prob, word, record) ! for distance, prob, word, record in nbest ! if prob is not None] ! return clues #************************************************************************ --- 344,349 ---- smallest_best = nbest[0][0] ! # Return (prob, word, record) for the non-dummies. ! return [t[1:] for t in nbest if t[1] is not None] #************************************************************************ *************** *** 518,664 **** # to only one of the alternatives surviving. - def robinson_spamprob(self, wordstream, evidence=False): - """Return best-guess probability that wordstream is spam. - - wordstream is an iterable object producing words. - The return value is a float in [0.0, 1.0]. - - If optional arg evidence is True, the return value is a pair - probability, evidence - where evidence is a list of (word, probability) pairs. - """ - - from math import frexp - mindist = options.robinson_minimum_prob_strength - - # A priority queue to remember the MAX_DISCRIMINATORS best - # probabilities, where "best" means largest distance from 0.5. - # The tuples are (distance, prob, word, wordinfo[word]). - nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS - smallest_best = -1.0 - - wordinfoget = self.wordinfo.get - now = time.time() - for word in Set(wordstream): - record = wordinfoget(word) - if record is None: - prob = UNKNOWN_SPAMPROB - else: - record.atime = now - prob = record.spamprob - - distance = abs(prob - 0.5) - if distance >= mindist and distance > smallest_best: - heapreplace(nbest, (distance, prob, word, record)) - smallest_best = nbest[0][0] - - # Compute the probability. - clues = [] - - # This combination method is due to Gary Robinson. - # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html - # In preliminary tests, it did just as well as Graham's scheme, - # but creates a definite "middle ground" around 0.5 where false - # negatives and false positives can actually found in non-trivial - # number. - - # The real P = this P times 2**Pexp. Likewise for Q. We're - # simulating unbounded dynamic float range by hand. If this pans - # out, *maybe* we should store logarithms in the database instead - # and just add them here. 
- P = Q = 1.0 - Pexp = Qexp = 0 - num_clues = 0 - for distance, prob, word, record in nbest: - if prob is None: # it's one of the dummies nbest started with - continue - if record is not None: # else wordinfo doesn't know about it - record.killcount += 1 - if evidence: - clues.append((word, prob)) - num_clues += 1 - P *= 1.0 - prob - Q *= prob - if P < 1e-200: # move back into range - P, e = frexp(P) - Pexp += e - if Q < 1e-200: # move back into range - Q, e = frexp(Q) - Qexp += e - - P, e = frexp(P) - Pexp += e - Q, e = frexp(Q) - Qexp += e - - if num_clues: - #P = 1.0 - P**(1./num_clues) - #Q = 1.0 - Q**(1./num_clues) - # - # (x*2**e)**n = x**n * 2**(e*n) - n = 1.0 / num_clues - P = 1.0 - P**n * 2.0**(Pexp * n) - Q = 1.0 - Q**n * 2.0**(Qexp * n) - - prob = (P-Q)/(P+Q) # in -1 .. 1 - prob = 0.5 + prob/2 # shift to 0 .. 1 - else: - prob = 0.5 - - if evidence: - clues.sort(lambda a, b: cmp(a[1], b[1])) - return prob, clues - else: - return prob - - if options.use_robinson_combining: - spamprob = robinson_spamprob - - def robinson_update_probabilities(self): - """Update the word probabilities in the spam database. - - This computes a new probability for every word in the database, - so can be expensive. learn() and unlearn() update the probabilities - each time by default. Thay have an optional argument that allows - to skip this step when feeding in many messages, and in that case - you should call update_probabilities() after feeding the last - message and before calling spamprob(). - """ - - nham = float(self.nham or 1) - nspam = float(self.nspam or 1) - A = options.robinson_probability_a - X = options.robinson_probability_x - AoverX = A/X - for word, record in self.wordinfo.iteritems(): - # Compute prob(msg is spam | msg contains word). - # This is the Graham calculation, but stripped of biases, and - # of clamping into 0.01 thru 0.99. - hamcount = min(record.hamcount, nham) - hamratio = hamcount / nham - - spamcount = min(record.spamcount, nspam) - spamratio = spamcount / nspam - - prob = spamratio / (hamratio + spamratio) - - # Now do Robinson's Bayesian adjustment. - # - # a + (n * p(w)) - # f(w) = --------------- - # (a / x) + n - - n = hamcount + spamcount - prob = (A + n * prob) / (AoverX + n) - - if record.spamprob != prob: - record.spamprob = prob - # The next seemingly pointless line appears to be a hack - # to allow a persistent db to realize the record has changed. 
- self.wordinfo[word] = record - - if options.use_robinson_probability: - update_probabilities = robinson_update_probabilities - def central_limit_compute_population_stats(self, msgstream, is_spam): from math import ldexp --- 354,357 ---- *************** *** 745,751 **** if options.use_central_limit: spamprob = central_limit_spamprob - - - def central_limit_compute_population_stats2(self, msgstream, is_spam): --- 438,441 ---- From montanaro@users.sourceforge.net Fri Sep 27 23:30:25 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Fri, 27 Sep 2002 15:30:25 -0700 Subject: [Spambayes-checkins] spambayes setup.py,1.6,1.7 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv29785 Modified Files: setup.py Log Message: add several new scripts and a couple new modules Index: setup.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/setup.py,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** setup.py 27 Sep 2002 21:04:06 -0000 1.6 --- setup.py 27 Sep 2002 22:30:23 -0000 1.7 *************** *** 5,8 **** --- 5,9 ---- scripts=['unheader.py', 'hammie.py', + 'hammiesrv.py', 'loosecksum.py', 'timtest.py', *************** *** 11,18 **** --- 12,25 ---- 'runtest.sh', 'rebal.py', + 'HistToGNU.py', + 'mboxcount.py', + 'mboxtest.py', + 'neiltrain.py', 'cmp.py', 'rates.py'], py_modules=['classifier', 'tokenizer', + 'hammie', + 'msgs', 'Options', 'Tester', From npickett@users.sourceforge.net Fri Sep 27 23:38:56 2002 From: npickett@users.sourceforge.net (Neale Pickett) Date: Fri, 27 Sep 2002 15:38:56 -0700 Subject: [Spambayes-checkins] spambayes hammie.py,1.25,1.26 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv32050 Modified Files: hammie.py Log Message: * PersistentGrahamBayes -> PersistentBayes, reflecting change in classifier naming. Index: hammie.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/hammie.py,v retrieving revision 1.25 retrieving revision 1.26 diff -C2 -d -r1.25 -r1.26 *** hammie.py 27 Sep 2002 21:18:18 -0000 1.25 --- hammie.py 27 Sep 2002 22:38:53 -0000 1.26 *************** *** 136,140 **** ! class PersistentGrahamBayes(classifier.Bayes): """A persistent Bayes classifier. --- 136,140 ---- ! class PersistentBayes(classifier.Bayes): """A persistent Bayes classifier. *************** *** 336,343 **** def createbayes(pck=DEFAULTDB, usedb=False): """Create a Bayes instance for the given pickle (which ! doesn't have to exist). Create a PersistentGrahamBayes if usedb is True.""" if usedb: ! bayes = PersistentGrahamBayes(pck) else: bayes = None --- 336,343 ---- def createbayes(pck=DEFAULTDB, usedb=False): """Create a Bayes instance for the given pickle (which ! doesn't have to exist). Create a PersistentBayes if usedb is True.""" if usedb: ! bayes = PersistentBayes(pck) else: bayes = None From tim_one@users.sourceforge.net Sat Sep 28 04:41:12 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 27 Sep 2002 20:41:12 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.35,1.36 classifier.py,1.22,1.23 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv6007 Modified Files: Options.py classifier.py Log Message: Gary Robinson changed the formula he uses to adjust the Graham probabilities since we first implemented it.
The new formula is identical to the old in what it computes, but it looks a little different and is easier to understand. As a result, robinson_probability_a no longer exists, and robinson_probability_s takes its place (the "s" is for "strength"). If you used non-default values of a and/or x before, x doesn't change, but you should set robinson_probability_s to robinson_probability_a / robinson_probability_x. For example, before this checkin, the defaults were a=0.225 and x=0.5. Now 'a' is gone, and s defaults to 0.225/0.5 = 0.45. Computed results are identical. Sorry for the hassle, but Gary's webpage does a very nice job of explaining this formula, and I really don't want to reword it all for this project -- keeping an obvious connection between our implementation and Gary's explanation is worth the disruption. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** Options.py 27 Sep 2002 22:29:56 -0000 1.35 --- Options.py 28 Sep 2002 03:41:10 -0000 1.36 *************** *** 179,194 **** # seen before. Nobody has reported an improvement via moving it away # from 1/2. ! # "a" adjusts how much weight to give the prior assumption relative to ! # the probabilities estimated by counting. At a=0, the counting estimates # are believed 100%, even to the extent of assigning certainty (0 or 1) # to a word that's appeared in only ham or only spam. This is a disaster. ! # As "a" tends toward infinity, all probabilities tend toward "x". All ! # reports were that a value near 0.2 worked best, so this doesn't seem to # be corpus-dependent. ! # XXX Gary Robinson has since renamed "a" to "s", and redone his formulas ! # XXX to make it a measure of belief strength rather than "a number" from ! # XXX 0 to infinity. We haven't caught up to that yet. ! robinson_probability_a: 0.225 robinson_probability_x: 0.5 # When scoring a message, ignore all words with --- 179,194 ---- # seen before. Nobody has reported an improvement via moving it away # from 1/2. ! # "s" adjusts how much weight to give the prior assumption relative to ! # the probabilities estimated by counting. At s=0, the counting estimates # are believed 100%, even to the extent of assigning certainty (0 or 1) # to a word that's appeared in only ham or only spam. This is a disaster. ! # As s tends toward infinity, all probabilities tend toward x. All ! # reports were that a value near 0.4 worked best, so this doesn't seem to # be corpus-dependent. ! # NOTE: Gary Robinson previously used a different formula involving 'a' ! # and 'x'. The 'x' here is the same as before. The 's' here is the old ! # 'a' divided by 'x'.
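[Editor's aside, not part of the checkin: a quick sanity check of the claim that computed results are identical under both parameterizations. The variable names below are just local stand-ins for the options being discussed.]

    # Old form: f(w) = (a + n*p(w)) / (a/x + n)
    # New form: f(w) = (s*x + n*p(w)) / (s + n), with s = a/x
    a, x = 0.225, 0.5              # the old defaults
    s = a / x                      # 0.45, the new default
    for n, p in [(0, 0.5), (1, 1.0), (3, 0.9), (10, 0.2)]:
        old = (a + n * p) / (a / x + n)
        new = (s * x + n * p) / (s + n)
        assert abs(old - new) < 1e-12   # same number either way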
robinson_probability_x: 0.5 + robinson_probability_s: 0.45 # When scoring a message, ignore all words with *************** *** 254,259 **** }, 'Classifier': {'max_discriminators': int_cracker, - 'robinson_probability_a': float_cracker, 'robinson_probability_x': float_cracker, 'robinson_minimum_prob_strength': float_cracker, --- 254,259 ---- }, 'Classifier': {'max_discriminators': int_cracker, 'robinson_probability_x': float_cracker, + 'robinson_probability_s': float_cracker, 'robinson_minimum_prob_strength': float_cracker, Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** classifier.py 27 Sep 2002 22:29:56 -0000 1.22 --- classifier.py 28 Sep 2002 03:41:10 -0000 1.23 *************** *** 228,234 **** nham = float(self.nham or 1) nspam = float(self.nspam or 1) ! A = options.robinson_probability_a ! X = options.robinson_probability_x ! AoverX = A/X for word, record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). --- 228,233 ---- nham = float(self.nham or 1) nspam = float(self.nspam or 1) ! S = options.robinson_probability_s ! StimesX = S * options.robinson_probability_x for word, record in self.wordinfo.iteritems(): # Compute prob(msg is spam | msg contains word). *************** *** 248,257 **** # Now do Robinson's Bayesian adjustment. # ! # a + (n * p(w)) ! # f(w) = --------------- ! # (a / x) + n n = hamcount + spamcount ! prob = (A + n * prob) / (AoverX + n) if record.spamprob != prob: --- 247,256 ---- # Now do Robinson's Bayesian adjustment. # ! # s*x + n*p(w) ! # f(w) = -------------- ! # s + n n = hamcount + spamcount ! prob = (StimesX + n * prob) / (S + n) if record.spamprob != prob: From tim_one@users.sourceforge.net Sat Sep 28 04:44:17 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Fri, 27 Sep 2002 20:44:17 -0700 Subject: [Spambayes-checkins] spambayes TestDriver.py,1.17,1.18 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv7256 Modified Files: TestDriver.py Log Message: Hist.display(): reduced the # of columns devoted to showing the bucket boundaries by 1, and added a column to the histogram proper. There are enough boundary columns remaining to distinguish 1000 buckets, and even I never use that many <wink>. Index: TestDriver.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** TestDriver.py 27 Sep 2002 21:18:18 -0000 1.17 --- TestDriver.py 28 Sep 2002 03:44:15 -0000 1.18 *************** *** 62,66 **** return self ! def display(self, WIDTH=60): from math import sqrt if self.n > 0: --- 62,66 ---- return self ! def display(self, WIDTH=61): from math import sqrt if self.n > 0: *************** *** 81,85 **** ndigits = len(str(biggest)) ! format = "%6.2f %" + str(ndigits) + "d" for i in range(len(self.buckets)): --- 81,85 ---- !
format = "%5.1f %" + str(ndigits) + "d" for i in range(len(self.buckets)): From tim_one@users.sourceforge.net Sat Sep 28 08:41:16 2002 From: tim_one@users.sourceforge.net (Tim Peters) Date: Sat, 28 Sep 2002 00:41:16 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.36,1.37 classifier.py,1.23,1.24 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv12459 Modified Files: Options.py classifier.py Log Message: New option [Classifier] count_duplicates_only_once_in_training: False Please try it on your data with True. Because it decreases both ham and spam mean scores, you'll probably need a smaller spam_cutoff value too. Various biases in the Graham scheme made this a loser there, but it may be better under the Robinson scheme. Something I haven't tried: a smaller value of robinson_probability_s *may* also help when this is enabled (then again, it may hurt too ...). Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.36 retrieving revision 1.37 diff -C2 -d -r1.36 -r1.37 *** Options.py 28 Sep 2002 03:41:10 -0000 1.36 --- Options.py 28 Sep 2002 07:41:13 -0000 1.37 *************** *** 199,202 **** --- 199,213 ---- robinson_minimum_prob_strength: 0.1 + # There's a strange asymmetry in the scheme, where multiple occurrences of + # a word in a msg are ignored during scoring, but all add to the spamcount + # (or hamcount) during training. This imbalance couldn't be altered without + # hurting results under the Graham scheme, but it may well be better to + # treat things the same way during training under the Robinson schems. Set + # this to true to try that. + # NOTE: In Tim's tests this decreased both the ham and spam mean scores, + # the former more than the latter. Therefore you'll probably want a smaller + # spam_cutoff value when this is enabled. + count_duplicates_only_once_in_training: False + ########################################################################### # Speculative options for Gary Robinson's central-limit ideas. These may go *************** *** 257,260 **** --- 268,272 ---- 'robinson_probability_s': float_cracker, 'robinson_minimum_prob_strength': float_cracker, + 'count_duplicates_only_once_in_training': boolean_cracker, 'use_central_limit': boolean_cracker, Index: classifier.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/classifier.py,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** classifier.py 28 Sep 2002 03:41:10 -0000 1.23 --- classifier.py 28 Sep 2002 07:41:13 -0000 1.24 *************** *** 282,285 **** --- 282,287 ---- wordinfoget = wordinfo.get now = time.time() + if options.count_duplicates_only_once_in_training: + wordstream = Set(wordstream) for word in wordstream: record = wordinfoget(word) *************** *** 304,307 **** --- 306,311 ---- wordinfoget = self.wordinfo.get + if options.count_duplicates_only_once_in_training: + wordstream = Set(wordstream) for word in wordstream: record = wordinfoget(word) From gvanrossum@users.sourceforge.net Sat Sep 28 15:39:13 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Sat, 28 Sep 2002 07:39:13 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.28,1.29 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv4475 Modified Files: README.txt Log Message: Clarify test data setup. 
Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** README.txt 25 Sep 2002 02:09:52 -0000 1.28 --- README.txt 28 Sep 2002 14:39:11 -0000 1.29 *************** *** 210,213 **** --- 210,217 ---- reservoir/ (contains "backup ham") + Every file at the deepest level is used (not just files with .txt + extensions). Every file should have a "Unix From" header before the + RFC-822 message (i.e. a line of the form "From <address> <date>
    "). + If you use the same names and structure, huge mounds of the tedious testing code will work as-is. The more Set directories the merrier, although you From nascheme@users.sourceforge.net Sat Sep 28 19:48:33 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Sat, 28 Sep 2002 11:48:33 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.37,1.38 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8755 Modified Files: Options.py Log Message: Remove mine_message_ids option since it shouldn't hurt to always have it enabled. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.37 retrieving revision 1.38 diff -C2 -d -r1.37 -r1.38 *** Options.py 28 Sep 2002 07:41:13 -0000 1.37 --- Options.py 28 Sep 2002 18:48:31 -0000 1.38 *************** *** 93,99 **** mine_received_headers: False - # If set, the Message-Id is broken down into, hopefully, useful evidence. - mine_message_ids: False - [TestDriver] # These control various displays in class TestDriver.Driver, and Tester.Test. --- 93,96 ---- *************** *** 238,242 **** 'count_all_header_lines': boolean_cracker, 'mine_received_headers': boolean_cracker, - 'mine_message_ids': boolean_cracker, 'check_octets': boolean_cracker, 'octet_prefix_size': int_cracker, --- 235,238 ---- From nascheme@users.sourceforge.net Sat Sep 28 19:48:54 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Sat, 28 Sep 2002 11:48:54 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.41,1.42 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8826 Modified Files: tokenizer.py Log Message: Remove mine_message_ids option since it shouldn't hurt to always have it enabled. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** tokenizer.py 27 Sep 2002 04:06:12 -0000 1.41 --- tokenizer.py 28 Sep 2002 18:48:52 -0000 1.42 *************** *** 984,996 **** yield 'received:' + tok ! if options.mine_message_ids: ! msgid = msg.get("message-id", "") ! m = message_id_re.match(msgid) ! if not m: ! # might be weird instead of invalid but who cares? ! yield 'message-id:invalid' ! else: ! # looks okay, return the hostname only ! yield 'message-id:@%s' % m.group(1) # As suggested by Anthony Baxter, merely counting the number of --- 984,997 ---- yield 'received:' + tok ! # Message-Id: This seems to be a small win and should no ! # adversely affect a mixed source corpus so it's always enabled. ! msgid = msg.get("message-id", "") ! m = message_id_re.match(msgid) ! if m: ! # looks okay, return the hostname ! yield 'message-id:@%s' % m.group(1) ! else: ! # might be weird instead of invalid but who cares? ! yield 'message-id:invalid' # As suggested by Anthony Baxter, merely counting the number of From gvanrossum@users.sourceforge.net Sat Sep 28 19:50:53 2002 From: gvanrossum@users.sourceforge.net (Guido van Rossum) Date: Sat, 28 Sep 2002 11:50:53 -0700 Subject: [Spambayes-checkins] spambayes README.txt,1.29,1.30 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv9407 Modified Files: README.txt Log Message: Clarify Unix From lines in tests messages -- they're optional. 
Index: README.txt =================================================================== RCS file: /cvsroot/spambayes/spambayes/README.txt,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** README.txt 28 Sep 2002 14:39:11 -0000 1.29 --- README.txt 28 Sep 2002 18:50:51 -0000 1.30 *************** *** 133,137 **** =================== cleanarch ! A script to repair mbox archives by finding "From" lines that should have been escaped, and escaping them. --- 133,137 ---- =================== cleanarch ! A script to repair mbox archives by finding "Unix From" lines that should have been escaped, and escaping them. *************** *** 211,216 **** Every file at the deepest level is used (not just files with .txt ! extensions). Every file should have a "Unix From" header before the ! RFC-822 message (i.e. a line of the form "From <address> <date>"). If you use the same names and structure, huge mounds of the tedious testing --- 211,217 ---- Every file at the deepest level is used (not just files with .txt ! extensions). The files may, but don't need to, have a "Unix From" ! header before the RFC-822 message (i.e. a line of the form "From ! <address> <date>
    "). If you use the same names and structure, huge mounds of the tedious testing From richiehindle@users.sourceforge.net Sat Sep 28 23:24:25 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Sat, 28 Sep 2002 15:24:25 -0700 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.4,1.5 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv3162 Modified Files: pop3proxy.py Log Message: Improved the timeout code to cope with long delays from the real POP3 server (having an ISP with dodgy POP3 servers is really helping to improve the robustness of pop3proxy.py - I should really add Demon Internet to the credits). Prevented the self-test code from printing the X-Hammie-Disposition headers, because under the ultra-simple test case they come out as No for both the test ham and the test spam. That doesn't matter because it's only their existence that's being tested for, but a casual observer might think something was broken. Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** pop3proxy.py 23 Sep 2002 21:20:10 -0000 1.4 --- pop3proxy.py 28 Sep 2002 22:24:22 -0000 1.5 *************** *** 114,118 **** return len(args) == 0 else: ! # Assume that unknown commands will get an error response. return False --- 114,119 ---- return len(args) == 0 else: ! # Assume that an unknown command will get a single-line ! # response. This should work for errors and for POP-AUTH. return False *************** *** 121,134 **** (response, isClosing, timedOut). isClosing is True if the server closes the socket, which tells found_terminator() to ! close when the response has been sent. timedOut is set if the ! request was still arriving after 30 seconds, and tells ! found_terminator() to proxy the remainder of the response. """ ! isClosing = False ! timedOut = False startTime = time.time() isMulti = self.isMultiline(command, args) ! responseLines = [] isFirstLine = True while True: line = self.serverFile.readline() --- 122,136 ---- (response, isClosing, timedOut). isClosing is True if the server closes the socket, which tells found_terminator() to ! close when the response has been sent. timedOut is set if a ! TOP or RETR request was still arriving after 30 seconds, and ! tells found_terminator() to proxy the remainder of the response. """ ! responseLines = [] startTime = time.time() isMulti = self.isMultiline(command, args) ! isClosing = False ! timedOut = False isFirstLine = True + seenAllHeaders = False while True: line = self.serverFile.readline() *************** *** 148,155 **** # A normal line - append it to the response and carry on. responseLines.append(line) ! # Time out after 30 seconds - found_terminator() knows how # to deal with this. ! if time.time() > startTime + 30: timedOut = True break --- 150,160 ---- # A normal line - append it to the response and carry on. responseLines.append(line) + seenAllHeaders = seenAllHeaders or line in ['\r\n', '\n'] ! # Time out after 30 seconds for message-retrieval commands ! # if all the headers are down - found_terminator() knows how # to deal with this. ! if command in ['TOP', 'RETR'] and \ ! 
seenAllHeaders and time.time() > startTime + 30: timedOut = True break *************** *** 544,548 **** response = proxy.recv(100) count, totalSize = map(int, response.split()[1:3]) - print "%d messages in test mailbox" % count assert count == 2 --- 549,552 ---- *************** *** 554,562 **** while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! headerOffset = response.find(hammie.DISPHEADER) ! assert headerOffset != -1 ! headerEnd = headerOffset + len(HEADER_EXAMPLE) ! header = response[headerOffset:headerEnd].strip() ! print "Message %d: %s" % (i, header) # Kill the proxy and the test server. --- 558,562 ---- while response.find('\n.\r\n') == -1: response = response + proxy.recv(1000) ! assert response.find(hammie.DISPHEADER) != -1 # Kill the proxy and the test server. From nascheme@users.sourceforge.net Sun Sep 29 05:14:39 2002 From: nascheme@users.sourceforge.net (Neil Schemenauer) Date: Sat, 28 Sep 2002 21:14:39 -0700 Subject: [Spambayes-checkins] spambayes tokenizer.py,1.42,1.43 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv11632 Modified Files: tokenizer.py Log Message: Mine the To and Cc headers. This is another definite win for me. I'm not sure about the log2 trick, but it seems to work okay. Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.42 retrieving revision 1.43 diff -C2 -d -r1.42 -r1.43 *** tokenizer.py 28 Sep 2002 18:48:52 -0000 1.42 --- tokenizer.py 29 Sep 2002 04:14:36 -0000 1.43 *************** *** 8,11 **** --- 8,12 ---- import email.Errors import re + import math from sets import Set *************** *** 771,774 **** --- 772,778 ---- yield '.'.join(parts[:i]) + def log2(n, log=math.log, c=math.log(2)): + return log(n)/c + uuencode_begin_re = re.compile(r""" ^begin \s+ *************** *** 963,966 **** --- 967,980 ---- for t in tokenize_word(w): yield prefix + t + + # To: + # Cc: + # Count the number of addresses in each of the recipient headers. + for field in ('to', 'cc'): + count = 0 + for addrs in msg.get_all(field, []): + count += len(addrs.split(',')) + if count > 0: + yield '%s:2**%d' % (field, round(log2(count))) # These headers seem to work best if they're not tokenized: just From tim.one@comcast.net Sun Sep 29 18:00:05 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 29 Sep 2002 13:00:05 -0400 Subject: [Spambayes-checkins] Checkin notification is hosed Message-ID: SourceForge apparently can't connect to python.org: Checking in rebal.py; /cvsroot/spambayes/spambayes/rebal.py,v <-- rebal.py new revision: 1.8; previous revision: 1.7 done Mailing spambayes-checkins@python.org... Generating notification message... Generating notification message... done. Mailing spambayes-checkins@python.org... Generating notification message... Traceback (innermost last): File "/cvsroot/spambayes/CVSROOT/syncmail", line 336, in ? main() File "/cvsroot/spambayes/CVSROOT/syncmail", line 329, in main blast_mail(subject, people, specs[1:], contextlines, fromhost) File "/cvsroot/spambayes/CVSROOT/syncmail", line 227, in blast_mail conn.connect(MAILHOST, MAILPORT) File "/usr/lib/python1.5/smtplib.py", line 216, in connect self.sock.connect(host, port) socket.error: (111, 'Connection refused') rebal.py now has a -d (dry run) option: If you specify -d, rebal will display how many files it's going to move, from where and to where, but won't actually move anything.
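[Editor's aside on Neil's To/Cc mining above: the log2 bucketing collapses raw recipient counts into coarse powers of two, so the classifier sees a few tokens like "to:2**3" rather than one token per exact count. A self-contained rerun of the idea, with a made-up message:]

    import email
    import math

    def log2(n):
        return math.log(n) / math.log(2)

    msg = email.message_from_string(
        "To: a@x.com, b@x.com, c@x.com\r\nCc: d@x.com\r\n\r\nbody\r\n")
    for field in ('to', 'cc'):
        count = 0
        for addrs in msg.get_all(field, []):
            count += len(addrs.split(','))
        if count > 0:
            token = '%s:2**%d' % (field, round(log2(count)))
            # -> 'to:2**2' (3 addresses round to 2**2) and 'cc:2**0'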
From tim.one@comcast.net Sun Sep 29 19:08:05 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 29 Sep 2002 14:08:05 -0400 Subject: [Spambayes-checkins] RE: [Spambayes] On counting words more than once In-Reply-To: <200209291437.g8TEbf809551@pcp02138704pcs.reston01.va.comcast.net> Message-ID: SF still isn't able to mail checkin notifications. Because Neil, Guido and I all reported improvement via counting duplicate words (within a message) only once during training, I removed the recent option for trying this, and we do this all the time now. The checkin comment is below. Note that you may need to change spam_cutoff! """ Removed option count_duplicates_only_once_in_training: this is always done now. Counting duplicate words in a msg more than once during training appears to have been helpful under the Graham scheme only because it acted to counteract other biases. Under Robinson's unbiased scheme, results improve by counting duplicates only once during training (just as duplicates are counted only once during scoring): the ham score mean decreases significantly and consistently, likewise ham score variance, the spam score mean decreases consistently (but less than the ham mean decreased, so the spread increases), and spam score variance increases. That implies there's *some* value to be gotten out of knowing how often a word appears in a msg, but that distorting spamprob isn't the right way to exploit it. WordInfo.hamcount now has a different meaning: it's the number of hams in which the word appears, instead of the number of times the word appears across all ham. Likewise for WordInfo.spamcount. Note that because both mean scores decreased, you'll probably want a smaller spam_cutoff value now. The default spam_cutoff has been changed from 0.57 to 0.56. But this is corpus-dependent, so be sure to tune your value for your corpus. """ From tim.one@comcast.net Sun Sep 29 21:34:58 2002 From: tim.one@comcast.net (Tim Peters) Date: Sun, 29 Sep 2002 16:34:58 -0400 Subject: [Spambayes-checkins] Another change In-Reply-To: Message-ID: Change checked in to tokenizer.py: tokenize_headers(): Based on a silly experiment that *only* tokenized Subject lines, added a gimmick here to generate tokens for runs of punctuation characters (\W+) in subject lines.
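[Editor's aside, before Tim's results below: the exact tokens the gimmick emits aren't shown in this message, but the gist is a findall over the subject line; a guess at the shape, with the token spelling assumed:]

    import re

    # \W+ matches runs of non-word characters, whitespace included.
    subject = "***FREE MONEY!!! Act now..."
    for run in re.findall(r'\W+', subject):
        token = 'subject:%r' % run   # e.g. "subject:'***'", "subject:'...'"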
-> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto 19 times]

false positive percentages
    0.050  0.000  won  -100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 3 to 2 won -33.33%
mean fp % went from 0.015 to 0.01 won -33.33%

false negative percentages
    0.071  0.071  tied
    0.071  0.071  tied
    0.000  0.000  tied
    0.143  0.143  tied
    0.143  0.143  tied
    0.214  0.214  tied
    0.143  0.143  tied
    0.143  0.143  tied
    0.214  0.214  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 16 to 16 tied
mean fn % went from 0.114285714286 to 0.114285714286 tied

ham mean                          ham sdev
  25.74   25.65   -0.35%            5.74    5.67   -1.22%
  25.69   25.61   -0.31%            5.56    5.50   -1.08%
  25.64   25.57   -0.27%            5.74    5.67   -1.22%
  25.74   25.66   -0.31%            5.61    5.54   -1.25%
  25.50   25.42   -0.31%            5.78    5.72   -1.04%
  25.58   25.51   -0.27%            5.44    5.39   -0.92%
  25.73   25.65   -0.31%            5.63    5.59   -0.71%
  25.69   25.61   -0.31%            5.47    5.41   -1.10%
  25.92   25.84   -0.31%            5.54    5.48   -1.08%
  25.90   25.81   -0.35%            5.88    5.81   -1.19%

ham mean and sdev for all runs
  25.71   25.63   -0.31%            5.64    5.58   -1.06%

spam mean                         spam sdev
  84.07   83.86   -0.25%            7.10    7.09   -0.14%
  83.83   83.64   -0.23%            6.84    6.83   -0.15%
  83.46   83.27   -0.23%            6.80    6.81   +0.15%
  84.03   83.82   -0.25%            6.88    6.88   +0.00%
  84.08   83.89   -0.23%            6.68    6.65   -0.45%
  83.96   83.78   -0.21%            6.99    6.96   -0.43%
  83.62   83.42   -0.24%            6.84    6.82   -0.29%
  84.04   83.86   -0.21%            6.71    6.71   +0.00%
  84.08   83.88   -0.24%            7.01    6.98   -0.43%
  83.97   83.75   -0.26%            6.65    6.65   +0.00%

spam mean and sdev for all runs
  83.91   83.72   -0.23%            6.85    6.84   -0.15%

ham/spam mean difference: 58.20 58.09 -0.11

This is consistent but weak. Staring at the false negatives shows that it's moving them "in the right direction", though, and histogram analysis says something stronger:

-> best cutoff for all runs: 0.55
-> with weighted total 10*2 fp + 11 fn = 31
-> fp rate 0.01%  fn rate 0.0786%

That is, if I had run at spam_cutoff 0.55 instead of 0.56, it would have been a pure win, leaving f-p alone but dropping 5(!) of the f-n.

From anthonybaxter@users.sourceforge.net Mon Sep 30 05:02:33 2002 From: anthonybaxter@users.sourceforge.net (Anthony Baxter) Date: Sun, 29 Sep 2002 21:02:33 -0700 Subject: [Spambayes-checkins] website related.ht,1.1.1.1,1.2 Message-ID: Update of /cvsroot/spambayes/website In directory usw-pr-cvs1:/tmp/cvs-serv4280 Modified Files: related.ht Log Message: added PASP. Index: related.ht =================================================================== RCS file: /cvsroot/spambayes/website/related.ht,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** related.ht 19 Sep 2002 08:40:55 -0000 1.1.1.1 --- related.ht 30 Sep 2002 04:02:31 -0000 1.2 *************** *** 11,14 **** --- 11,15 ----
  • Eric Raymond's bogofilter, a C code bayesian filter.
  • ifile, a Naive Bayes classification system. +
  • PASP, the Python Anti-Spam Proxy - a POP3 proxy for filtering email. Also uses Bayesian-ish classification.
  • ...
From richiehindle@users.sourceforge.net Mon Sep 30 21:13:42 2002 From: richiehindle@users.sourceforge.net (Richie Hindle) Date: Mon, 30 Sep 2002 13:13:42 -0700 Subject: [Spambayes-checkins] spambayes pop3proxy.py,1.5,1.6 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv32034 Modified Files: pop3proxy.py Log Message: Use options.spam_cutoff instead of hammie.SPAM_THRESHOLD - the latter is far too high under the new default scoring scheme (I've sent a separate heads-up to Neale about this). Index: pop3proxy.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** pop3proxy.py 28 Sep 2002 22:24:22 -0000 1.5 --- pop3proxy.py 30 Sep 2002 20:13:39 -0000 1.6 *************** *** 37,40 **** --- 37,41 ---- import socket, asyncore, asynchat import classifier, tokenizer, hammie + from Options import options HEADER_FORMAT = '%s: %%s\r\n' % hammie.DISPHEADER *************** *** 344,348 **** # it's been classified. prob = self.bayes.spamprob(tokenizer.tokenize(message)) ! if prob >= hammie.SPAM_THRESHOLD: disposition = "Yes" else: --- 345,349 ---- # it's been classified. prob = self.bayes.spamprob(tokenizer.tokenize(message)) ! if prob > options.spam_cutoff: disposition = "Yes" else: From montanaro@users.sourceforge.net Mon Sep 30 22:56:29 2002 From: montanaro@users.sourceforge.net (Skip Montanaro) Date: Mon, 30 Sep 2002 14:56:29 -0700 Subject: [Spambayes-checkins] spambayes Options.py,1.39,1.40 tokenizer.py,1.45,1.46 Message-ID: Update of /cvsroot/spambayes/spambayes In directory usw-pr-cvs1:/tmp/cvs-serv8971 Modified Files: Options.py tokenizer.py Log Message: allow users to disable the long word skip tokens (e.g. "skip:c 70") under the assumption that people who do receive mail which contains attachments will be penalized. Index: Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/Options.py,v retrieving revision 1.39 retrieving revision 1.40 diff -C2 -d -r1.39 -r1.40 *** Options.py 29 Sep 2002 18:03:39 -0000 1.39 --- Options.py 30 Sep 2002 21:56:27 -0000 1.40 *************** *** 93,96 **** --- 93,102 ---- mine_received_headers: False + # If your ham corpus is generated from sources which contain few, if any, + # attachments you probably want to leave this alone. If you have many + # legitimate correspondents who send you attachments (Excel spreadsheets, + # etc), you might want to set this to False. + generate_long_skips: True + [TestDriver] # These control various displays in class TestDriver.Driver, and Tester.Test. *************** *** 223,226 **** --- 229,233 ---- 'safe_headers': ('get', lambda s: Set(s.split())), 'count_all_header_lines': boolean_cracker, + 'generate_long_skips': boolean_cracker, 'mine_received_headers': boolean_cracker, 'check_octets': boolean_cracker, Index: tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v retrieving revision 1.45 retrieving revision 1.46 diff -C2 -d -r1.45 -r1.46 *** tokenizer.py 29 Sep 2002 20:20:57 -0000 1.45 --- tokenizer.py 30 Sep 2002 21:56:27 -0000 1.46 *************** *** 645,649 **** # XXX Figure out why, and/or see if some other way of summarizing # XXX this info has greater benefit. !
yield "skip:%c %d" % (word[0], n // 10 * 10) if has_highbit_char(word): hicount = 0 --- 645,650 ---- # XXX Figure out why, and/or see if some other way of summarizing # XXX this info has greater benefit. ! if options.generate_long_skips: ! yield "skip:%c %d" % (word[0], n // 10 * 10) if has_highbit_char(word): hicount = 0 From tim.one@comcast.net Mon Sep 30 23:07:04 2002 From: tim.one@comcast.net (Tim Peters) Date: Mon, 30 Sep 2002 18:07:04 -0400 Subject: [Spambayes-checkins] spambayes Options.py,1.39,1.40tokenizer.py,1.45,1.46 In-Reply-To: Message-ID: [Skip Montanaro] > allow users to disable the long word skip tokens (e.g "skip:c > 70") under the assumption that people who do receive mail which > contains attachements will be penalized. Skip, what is your reasoning here? We ignore attachments entirely unless they have text/* type. I don't see what skip tokens have to do with this. Besides, I named those tokens after you .