[Spambayes-checkins] spambayes README.txt,1.16,1.17 TestDriver.py,1.3,1.4 cmp.py,1.7,1.8 mboxtest.py,1.4,1.5 rates.py,1.3,1.4 timcv.py,1.3,1.4 timtest.py,1.25,1.26

Tim Peters <tim_one@users.sourceforge.net>
Fri, 13 Sep 2002 17:03:53 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv23741

Modified Files:
	README.txt TestDriver.py cmp.py mboxtest.py rates.py timcv.py 
	timtest.py 
Log Message:
Lots of small changes to support N-fold cross validation properly.
timcv.py now does this.  The pragmatic problem with giant pickle memos
and giant deepcopy memos is gone -- instead the test driver has to
take more care to train and untrain appropriate pieces explicitly.
This is actually easy (see timcv).
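The train/untrain rotation described above can be sketched as follows. This is a toy mock, not the real TestDriver.Driver or timcv code; the ToyDriver class and its bookkeeping are illustrative assumptions, but the loop shape mirrors what timcv.py now does.

```python
# Toy sketch of the N-fold c-v loop: train once on all folds but the
# first, then rotate -- untrain fold i, predict it, train it back.
# ToyDriver is a stand-in for TestDriver.Driver, tracking only which
# folds are currently trained so the rotation can be checked.

class ToyDriver:
    def __init__(self):
        self.trained = set()     # folds currently trained on
        self.predictions = []    # (fold predicted, folds trained on then)

    def train(self, fold):
        self.trained.add(fold)

    def untrain(self, fold):
        self.trained.remove(fold)

    def test(self, fold):
        self.predictions.append((fold, frozenset(self.trained)))

def cross_validate(driver, nsets):
    # Train on all folds except the first.
    for fold in range(1, nsets):
        driver.train(fold)
    # Rotate: each fold is predicted against all the others.
    for i in range(nsets):
        if i > 0:
            driver.untrain(i)   # forget this fold
        driver.test(i)          # predict it
        if i < nsets - 1:
            driver.train(i)     # add it back in

d = ToyDriver()
cross_validate(d, 10)
# Every fold was predicted exactly once, never against itself,
# always against the other nine.
for fold, trained_on in d.predictions:
    assert fold not in trained_on and len(trained_on) == 9
```

The point of the rotation is that each step trains or untrains one fold instead of deep-copying a classifier trained on everything, which is what made the giant pickle/deepcopy memos go away.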

TestDriver.Driver now prints statistics with a recognizable pattern
at the start of the line, so that rates.py doesn't feel so arbitrary
anymore.  rates.py and cmp.py were changed accordingly.  rates.py now
puts a lot more stuff in the summary, including accounts of how many ham
and spam were trained on, and predicted against, in each test run.
Driver() clients have to explicitly tell Driver when they want a new
classifier now; I changed timtest and mboxtest to do that, but am
not set up to exercise mboxtest.
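The shape of the change in timtest and mboxtest is small: each training run now starts with an explicit new_classifier() call. A hedged sketch, again with a toy stand-in for Driver (names here are illustrative):

```python
# Toy sketch of the NxN-grid loop in timtest.py after this change:
# the client asks for a fresh classifier before each training pair,
# instead of train() silently making one.

class ToyGridDriver:
    def __init__(self):
        self.nclassifiers = 0
        self.trained_pair = None
        self.runs = []   # (classifier #, trained pair, predicted pair)

    def new_classifier(self):
        self.nclassifiers += 1

    def train(self, pair):
        self.trained_pair = pair

    def test(self, pair):
        self.runs.append((self.nclassifiers, self.trained_pair, pair))

pairs = ['Set1', 'Set2', 'Set3']
d = ToyGridDriver()
for trainpair in pairs:
    d.new_classifier()              # explicit now
    d.train(trainpair)
    for testpair in pairs:
        if testpair != trainpair:   # timtest skips the diagonal
            d.test(testpair)

assert d.nclassifiers == len(pairs)                  # one per row
assert len(d.runs) == len(pairs) * (len(pairs) - 1)  # off-diagonal cells
```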

Driver, rates and cmp no longer make assumptions about the *kind* of
test being run, and work equally well for, e.g., NxN grids or N-fold
c-v.

rates.py also computes the average f-p and f-n rates now, and cmp.py
displays before-and-after values for those too.  Average rates are
intended to be used when doing N-fold c-v; they make less sense
for an NxN test grid.
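The averaging itself is just the arithmetic mean of the per-run percentages. A minimal sketch of the bookkeeping rates.py does (the helper name and the sample numbers are made up; rates.py actually parses them out of the "-> <stat>" lines):

```python
# Average f-p / f-n rates across N test runs: the plain mean of the
# per-run percentages, as rates.py now accumulates them.

def average_rates(runs):
    """runs: list of (fp_percent, fn_percent) pairs, one per test run."""
    ntests = len(runs)
    sumfp = sum(fp for fp, fn in runs)
    sumfn = sum(fn for fp, fn in runs)
    return sumfp / ntests, sumfn / ntests

# Per-run rates from a hypothetical 4-fold c-v run:
runs = [(0.025, 0.327), (0.000, 0.400), (0.050, 0.250), (0.025, 0.300)]
fpmean, fnmean = average_rates(runs)
```

Under c-v every run predicts against a disjoint fold of the same size, so the mean is a sensible single number; on an NxN grid the runs overlap heavily, which is why the average means less there.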


Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** README.txt	13 Sep 2002 19:33:04 -0000	1.16
--- README.txt	14 Sep 2002 00:03:51 -0000	1.17
***************
*** 14,18 ****
  later -- as is, the false positive rate has gotten too small to measure
  reliably across test sets with 4000 hams + 2750 spams, but the false
! negative rate is still over 1%.
  
  The code here depends in various ways on the latest Python from CVS
--- 14,19 ----
  later -- as is, the false positive rate has gotten too small to measure
  reliably across test sets with 4000 hams + 2750 spams, but the false
! negative rate is still over 1%.  Later:  the f-n rate has also gotten
! too small to measure reliably across that much training data.
  
  The code here depends in various ways on the latest Python from CVS
***************
*** 47,56 ****
  
  TestDriver.py
!     A higher layer of test helpers, building on Tester above.  It's
!     quite usable as-is for building simple test drivers, and more
!     complicated ones up to NxN test grids.  It's in the process of being
!     extended to allow easy building of N-way cross validation drivers
!     (the trick to that is doing so efficiently).  See also rates.py
!     and cmp.py below.
  
  
--- 48,55 ----
  
  TestDriver.py
!     A flexible higher layer of test helpers, building on Tester above.
!     For example, it's usable for building simple test drivers, NxN test
!     grids, and N-fold cross validation drivers.  See also rates.py and
!     cmp.py below.
  
  
***************
*** 71,75 ****
      A concrete test driver like mboxtest.py, but working with "a
      standard" test data setup (see below) rather than the specialized
!     mboxtest setup.
  
  timcv.py
--- 70,74 ----
      A concrete test driver like mboxtest.py, but working with "a
      standard" test data setup (see below) rather than the specialized
!     mboxtest setup.  This runs an NxN test grid, skipping the diagonal.
  
  timcv.py
***************
*** 82,92 ****
  ==============
  rates.py
!     Scans the output (so far) from timtest.py, and captures summary
!     statistics.
  
  cmp.py
      Given two summary files produced by rates.py, displays an account
      of all the f-p and f-n rates side-by-side, along with who won which
!     (etc), and the change in total # of f-ps and f-n.
  
  
--- 81,92 ----
  ==============
  rates.py
!     Scans the output (so far) produced by TestDriver.Drive(), and captures
!     summary statistics.
  
  cmp.py
      Given two summary files produced by rates.py, displays an account
      of all the f-p and f-n rates side-by-side, along with who won which
!     (etc), the change in total # of unique false positives and negatives,
!     and the change in average f-p and f-n rates.
  
  
***************
*** 127,136 ****
  Standard Test Data Setup
  ========================
- [Caution:  I'm going to switch this to support N-way cross validation,
-  instead of an NxN test grid.  The only effect on the directory structure
-  here is that you'll want more directories with fewer msgs in each
-  (splitting the data at random into 10 pairs should work very well).
- ]
- 
  Barry gave me mboxes, but the spam corpus I got off the web had one spam
  per file, and it only took two days of extreme pain to realize that one msg
--- 127,130 ----
***************
*** 142,145 ****
--- 136,142 ----
  
  The directory structure under my spambayes directory looks like so:
+ [But due to a better testing infrastructure, I'm going to spread this
+  across 20 subdirectories under Spam and under Ham, and use groups
+  of 10 for 10-fold cross validation]
  
  Data/
***************
*** 159,167 ****
  
  If you use the same names and structure, huge mounds of the tedious testing
! code will work as-is.  The more Set directories the merrier, although
! you'll hit a point of diminishing returns if you exceed 10.  The "reservoir"
! directory contains a few thousand other random hams.  When a ham is found
! that's really spam, I delete it, and then the rebal.py utility moves in a
! message at random from the reservoir to replace it.  If I had it to do over
  again, I think I'd move such spam into a Spam set (chosen at random),
  instead of deleting it.
--- 156,164 ----
  
  If you use the same names and structure, huge mounds of the tedious testing
! code will work as-is.  The more Set directories the merrier, although you
! want at least a few hundred messages in each one.  The "reservoir" directory
! contains a few thousand other random hams.  When a ham is found that's
! really spam, I delete it, and then the rebal.py utility moves in a message
! at random from the reservoir to replace it.  If I had it to do over
  again, I think I'd move such spam into a Spam set (chosen at random),
  instead of deleting it.
***************
*** 172,176 ****
      <http://www.em.ca/~bruceg/spam/>
  
! The sets are grouped into 5 pairs in the obvious way:  Spam/Set1 with
  Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
  that pair, then runs predictions on each of the other 4 pairs.  In effect,
--- 169,173 ----
      <http://www.em.ca/~bruceg/spam/>
  
! The sets are grouped into pairs in the obvious way:  Spam/Set1 with
  Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
  that pair, then runs predictions on each of the other 4 pairs.  In effect,
***************
*** 178,179 ****
--- 175,186 ----
  to avoid predicting against the same set trained on, except that it
  takes more time and seems the least interesting thing to try.
+ 
+ Later, support for N-fold cross validation testing was added, which allows
+ more accurate measurement of error rates with smaller amounts of training
+ data.  That's recommended now.
+ 
+ CAUTION:  The partitioning of your corpora across directories should
+ be random.  If it isn't, bias creeps into the test results.  This is
+ usually screamingly obvious under the NxN grid method (rates vary by a
+ factor of 10 or more across training sets, and even within runs against
+ a single training set), but harder to spot using N-fold c-v.

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** TestDriver.py	13 Sep 2002 19:33:04 -0000	1.3
--- TestDriver.py	14 Sep 2002 00:03:51 -0000	1.4
***************
*** 1,11 ****
  # Loop:
! #     # Set up a new base classifier for testing.
! #     train(ham, spam)
  #     # Run tests against (possibly variants of) this classifier.
  #     Loop:
! #         Optional:
! #             # Forget training for some subset of ham and spam.  This
! #             # works against the base classifier trained at the start.
! #             forget(ham, spam)
  #         # Predict against other data.
  #         Loop:
--- 1,15 ----
  # Loop:
! #     Optional:
! #         # Set up a new base classifier for testing.
! #         new_classifier()
  #     # Run tests against (possibly variants of) this classifier.
  #     Loop:
! #         Loop:
! #             Optional:
! #                 # train on more ham and spam
! #                 train(ham, spam)
! #             Optional:
! #                 # Forget training for some subset of ham and spam.
! #                 untrain(ham, spam)
  #         # Predict against other data.
  #         Loop:
***************
*** 89,121 ****
          self.global_spam_hist = Hist(options.nbuckets)
          self.ntimes_finishtest_called = 0
  
!     def train(self, ham, spam):
          c = self.classifier = classifier.GrahamBayes()
!         t = self.tester = Tester.Test(c)
! 
!         print "Training on", ham, "&", spam, "...",
!         t.train(ham, spam)
!         print c.nham, "hams &", c.nspam, "spams"
! 
          self.trained_ham_hist = Hist(options.nbuckets)
          self.trained_spam_hist = Hist(options.nbuckets)
  
!     def forget(self, ham, spam):
!         import copy
! 
!         print "    forgetting", ham, "&", spam, "...",
          c = self.classifier
          nham, nspam = c.nham, c.nspam
!         c = copy.deepcopy(c)
!         self.tester.set_classifier(c)
  
          self.tester.untrain(ham, spam)
          print nham - c.nham, "hams &", nspam - c.nspam, "spams"
  
-         self.global_ham_hist += self.trained_ham_hist
-         self.global_spam_hist += self.trained_spam_hist
-         self.trained_ham_hist = Hist(options.nbuckets)
-         self.trained_spam_hist = Hist(options.nbuckets)
- 
      def finishtest(self):
          if options.show_histograms:
--- 93,118 ----
          self.global_spam_hist = Hist(options.nbuckets)
          self.ntimes_finishtest_called = 0
+         self.new_classifier()
  
!     def new_classifier(self):
          c = self.classifier = classifier.GrahamBayes()
!         self.tester = Tester.Test(c)
          self.trained_ham_hist = Hist(options.nbuckets)
          self.trained_spam_hist = Hist(options.nbuckets)
  
!     def train(self, ham, spam):
!         print "-> Training on", ham, "&", spam, "...",
          c = self.classifier
          nham, nspam = c.nham, c.nspam
!         self.tester.train(ham, spam)
!         print c.nham - nham, "hams &", c.nspam- nspam, "spams"
  
+     def untrain(self, ham, spam):
+         print "-> Forgetting", ham, "&", spam, "...",
+         c = self.classifier
+         nham, nspam = c.nham, c.nspam
          self.tester.untrain(ham, spam)
          print nham - c.nham, "hams &", nspam - c.nspam, "spams"
  
      def finishtest(self):
          if options.show_histograms:
***************
*** 124,127 ****
--- 121,126 ----
          self.global_ham_hist += self.trained_ham_hist
          self.global_spam_hist += self.trained_spam_hist
+         self.trained_ham_hist = Hist(options.nbuckets)
+         self.trained_spam_hist = Hist(options.nbuckets)
  
          self.ntimes_finishtest_called += 1
***************
*** 163,177 ****
  
          t.reset_test_results()
!         print "    testing against", ham, "&", spam, "...",
          t.predict(spam, True, new_spam)
          t.predict(ham, False, new_ham)
!         print t.nham_tested, "hams &", t.nspam_tested, "spams"
  
!         print "    false positive:", t.false_positive_rate()
!         print "    false negative:", t.false_negative_rate()
  
          newfpos = Set(t.false_positives()) - self.falsepos
          self.falsepos |= newfpos
!         print "    new false positives:", [e.tag for e in newfpos]
          if not options.show_false_positives:
              newfpos = ()
--- 162,179 ----
  
          t.reset_test_results()
!         print "-> Predicting", ham, "&", spam, "..."
          t.predict(spam, True, new_spam)
          t.predict(ham, False, new_ham)
!         print "-> <stat> tested", t.nham_tested, "hams &", t.nspam_tested, \
!               "spams against", c.nham, "hams &", c.nspam, "spams"
  
!         print "-> <stat> false positive %:", t.false_positive_rate()
!         print "-> <stat> false negative %:", t.false_negative_rate()
  
          newfpos = Set(t.false_positives()) - self.falsepos
          self.falsepos |= newfpos
!         print "-> <stat> %d new false positives" % len(newfpos)
!         if newfpos:
!             print "    new fp:", [e.tag for e in newfpos]
          if not options.show_false_positives:
              newfpos = ()
***************
*** 183,187 ****
          newfneg = Set(t.false_negatives()) - self.falseneg
          self.falseneg |= newfneg
!         print "    new false negatives:", [e.tag for e in newfneg]
          if not options.show_false_negatives:
              newfneg = ()
--- 185,191 ----
          newfneg = Set(t.false_negatives()) - self.falseneg
          self.falseneg |= newfneg
!         print "-> <stat> %d new false negatives" % len(newfneg)
!         if newfneg:
!             print "    new fn:", [e.tag for e in newfneg]
          if not options.show_false_negatives:
              newfneg = ()

Index: cmp.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/cmp.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** cmp.py	12 Sep 2002 19:35:14 -0000	1.7
--- cmp.py	14 Sep 2002 00:03:51 -0000	1.8
***************
*** 16,39 ****
  #   list of all f-n rates,
  #   total f-p,
! #   total f-n)
  # from summary file f.
  def suck(f):
      fns = []
      fps = []
      while 1:
!         line = f.readline()
          if line.startswith('total'):
              break
!         if not line.startswith('Training'):
!             # A line with an f-p rate and an f-n rate.
!             p, n = map(float, line.split())
!             fps.append(p)
!             fns.append(n)
  
!     # "total false pos 8 0.04"
!     # "total false neg 249 1.81090909091"
!     fptot = int(line.split()[-2])
!     fntot = int(f.readline().split()[-2])
!     return fps, fns, fptot, fntot
  
  def tag(p1, p2):
--- 16,49 ----
  #   list of all f-n rates,
  #   total f-p,
! #   total f-n,
! #   average f-p rate,
! #   average f-n rate)
  # from summary file f.
  def suck(f):
      fns = []
      fps = []
+     get = f.readline
      while 1:
!         line = get()
!         if line.startswith('-> <stat> tested'):
!             print line,
!         if line.startswith('-> '):
!             continue
          if line.startswith('total'):
              break
!         # A line with an f-p rate and an f-n rate.
!         p, n = map(float, line.split())
!         fps.append(p)
!         fns.append(n)
  
!     # "total unique false pos 0"
!     # "total unique false neg 0"
!     # "average fp % 0.0"
!     # "average fn % 0.0"
!     fptot = int(line.split()[-1])
!     fntot = int(get().split()[-1])
!     fpmean = float(get().split()[-1])
!     fnmean = float(get().split()[-1])
!     return fps, fns, fptot, fntot, fpmean, fnmean
  
  def tag(p1, p2):
***************
*** 60,72 ****
      print
  
- fp1, fn1, fptot1, fntot1 = suck(file(f1n + '.txt'))
- fp2, fn2, fptot2, fntot2 = suck(file(f2n + '.txt'))
  
  print f1n, '->', f2n
  
  print
  print "false positive percentages"
  dump(fp1, fp2)
  print "total unique fp went from", fptot1, "to", fptot2, tag(fptot1, fptot2)
  
  print
--- 70,84 ----
      print
  
  
  print f1n, '->', f2n
  
+ fp1, fn1, fptot1, fntot1, fpmean1, fnmean1 = suck(file(f1n + '.txt'))
+ fp2, fn2, fptot2, fntot2, fpmean2, fnmean2 = suck(file(f2n + '.txt'))
+ 
  print
  print "false positive percentages"
  dump(fp1, fp2)
  print "total unique fp went from", fptot1, "to", fptot2, tag(fptot1, fptot2)
+ print "mean fp % went from", fpmean1, "to", fpmean2, tag(fpmean1, fpmean2)
  
  print
***************
*** 74,75 ****
--- 86,88 ----
  dump(fn1, fn2)
  print "total unique fn went from", fntot1, "to", fntot2, tag(fntot1, fntot2)
+ print "mean fn % went from", fnmean1, "to", fnmean2, tag(fnmean1, fnmean2)

Index: mboxtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/mboxtest.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** mboxtest.py	13 Sep 2002 16:26:58 -0000	1.4
--- mboxtest.py	14 Sep 2002 00:03:51 -0000	1.5
***************
*** 166,169 ****
--- 166,170 ----
  
      for iham, ispam in testsets:
+         driver.new_classifier()
          driver.train(mbox(ham, iham), mbox(spam, ispam))
          for ihtest, istest in testsets:

Index: rates.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rates.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** rates.py	12 Sep 2002 19:35:14 -0000	1.3
--- rates.py	14 Sep 2002 00:03:51 -0000	1.4
***************
*** 2,6 ****
  
  """
! rates.py basename
  
  Assuming that file
--- 2,6 ----
  
  """
! rates.py basename ...
  
  Assuming that file
***************
*** 19,38 ****
  """
  
- import re
  import sys
  
  """
! Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
!     testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
!     false positive: 0.025
!     false negative: 1.34545454545
!     new false positives: ['Data/Ham/Set2/66645.txt']
  """
- pat1 = re.compile(r'\s*Training on ').match
- pat2 = re.compile(r'\s+false (positive|negative): (.*)').match
- pat3 = re.compile(r"\s+new false (positives|negatives): \[(.+)\]").match
  
  def doit(basename):
      ifile = file(basename + '.txt')
      oname = basename + 's.txt'
      ofile = file(oname, 'w')
--- 19,38 ----
  """
  
  import sys
  
  """
! -> Training on Data/Ham/Set2-3 & Data/Spam/Set2-3 ... 8000 hams & 5500 spams
! -> Predicting Data/Ham/Set1 & Data/Spam/Set1 ...
! -> <stat> tested 4000 hams & 2750 spams against 8000 hams & 5500 spams
! -> <stat> false positive %: 0.025
! -> <stat> false negative %: 0.327272727273
! -> <stat> 1 new false positives
  """
  
  def doit(basename):
      ifile = file(basename + '.txt')
+     interesting = filter(lambda line: line.startswith('-> '), ifile)
+     ifile.close()
+ 
      oname = basename + 's.txt'
      ofile = file(oname, 'w')
***************
*** 44,83 ****
          print >> ofile, msg
  
!     nfn = nfp = 0
      ntrainedham = ntrainedspam = 0
!     for line in ifile:
!         "Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams"
!         m = pat1(line)
!         if m:
!             dump(line[:-1])
!             fields = line.split()
              ntrainedham += int(fields[-5])
              ntrainedspam += int(fields[-2])
              continue
  
!         "false positive: 0.025"
!         "false negative: 1.34545454545"
!         m = pat2(line)
!         if m:
!             kind, guts = m.groups()
!             guts = float(guts)
              if kind == 'positive':
!                 lastval = guts
              else:
!                 dump('    %7.3f %7.3f' % (lastval, guts))
              continue
  
!         "new false positives: ['Data/Ham/Set2/66645.txt']"
!         m = pat3(line)
!         if m:   # note that it doesn't match at all if the list is "[]"
!             kind, guts = m.groups()
!             n = len(guts.split())
              if kind == 'positives':
!                 nfp += n
              else:
!                 nfn += n
  
!     dump('total false pos', nfp, nfp * 1e2 / ntrainedham)
!     dump('total false neg', nfn, nfn * 1e2 / ntrainedspam)
  
  for name in sys.argv[1:]:
--- 44,91 ----
          print >> ofile, msg
  
!     ntests = nfn = nfp = 0
!     sumfnrate = sumfprate = 0.0
      ntrainedham = ntrainedspam = 0
! 
!     for line in interesting:
!         dump(line[:-1])
!         fields = line.split()
! 
!         # 0      1      2    3    4 5    6                 -5  -4 -3   -2    -1
!         #-> <stat> tested 4000 hams & 2750 spams against 8000 hams & 5500 spams
!         if line.startswith('-> <stat> tested '):
              ntrainedham += int(fields[-5])
              ntrainedspam += int(fields[-2])
+             ntests += 1
              continue
  
!         #  0      1     2        3
!         # -> <stat> false positive %: 0.025
!         # -> <stat> false negative %: 0.327272727273
!         if line.startswith('-> <stat> false '):
!             kind = fields[3]
!             percent = float(fields[-1])
              if kind == 'positive':
!                 sumfprate += percent
!                 lastval = percent
              else:
!                 sumfnrate += percent
!                 dump('    %7.3f %7.3f' % (lastval, percent))
              continue
  
!         #  0      1 2   3     4         5
!         # -> <stat> 1 new false positives
!         if fields[3] == 'new' and fields[4] == 'false':
!             kind = fields[-1]
!             count = int(fields[2])
              if kind == 'positives':
!                 nfp += count
              else:
!                 nfn += count
  
!     dump('total unique false pos', nfp)
!     dump('total unique false neg', nfn)
!     dump('average fp %', sumfprate / ntests)
!     dump('average fn %', sumfnrate / ntests)
  
  for name in sys.argv[1:]:

Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** timcv.py	13 Sep 2002 20:35:37 -0000	1.3
--- timcv.py	14 Sep 2002 00:03:51 -0000	1.4
***************
*** 77,85 ****
  
      d = Driver()
!     # Train it on all the data.
!     d.train(MsgStream("%s-%d" % (hamdirs[0], nsets), hamdirs),
!             MsgStream("%s-%d" % (spamdirs[0], nsets), spamdirs))
  
!     # Now run nsets times, removing one pair per run.
      for i in range(nsets):
          h = hamdirs[i]
--- 77,85 ----
  
      d = Driver()
!     # Train it on all sets except the first.
!     d.train(MsgStream("%s-%d" % (hamdirs[1], nsets), hamdirs[1:]),
!             MsgStream("%s-%d" % (spamdirs[1], nsets), spamdirs[1:]))
  
!     # Now run nsets times, predicting pair i against all except pair i.
      for i in range(nsets):
          h = hamdirs[i]
***************
*** 87,93 ****
          hamstream = MsgStream(h, [h])
          spamstream = MsgStream(s, [s])
!         d.forget(hamstream, spamstream)
          d.test(hamstream, spamstream)
          d.finishtest()
      d.alldone()
  
--- 87,103 ----
          hamstream = MsgStream(h, [h])
          spamstream = MsgStream(s, [s])
! 
!         if i > 0:
!             # Forget this set.
!             d.untrain(hamstream, spamstream)
! 
!         # Predict this set.
          d.test(hamstream, spamstream)
          d.finishtest()
+ 
+         if i < nsets - 1:
+             # Add this set back in.
+             d.train(hamstream, spamstream)
+ 
      d.alldone()
  

Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** timtest.py	13 Sep 2002 18:48:42 -0000	1.25
--- timtest.py	14 Sep 2002 00:03:51 -0000	1.26
***************
*** 74,78 ****
          random.seed(hash(directory))
          random.shuffle(all)
!         for fname in all[-1500:-1000:]:
              yield Msg(directory, fname)
  
--- 74,78 ----
          random.seed(hash(directory))
          random.shuffle(all)
!         for fname in all[-1500:-1300:]:
              yield Msg(directory, fname)
  
***************
*** 89,92 ****
--- 89,93 ----
      d = Driver()
      for spamdir, hamdir in spamhamdirs:
+         d.new_classifier()
          d.train(MsgStream(hamdir), MsgStream(spamdir))
          for sd2, hd2 in spamhamdirs: