[Spambayes-checkins] spambayes Options.py,1.44,1.45 timcv.py,1.9,1.10

Wed, 09 Oct 2002 21:55:17 -0700

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26404

Modified Files:
	Options.py timcv.py 
Log Message:
Adapted from a patch by T. Alexander Popiel, this adds a new option (in
a new section):

[CV Driver]
build_each_classifier_from_scratch: False

When True, a cross-validation driver can be used safely (but more
slowly) with a central-limit test.  timcv.py pays attention to this.
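
The difference between the two modes can be sketched with a toy example.
This is a minimal illustration, not the timcv.py code: ToyClassifier,
cross_validate and the word-count "training" are hypothetical stand-ins,
and the initial training on every set except the first is an assumption
about code that the diff below does not show.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a classifier that can unlearn exactly."""

    def __init__(self):
        self.counts = Counter()

    def train(self, words):
        self.counts.update(words)

    def untrain(self, words):
        self.counts.subtract(words)
        self.counts += Counter()  # drop entries whose count fell to zero


def cross_validate(sets, from_scratch):
    """Return the training state in effect before predicting each set."""
    states = []
    clf = ToyClassifier()
    # Assumed starting point: train on every set except the first.
    for s in sets[1:]:
        clf.train(s)

    for i, held_out in enumerate(sets):
        if i > 0:
            if from_scratch:
                # Build a new classifier from the other sets.
                clf = ToyClassifier()
                for j, s in enumerate(sets):
                    if j != i:
                        clf.train(s)
            else:
                # Forget this set.
                clf.untrain(held_out)

        # A real driver would predict against held_out here.
        states.append(dict(clf.counts))

        if i < len(sets) - 1 and not from_scratch:
            # Add this set back in.
            clf.train(held_out)
    return states


sets = [["a", "b"], ["b", "c"], ["a", "c", "c"]]
incremental = cross_validate(sets, from_scratch=False)
scratch = cross_validate(sets, from_scratch=True)
```

For a classifier that can unlearn exactly, as this toy one can, both
modes reach the same training state before each prediction; the
from-scratch mode only does more work to get there.  A scheme that
cannot unlearn incrementally has to take the from-scratch path.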

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.44
retrieving revision 1.45
diff -C2 -d -r1.44 -r1.45
*** Options.py	10 Oct 2002 00:23:51 -0000	1.44
--- Options.py	10 Oct 2002 04:55:15 -0000	1.45
***************
*** 173,176 ****
--- 173,190 ----
  ham_directories: Data/Ham/Set%d

+ [CV Driver]
+ # A cross-validation driver takes N ham+spam sets, and builds N classifiers,
+ # training each on N-1 sets, and then predicting against the set not trained
+ # on.  By default, it does this in a clever way, learning *and* unlearning
+ # sets as it goes along, so that it never needs to train on N-1 sets in one
+ # gulp after the first time.  However, that can't always be done:  in
+ # particular, the central-limit schemes can't unlearn incrementally, and can
+ # learn incrementally only via a form of cheating whose bad effects overall
+ # aren't yet known.
+ # So when desiring to run a central-limit test, set
+ # build_each_classifier_from_scratch to true.  This gives correct results,
+ # but runs much slower than a CV driver usually runs.
+ build_each_classifier_from_scratch: False
+ 
  [Classifier]
  # The maximum number of extreme words to look at in a msg, where "extreme"
***************
*** 280,283 ****
--- 294,299 ----
                     'best_cutoff_fp_weight': float_cracker,
                    },
+     'CV Driver': {'build_each_classifier_from_scratch': boolean_cracker,
+                  },
      'Classifier': {'max_discriminators': int_cracker,
                     'robinson_probability_x': float_cracker,

Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** timcv.py	24 Sep 2002 05:37:11 -0000	1.9
--- timcv.py	10 Oct 2002 04:55:15 -0000	1.10
***************
*** 68,73 ****

          if i > 0:
!             # Forget this set.
!             d.untrain(hamstream, spamstream)

          # Predict this set.
--- 68,88 ----

          if i > 0:
!             if options.build_each_classifier_from_scratch:
!                 # Build a new classifier from the other sets.
!                 d.new_classifier()
! 
!                 hname = "%s-%d, except %d" % (hamdirs[0], nsets, i+1)
!                 h2 = hamdirs[:]
!                 del h2[i]
! 
!                 sname = "%s-%d, except %d" % (spamdirs[0], nsets, i+1)
!                 s2 = spamdirs[:]
!                 del s2[i]
! 
!                 d.train(msgs.HamStream(hname, h2), msgs.SpamStream(sname, s2))
! 
!             else:
!                 # Forget this set.
!                 d.untrain(hamstream, spamstream)

          # Predict this set.
***************
*** 75,79 ****
          d.finishtest()

!         if i < nsets - 1:
              # Add this set back in.
              d.train(hamstream, spamstream)
--- 90,94 ----
          d.finishtest()

!         if i < nsets - 1 and not options.build_each_classifier_from_scratch:
              # Add this set back in.
              d.train(hamstream, spamstream)
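
The copy-and-delete idiom in the new timcv.py branch can be tried on its
own.  The sketch below is only an illustration: leave_one_out is a
hypothetical helper, and the five directory names are assumed values
following the ham_directories pattern Data/Ham/Set%d shown in Options.py.

```python
def leave_one_out(dirs, i):
    """Return (stream name, dirs without entry i), as in the patch."""
    name = "%s-%d, except %d" % (dirs[0], len(dirs), i + 1)
    rest = dirs[:]  # copy, so the caller's list is left untouched
    del rest[i]
    return name, rest


# Assumed example input: five ham sets.
hamdirs = ["Data/Ham/Set%d" % k for k in range(1, 6)]
hname, h2 = leave_one_out(hamdirs, 2)
print(hname)
print(h2)
```

Copying with dirs[:] before the del matters because the full list is
needed again on every later pass through the loop.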