[Spambayes-checkins] spambayes Options.py,1.44,1.45 timcv.py,1.9,1.10
Tim Peters
tim_one@users.sourceforge.net
Wed, 09 Oct 2002 21:55:17 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26404
Modified Files:
Options.py timcv.py
Log Message:
Adapted from a patch by T. Alexander Popiel, this adds a new option (in
a new section):
[CV Driver]
build_each_classifier_from_scratch: False
When True, a cross-validation driver can be used safely -- but more
slowly -- with a central-limit test. timcv.py pays attention to this.
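To make the tradeoff concrete, here is a minimal sketch (not the spambayes API; `ToyClassifier`, `train`, and `untrain` are illustrative stand-ins, with word counting in place of real training) contrasting the two CV strategies the option selects between. The incremental driver trains once on all N sets and then unlearns/relearns one fold at a time; the from-scratch driver rebuilds a classifier from the other N-1 sets for every fold. Both yield the same per-fold training state -- the point of the new option is that central-limit schemes can't do the `untrain` step:

```python
from collections import Counter

class ToyClassifier:
    """Stand-in classifier: training just counts words."""
    def __init__(self):
        self.counts = Counter()

    def train(self, msgs):
        for m in msgs:
            self.counts.update(m.split())

    def untrain(self, msgs):
        # Incremental unlearning -- exactly what central-limit
        # schemes cannot do.
        for m in msgs:
            self.counts.subtract(m.split())

def cv_incremental(sets):
    """Train on everything once, then unlearn/relearn each fold."""
    c = ToyClassifier()
    for s in sets:
        c.train(s)
    snapshots = []
    for s in sets:
        c.untrain(s)                 # forget the fold under test
        snapshots.append(+c.counts)  # unary + drops zeroed entries
        c.train(s)                   # add it back for the next round
    return snapshots

def cv_from_scratch(sets):
    """Rebuild the classifier from the other N-1 sets every fold."""
    snapshots = []
    for i in range(len(sets)):
        c = ToyClassifier()
        for j, s in enumerate(sets):
            if j != i:
                c.train(s)
        snapshots.append(+c.counts)
    return snapshots

sets = [["free money now"], ["hi mom"], ["meeting at noon"]]
assert cv_incremental(sets) == cv_from_scratch(sets)
```

The from-scratch variant does O(N) training passes per fold instead of one learn plus one unlearn, which is why the option warns it "runs much slower than a CV driver usually runs".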
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.44
retrieving revision 1.45
diff -C2 -d -r1.44 -r1.45
*** Options.py 10 Oct 2002 00:23:51 -0000 1.44
--- Options.py 10 Oct 2002 04:55:15 -0000 1.45
***************
*** 173,176 ****
--- 173,190 ----
ham_directories: Data/Ham/Set%d
+ [CV Driver]
+ # A cross-validation driver takes N ham+spam sets, and builds N classifiers,
+ # training each on N-1 sets, and then predicting against the set not trained
+ # on. By default, it does this in a clever way, learning *and* unlearning
+ # sets as it goes along, so that it never needs to train on N-1 sets in one
+ # gulp after the first time. However, that can't always be done: in
+ # particular, the central-limit schemes can't unlearn incrementally, and can
+ # learn incrementally only via a form of cheating whose bad effects overall
+ # aren't yet known.
+ # So when desiring to run a central-limit test, set
+ # build_each_classifier_from_scratch to true. This gives correct results,
+ # but runs much slower than a CV driver usually runs.
+ build_each_classifier_from_scratch: False
+
[Classifier]
# The maximum number of extreme words to look at in a msg, where "extreme"
***************
*** 280,283 ****
--- 294,299 ----
'best_cutoff_fp_weight': float_cracker,
},
+ 'CV Driver': {'build_each_classifier_from_scratch': boolean_cracker,
+ },
'Classifier': {'max_discriminators': int_cracker,
'robinson_probability_x': float_cracker,
Index: timcv.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timcv.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** timcv.py 24 Sep 2002 05:37:11 -0000 1.9
--- timcv.py 10 Oct 2002 04:55:15 -0000 1.10
***************
*** 68,73 ****
if i > 0:
! # Forget this set.
! d.untrain(hamstream, spamstream)
# Predict this set.
--- 68,88 ----
if i > 0:
! if options.build_each_classifier_from_scratch:
! # Build a new classifier from the other sets.
! d.new_classifier()
!
! hname = "%s-%d, except %d" % (hamdirs[0], nsets, i+1)
! h2 = hamdirs[:]
! del h2[i]
!
! sname = "%s-%d, except %d" % (spamdirs[0], nsets, i+1)
! s2 = spamdirs[:]
! del s2[i]
!
! d.train(msgs.HamStream(hname, h2), msgs.SpamStream(sname, s2))
!
! else:
! # Forget this set.
! d.untrain(hamstream, spamstream)
# Predict this set.
***************
*** 75,79 ****
d.finishtest()
! if i < nsets - 1:
# Add this set back in.
d.train(hamstream, spamstream)
--- 90,94 ----
d.finishtest()
! if i < nsets - 1 and not options.build_each_classifier_from_scratch:
# Add this set back in.
d.train(hamstream, spamstream)
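The new from-scratch branch selects "all sets except i" by copying the directory list and deleting the held-out entry. A small sketch of that idiom (directory names here are illustrative):

```python
# Copy-and-delete selection of "all sets except i", as used by the
# new build_each_classifier_from_scratch branch in timcv.py.
hamdirs = ["Data/Ham/Set1", "Data/Ham/Set2", "Data/Ham/Set3"]
i = 1                    # hold out the second set
h2 = hamdirs[:]          # shallow copy, so the original is untouched
del h2[i]

assert h2 == ["Data/Ham/Set1", "Data/Ham/Set3"]
assert hamdirs[i] == "Data/Ham/Set2"   # original list unchanged
```

Copying before `del` matters because the same `hamdirs`/`spamdirs` lists are reused across all N folds of the loop.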