[Spambayes-checkins] spambayes Options.py,1.32,1.33 TestDriver.py,1.12,1.13

Wed, 25 Sep 2002 11:39:22 -0700

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15075

Modified Files:
	Options.py TestDriver.py 
Log Message:
New option best_cutoff_fp_weight.  The histogram analysis code now
finds the cutoffs (histogram bucket boundaries) that minimize

    best_cutoff_fp_weight * (# false positives) + (# false negatives)

By default it's 1 (minimize the total # of misclassified msgs).  If, e.g.,
you're willing to endure 100 false negatives to save 1 false positive,
set it to 100.  Don't be surprised if your f-n rate zooms, though!
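
The weighted search described above can be sketched as a standalone
Python 3 function.  This is only an illustration, not the committed code:
best_cutoffs, ham_buckets, spam_buckets and the sample counts are made up
here, and it assumes a message is called spam when its score is at or
above the cutoff.

```python
def best_cutoffs(ham_buckets, spam_buckets, fp_weight=1.0):
    """Return (best_total, [(cutoff, fp, fn), ...]) over bucket boundaries.

    ham_buckets[i] and spam_buckets[i] are counts of messages whose
    scores fall in bucket i.  All names here are hypothetical, chosen
    for the example rather than taken from spambayes.
    """
    assert len(ham_buckets) == len(spam_buckets)
    nbuckets = len(ham_buckets)
    # Cutoff 0.0: everything is called spam, so every ham is a false
    # positive and no spam is missed.
    fp = sum(ham_buckets)
    fn = 0
    best_total = fp_weight * fp + fn
    bests = [(0.0, fp, fn)]
    for i in range(nbuckets):
        # Moving the cutoff past bucket i clears that bucket's ham and
        # turns that bucket's spam into false negatives.
        fp -= ham_buckets[i]
        fn += spam_buckets[i]
        total = fp_weight * fp + fn
        if total <= best_total:
            if total < best_total:
                best_total = total
                bests = []
            bests.append(((i + 1) / nbuckets, fp, fn))
    return best_total, bests


# Made-up 4-bucket histograms (100 ham, 100 spam).
ham = [90, 8, 1, 1]
spam = [1, 2, 7, 90]

total_1, bests_1 = best_cutoffs(ham, spam, fp_weight=1)
total_100, bests_100 = best_cutoffs(ham, spam, fp_weight=100)
```

With these counts, a weight of 1 picks cutoff 0.5 (2 fp + 3 fn), while a
weight of 100 pushes the cutoff to 1.0: no false positives, but all 100
spam become false negatives.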

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** Options.py	25 Sep 2002 05:22:47 -0000	1.32
--- Options.py	25 Sep 2002 18:39:17 -0000	1.33
***************
*** 102,108 ****
  # well as 0.90 on Tim's large c.l.py data).
  # For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far.  Note that you can easily deduce the effect
! # of setting spam_cutoff to any particular value by studying the score
! # histograms -- there's no need to run a test again to see what would happen.
  spam_cutoff: 0.90

--- 102,106 ----
  # well as 0.90 on Tim's large c.l.py data).
  # For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far.
  spam_cutoff: 0.90

***************
*** 111,119 ****
  show_histograms: True

! # When compute_best_cutoffs_from_histograms is enabled, after the display
! # of a ham+spam histogram pair, a listing is given of all the cutoff scores
! # (coinciding with a histogram boundary) that minimize the total number of
! # misclassified messages (false positives + false negatives).
  compute_best_cutoffs_from_histograms: True

  # Display spam when
--- 109,127 ----
  show_histograms: True

! # After the display of a ham+spam histogram pair, you can get a listing of
! # all the cutoff values (coinciding with histogram bucket boundaries) that
! # minimize
! #
! #      best_cutoff_fp_weight * (# false positives) + (# false negatives)
! #
! # By default, best_cutoff_fp_weight is 1, and so the cutoffs that minimize
! # the total number of misclassified messages (fp+fn) are shown.  If you hate
! # fp more than fn, set the weight to something larger than 1.  For example,
! # if you're willing to endure 100 false negatives to save 1 false positive,
! # set it to 100.
! # Note:  You may wish to increase nbuckets, to give this scheme more cutoff
! # values to analyze.
  compute_best_cutoffs_from_histograms: True
+ best_cutoff_fp_weight: 1

  # Display spam when
***************
*** 254,257 ****
--- 262,266 ----
                     'ham_directories': string_cracker,
                     'compute_best_cutoffs_from_histograms': boolean_cracker,
+                    'best_cutoff_fp_weight': float_cracker,
                    },
      'Classifier': {'hambias': float_cracker,

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** TestDriver.py	25 Sep 2002 05:22:47 -0000	1.12
--- TestDriver.py	25 Sep 2002 18:39:17 -0000	1.13
***************
*** 102,108 ****
      # and every ham is a false positive.
      assert ham.nbuckets == spam.nbuckets
      fp = ham.n
      fn = 0
!     best_total = fp
      bests = [(0, fp, fn)]
      for i in range(ham.nbuckets):
--- 102,109 ----
      # and every ham is a false positive.
      assert ham.nbuckets == spam.nbuckets
+     fpw = options.best_cutoff_fp_weight
      fp = ham.n
      fn = 0
!     best_total = fpw * fp + fn
      bests = [(0, fp, fn)]
      for i in range(ham.nbuckets):
***************
*** 111,117 ****
          fp -= ham.buckets[i]
          fn += spam.buckets[i]
!         if fp + fn <= best_total:
!             if fp + fn < best_total:
!                 best_total = fp + fn
                  bests = []
              bests.append((i+1, fp, fn))
--- 112,119 ----
          fp -= ham.buckets[i]
          fn += spam.buckets[i]
!         total = fpw * fp + fn
!         if total <= best_total:
!             if total < best_total:
!                 best_total = total
                  bests = []
              bests.append((i+1, fp, fn))
***************
*** 121,128 ****
      i, fp, fn = bests.pop(0)
      print '-> best cutoff for', tag, float(i) / ham.nbuckets
!     print '->     with', fp, 'fp', '+', fn, 'fn =', best_total, 'mistakes'
      for i, fp, fn in bests:
!         print '->     matched at %g (%d fp + %d fn)' % (
!               float(i) / ham.nbuckets, fp, fn)

--- 123,135 ----
      i, fp, fn = bests.pop(0)
      print '-> best cutoff for', tag, float(i) / ham.nbuckets
!     print '->     with weighted total %g*%d fp + %d fn = %g' % (
!           fpw, fp, fn, best_total)
!     print '->     fp rate %.3g%%  fn rate %.3g%%' % (
!           fp * 1e2 / ham.n, fn * 1e2 / spam.n)
      for i, fp, fn in bests:
!         print ('->     matched at %g with %d fp & %d fn; '
!                'fp rate %.3g%%; fn rate %.3g%%' % (
!                float(i) / ham.nbuckets, fp, fn,
!                fp * 1e2 / ham.n, fn * 1e2 / spam.n))
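
For readers without a Python 2 interpreter, the new report lines can be
approximated in Python 3 as below.  This is a hedged sketch: format_report
and its parameters are hypothetical names, the winners are passed in as
(cutoff, fp, fn) tuples instead of bucket indices, and the sample numbers
are invented.

```python
def format_report(tag, bests, best_total, fp_weight, n_ham, n_spam):
    """Return report lines for a non-empty list of (cutoff, fp, fn)."""
    cutoff, fp, fn = bests[0]
    lines = [
        "-> best cutoff for %s %s" % (tag, cutoff),
        "->     with weighted total %g*%d fp + %d fn = %g"
        % (fp_weight, fp, fn, best_total),
        # Rates are percentages of all ham and all spam respectively.
        "->     fp rate %.3g%%  fn rate %.3g%%"
        % (fp * 1e2 / n_ham, fn * 1e2 / n_spam),
    ]
    # Any remaining entries tied with the first on weighted total.
    for cutoff, fp, fn in bests[1:]:
        lines.append(
            "->     matched at %g with %d fp & %d fn; "
            "fp rate %.3g%%; fn rate %.3g%%"
            % (cutoff, fp, fn, fp * 1e2 / n_ham, fn * 1e2 / n_spam)
        )
    return lines


# Invented example: two cutoffs tie at a weighted total of 5.
report = format_report(
    "all runs", [(0.5, 2, 3), (0.75, 1, 4)], 5, 1, n_ham=100, n_spam=100
)
print("\n".join(report))
```

Like the patched code, the sketch divides by the ham and spam counts, so
it would fail on an empty ham or spam set.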