[Spambayes-checkins]
spambayes Options.py,1.32,1.33 TestDriver.py,1.12,1.13
Tim Peters
tim_one@users.sourceforge.net
Wed, 25 Sep 2002 11:39:22 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv15075
Modified Files:
Options.py TestDriver.py
Log Message:
New option best_cutoff_fp_weight. The histogram analysis code now
finds the cutoffs (histogram bucket boundaries) that minimize
best_cutoff_fp_weight * (# false positives) + (# false negatives)
The default weight is 1, which minimizes the total # of misclassified
msgs. If, e.g., you're happy to endure 100 false negatives to save
1 false positive, set it to 100. Don't be surprised if your f-n rate
zooms, though!
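As a rough illustration of the weighted search described above, here is a
minimal standalone sketch in Python 3. It is not the TestDriver.py code
itself: the function name best_cutoffs, the plain-list bucket arguments, and
the sample bucket counts are assumptions made up for this example. The
checked-in code works on histogram objects and reads the weight from the
options.

```python
def best_cutoffs(ham_buckets, spam_buckets, fp_weight=1.0):
    """Return (best_total, [(cutoff, fp, fn), ...]) over bucket boundaries.

    Hypothetical helper: ham_buckets[i] and spam_buckets[i] are message
    counts in score bucket i, with scores increasing with i.
    """
    assert len(ham_buckets) == len(spam_buckets)
    nbuckets = len(ham_buckets)

    # At cutoff 0 everything is called spam, so every ham is a false
    # positive and there are no false negatives.
    fp = sum(ham_buckets)
    fn = 0
    best_total = fp_weight * fp + fn
    bests = [(0.0, fp, fn)]

    for i in range(nbuckets):
        # Moving the cutoff past bucket i: its ham stops being fp,
        # and its spam becomes fn.
        fp -= ham_buckets[i]
        fn += spam_buckets[i]
        total = fp_weight * fp + fn
        if total <= best_total:
            if total < best_total:
                # Strictly better: forget earlier ties.
                best_total = total
                bests = []
            bests.append(((i + 1) / nbuckets, fp, fn))
    return best_total, bests


# Made-up bucket counts, 10 ham and 10 spam in 5 buckets.
ham = [5, 3, 1, 1, 0]
spam = [0, 1, 1, 3, 5]

print(best_cutoffs(ham, spam))         # (3.0, [(0.4, 2, 1), (0.6, 1, 2)])
print(best_cutoffs(ham, spam, 100.0))  # (5.0, [(0.8, 0, 5)])
```

With the weight at 1 two cutoffs tie at 3 mistakes; with the weight at 100
the search moves the cutoff up until no ham is misclassified, at the price
of 5 false negatives out of 10 spam.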
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** Options.py 25 Sep 2002 05:22:47 -0000 1.32
--- Options.py 25 Sep 2002 18:39:17 -0000 1.33
***************
*** 102,108 ****
# well as 0.90 on Tim's large c.l.py data).
# For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far. Note that you can easily deduce the effect
! # of setting spam_cutoff to any particular value by studying the score
! # histograms -- there's no need to run a test again to see what would happen.
spam_cutoff: 0.90
--- 102,106 ----
# well as 0.90 on Tim's large c.l.py data).
# For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far.
spam_cutoff: 0.90
***************
*** 111,119 ****
show_histograms: True
! # When compute_best_cutoffs_from_histograms is enabled, after the display
! # of a ham+spam histogram pair, a listing is given of all the cutoff scores
! # (coinciding with a histogram boundary) that minimize the total number of
! # misclassified messages (false positives + false negatives).
compute_best_cutoffs_from_histograms: True
# Display spam when
--- 109,127 ----
show_histograms: True
! # After the display of a ham+spam histogram pair, you can get a listing of
! # all the cutoff values (coinciding with histogram bucket boundaries) that
! # minimize
! #
! # best_cutoff_fp_weight * (# false positives) + (# false negatives)
! #
! # By default, best_cutoff_fp_weight is 1, and so the cutoffs that minimize
! # the total number of misclassified messages (fp+fn) are shown. If you hate
! # fp more than fn, set the weight to something larger than 1. For example,
! # if you're willing to endure 100 false negatives to save 1 false positive,
! # set it to 100.
! # Note: You may wish to increase nbuckets, to give this scheme more cutoff
! # values to analyze.
compute_best_cutoffs_from_histograms: True
+ best_cutoff_fp_weight: 1
# Display spam when
***************
*** 254,257 ****
--- 262,266 ----
'ham_directories': string_cracker,
'compute_best_cutoffs_from_histograms': boolean_cracker,
+ 'best_cutoff_fp_weight': float_cracker,
},
'Classifier': {'hambias': float_cracker,
Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** TestDriver.py 25 Sep 2002 05:22:47 -0000 1.12
--- TestDriver.py 25 Sep 2002 18:39:17 -0000 1.13
***************
*** 102,108 ****
# and every ham is a false positive.
assert ham.nbuckets == spam.nbuckets
fp = ham.n
fn = 0
! best_total = fp
bests = [(0, fp, fn)]
for i in range(ham.nbuckets):
--- 102,109 ----
# and every ham is a false positive.
assert ham.nbuckets == spam.nbuckets
+ fpw = options.best_cutoff_fp_weight
fp = ham.n
fn = 0
! best_total = fpw * fp + fn
bests = [(0, fp, fn)]
for i in range(ham.nbuckets):
***************
*** 111,117 ****
fp -= ham.buckets[i]
fn += spam.buckets[i]
! if fp + fn <= best_total:
! if fp + fn < best_total:
! best_total = fp + fn
bests = []
bests.append((i+1, fp, fn))
--- 112,119 ----
fp -= ham.buckets[i]
fn += spam.buckets[i]
! total = fpw * fp + fn
! if total <= best_total:
! if total < best_total:
! best_total = total
bests = []
bests.append((i+1, fp, fn))
***************
*** 121,128 ****
i, fp, fn = bests.pop(0)
print '-> best cutoff for', tag, float(i) / ham.nbuckets
! print '-> with', fp, 'fp', '+', fn, 'fn =', best_total, 'mistakes'
for i, fp, fn in bests:
! print '-> matched at %g (%d fp + %d fn)' % (
! float(i) / ham.nbuckets, fp, fn)
--- 123,135 ----
i, fp, fn = bests.pop(0)
print '-> best cutoff for', tag, float(i) / ham.nbuckets
! print '-> with weighted total %g*%d fp + %d fn = %g' % (
! fpw, fp, fn, best_total)
! print '-> fp rate %.3g%% fn rate %.3g%%' % (
! fp * 1e2 / ham.n, fn * 1e2 / spam.n)
for i, fp, fn in bests:
! print ('-> matched at %g with %d fp & %d fn; '
! 'fp rate %.3g%%; fn rate %.3g%%' % (
! float(i) / ham.nbuckets, fp, fn,
! fp * 1e2 / ham.n, fn * 1e2 / spam.n))
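For readers who want to see what the new report lines look like, the
following Python 3 sketch reuses the format strings from the diff above.
The function name format_best_cutoff and its arguments are hypothetical,
chosen only for this example; the real code prints directly (Python 2 print
statements) from the histogram objects.

```python
def format_best_cutoff(tag, cutoff, fp, fn, n_ham, n_spam, fp_weight=1.0):
    """Build the report lines for one best cutoff (hypothetical helper)."""
    total = fp_weight * fp + fn
    return [
        '-> best cutoff for %s %s' % (tag, cutoff),
        '-> with weighted total %g*%d fp + %d fn = %g' % (
            fp_weight, fp, fn, total),
        # Rates are percentages of all ham and all spam respectively.
        '-> fp rate %.3g%% fn rate %.3g%%' % (
            fp * 1e2 / n_ham, fn * 1e2 / n_spam),
    ]


# Made-up numbers: 0 fp and 5 fn out of 10 ham and 10 spam, weight 100.
for line in format_best_cutoff('all runs', 0.8, 0, 5, 10, 10, 100.0):
    print(line)
```

Reporting the rates next to the weighted total matters once the weight is
not 1, because the weighted total alone no longer tells you how many
messages were actually misclassified.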