[Spambayes-checkins] spambayes rmspik.py,1.1,1.2
Tim Peters
tim_one@users.sourceforge.net
Sat, 05 Oct 2002 22:24:12 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv13405
Modified Files:
rmspik.py
Log Message:
The module docstring makes some sense now.
Added horizontal whitespace to overly busy expressions.
Added XXX comment about chance()'s problems with the original
use_central_limit.
Slashed the number of int->float conversions needed by chance().
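
For readers without the file at hand, the sketch below transcribes the revision 1.2 chance() from the diff that follows and sets it beside a variant suggested by the XXX comment. It is written for Python 3, whereas the original module is Python 2. The name chance_abs and its exact shape are assumptions: the commit only says that x = abs(x)/sqrt(2) "works very well", it does not show that code. The formula itself looks like the first two terms of the standard asymptotic expansion of erfc(x), so math.erfc is used here as a reference value.

```python
import math


def chance(x):
    """Revision 1.2 of chance(), transcribed from the diff."""
    if x >= 0:
        return 1.0
    x = -x / math.sqrt(2.0)
    if x < 1.4:
        return 1.0
    # exp(-x^2) / (x * sqrt(pi)) * (1 - 1/(2 x^2)): two-term tail estimate.
    pre = math.exp(-x**2) / math.sqrt(math.pi) / x
    post = 1.0 - (1.0 / (2.0 * x**2))
    return pre * post


def chance_abs(x):
    """Hypothetical variant: one reading of the XXX comment, not committed code."""
    x = abs(x) / math.sqrt(2.0)  # sign of the z-score is ignored
    if x < 1.4:
        return 1.0
    pre = math.exp(-x**2) / math.sqrt(math.pi) / x
    post = 1.0 - (1.0 / (2.0 * x**2))
    return pre * post


def reference(x):
    """Two-sided normal tail probability for a z-score, via math.erfc."""
    return math.erfc(abs(x) / math.sqrt(2.0))
```

With these definitions, chance() returns 1.0 for every non-negative z-score, while chance_abs() treats +z and -z alike. Both track reference() more closely as the magnitude of the z-score grows.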
Index: rmspik.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/rmspik.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** rmspik.py 5 Oct 2002 23:45:59 -0000 1.1
--- rmspik.py 6 Oct 2002 05:24:10 -0000 1.2
***************
*** 5,39 ****
"""Usage: %(program)s [options] [central_limit_pickle_file]
! An example analysis program showing to access info from a central-limit
! pickle file created by clgen.py. This program produces histograms of
! various things.
!
! Scores for all predictions are saved at the end of binary pickle clim.pik.
! This contains two lists of tuples, the first list with a tuple for every
! ham predicted, the second list with a tuple for every spam predicted. Each
! tuple has these values:
!
!     tag         the msg identifier
!     is_spam     True if msg came from a spam Set, False if from a ham Set
!     zham        the msg zscore relative to the population ham
!     zspam       the msg zscore relative to the population spam
!     hmean       the raw mean ham score
!     smean       the raw mean spam score
!     n           the number of clues used to judge this msg
!
! Note that hmean and smean are the same under use_central_limit; they're
! very likely to differ under use_central_limit2.
!
! Where:
-h
Show usage and exit.
If no file is named on the cmdline, clim.pik is used.
"""
! surefactor = 1000 # This is basically the inverse of the accepted fp/fn rate
! punsure = False # Print unsure decisions (otherwise only sure-but-false)
! import sys,math,os
import cPickle as pickle
--- 5,23 ----
"""Usage: %(program)s [options] [central_limit_pickle_file]
! Options
-h
Show usage and exit.
+ Analyzes a pickle produced by clgen.py, and displays what would happen
+ if Rob Hooft's "RMS ZScore" scheme had been used to determine certainty
+ instead.
+
If no file is named on the cmdline, clim.pik is used.
"""
! surefactor = 1000 # This is basically the inverse of the accepted fp/fn rate
! punsure = False # Print unsure decisions (otherwise only sure-but-false)
! import sys, math, os
import cPickle as pickle
***************
*** 49,62 ****
  def chance(x):
!     if x>=0:
          return 1.0
!     x=-x/math.sqrt(2)
!     if x<1.4:
          return 1.0
!     assert x>=1.4
!     x=float(x)
!     pre=math.exp(-x**2)/math.sqrt(math.pi)/x
!     post=1-(1/(2*x**2))
!     return pre*post
  knownfalse = {}
--- 33,47 ----
  def chance(x):
!     # XXX These 3 lines are a disaster for spam using the original
!     # use_central_limit. Replacing with x = abs(x)/sqrt(2) works
!     # very well then.
!     if x >= 0:
          return 1.0
!     x = -x / math.sqrt(2.0)
!     if x < 1.4:
          return 1.0
!     pre = math.exp(-x**2) / math.sqrt(math.pi) / x
!     post = 1.0 - (1.0 / (2.0 * x**2))
!     return pre * post
  knownfalse = {}
***************
*** 79,82 ****
--- 64,77 ----
if bn in knownfalse:
print " ==>", knownfalse[bn]
+
+ # Pickle tuple contents:
+ #
+ # 0  tag      the msg identifier
+ # 1  is_spam  True if msg came from a spam Set, False if from a ham Set
+ # 2  zham     the msg zscore relative to the population ham
+ # 3  zspam    the msg zscore relative to the population spam
+ # 4  hmean    the raw mean ham score
+ # 5  smean    the raw mean spam score
+ # 6  n        the number of clues used to judge this msg
def drive(fname):
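
The tuple layout documented in the comment above can be given names instead of bare indices. The following is a minimal Python 3 sketch, not code from rmspik.py: the Score type and the wrap() helper are hypothetical, and the sample record is invented purely for illustration.

```python
from collections import namedtuple

# Field order follows the "Pickle tuple contents" comment (indices 0..6).
Score = namedtuple("Score", "tag is_spam zham zspam hmean smean n")


def wrap(records):
    """Turn raw 7-tuples into Score records so fields can be read by name."""
    return [Score(*rec) for rec in records]


# Invented sample data, only to show the access pattern.
sample = [("msg-0001", False, -0.5, -12.0, 0.2, 0.2, 30)]
scores = wrap(sample)
first = scores[0]
```

Here first.zham and first[2] refer to the same value, so code that indexes by position keeps working.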