[Spambayes-checkins] spambayes Options.py,1.14,1.15 classifier.py,1.9,1.10
Tim Peters
tim_one@users.sourceforge.net
Sat, 14 Sep 2002 17:01:50 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6591
Modified Files:
Options.py classifier.py
Log Message:
New bool option [Classifier]adjust_probs_by_evidence_mass. See the
mailing list for details. By default, this is turned off.
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** Options.py 14 Sep 2002 20:08:07 -0000 1.14
--- Options.py 15 Sep 2002 00:01:48 -0000 1.15
***************
*** 119,122 ****
--- 119,126 ----
max_discriminators: 16
+
+ # Speculative change to allow giving probabilities more weight the more
+ # messages went into computing them.
+ adjust_probs_by_evidence_mass: False
"""
***************
*** 152,155 ****
--- 156,160 ----
'unknown_spamprob': float_cracker,
'max_discriminators': int_cracker,
+ 'adjust_probs_by_evidence_mass': boolean_cracker,
},
}
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** classifier.py 13 Sep 2002 19:46:41 -0000 1.9
--- classifier.py 15 Sep 2002 00:01:48 -0000 1.10
***************
*** 547,550 ****
--- 547,551 ----
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
+         fiddle = options.adjust_probs_by_evidence_mass
          for word,record in self.wordinfo.iteritems():
              # Compute prob(msg is spam | msg contains word).
***************
*** 560,570 ****
                  prob = MAX_SPAMPROB
!
! ## if prob != 0.5:
! ##     confbias = 0.01 / (record.hamcount + record.spamcount)
! ##     if prob > 0.5:
! ##         prob = max(0.5, prob - confbias)
! ##     else:
! ##         prob = min(0.5, prob + confbias)
              if record.spamprob != prob:
--- 561,581 ----
                  prob = MAX_SPAMPROB
!             if fiddle:
!                 # Suppose two clues have spamprob 0.99. Which one is better?
!                 # One reasonable guess is that it's the one derived from the
!                 # most data. This code fiddles non-0.5 probabilities by
!                 # shrinking their distance to 0.5, but shrinking less the
!                 # more evidence went into computing them. Note that if this
!                 # proves to work, it should allow getting rid of the
!                 # "cancelling evidence" complications in spamprob()
!                 # (two probs exactly the same distance from 0.5 are far
!                 # less common after this transformation; instead, spamprob()
!                 # will pick up on the clues with the most evidence backing
!                 # them up).
!                 dist = prob - 0.5
!                 if dist:
!                     sum = float(record.hamcount + record.spamcount)
!                     dist *= sum / (sum + 1.0)
!                     prob = 0.5 + dist
              if record.spamprob != prob:
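
The adjustment made in the classifier.py hunk above can be tried in isolation. What follows is a minimal standalone sketch, not code from this checkin: the function name adjust_prob_by_evidence_mass and its plain hamcount/spamcount arguments are hypothetical stand-ins for the fields of the record object, and the clamping to MAX_SPAMPROB that precedes the adjustment in the diff is left out.

```python
def adjust_prob_by_evidence_mass(prob, hamcount, spamcount):
    """Shrink prob's distance from 0.5 by the factor n / (n + 1).

    n is the total number of messages (ham + spam) behind the clue.
    Hypothetical helper for illustration; it mirrors the arithmetic
    in the diff but is not part of the spambayes code base.
    """
    dist = prob - 0.5
    if dist:
        n = float(hamcount + spamcount)
        # Few messages: factor near 0.5, strong pull toward 0.5.
        # Many messages: factor near 1.0, prob is left almost unchanged.
        dist *= n / (n + 1.0)
    return 0.5 + dist


if __name__ == "__main__":
    # The same raw spamprob backed by increasing amounts of evidence.
    for n in (1, 10, 99):
        print(n, adjust_prob_by_evidence_mass(0.99, 0, n))
```

Under these assumptions, a 0.99 clue seen in a single message moves to 0.745, while the same clue seen in 99 messages moves only to about 0.985, so two clues with the same raw spamprob no longer tie once the amount of evidence behind them differs.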