[Spambayes-checkins] spambayes Options.py,1.14,1.15 classifier.py,1.9,1.10

Sat, 14 Sep 2002 17:01:50 -0700

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6591

Modified Files:
	Options.py classifier.py 
Log Message:
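New boolean option adjust_probs_by_evidence_mass in the [Classifier]
section.  When enabled, each word's spam probability is pulled toward
0.5, and pulled less the more messages went into computing it.  See the
mailing list for details.  The option is off by default.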
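
For readers who want to try the option, the following is a minimal sketch
of switching it on from an ini-style override.  It uses the Python 3
standard-library configparser rather than the project's own option-loading
code in Options.py, and the USER_CONFIG text is a hypothetical example;
only the section name and option name are taken from this check-in.

```python
import configparser

# Hypothetical user override. Only the [Classifier] section name and the
# option name come from the check-in; everything else is illustrative.
USER_CONFIG = """
[Classifier]
adjust_probs_by_evidence_mass: True
"""

parser = configparser.ConfigParser()
parser.read_string(USER_CONFIG)

# fallback=False mirrors the default stated in the log message.
fiddle = parser.getboolean(
    "Classifier", "adjust_probs_by_evidence_mass", fallback=False
)
print(fiddle)
```

The name fiddle matches the local variable the classifier.py change reads
the option into.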


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** Options.py	14 Sep 2002 20:08:07 -0000	1.14
--- Options.py	15 Sep 2002 00:01:48 -0000	1.15
***************
*** 119,122 ****
--- 119,126 ----
  
  max_discriminators: 16
+ 
+ # Speculative change to allow giving probabilities more weight the more
+ # messages went into computing them.
+ adjust_probs_by_evidence_mass: False
  """
  
***************
*** 152,155 ****
--- 156,160 ----
                     'unknown_spamprob': float_cracker,
                     'max_discriminators': int_cracker,
+                    'adjust_probs_by_evidence_mass': boolean_cracker,
                     },
  }

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** classifier.py	13 Sep 2002 19:46:41 -0000	1.9
--- classifier.py	15 Sep 2002 00:01:48 -0000	1.10
***************
*** 547,550 ****
--- 547,551 ----
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
+         fiddle = options.adjust_probs_by_evidence_mass
          for word,record in self.wordinfo.iteritems():
              # Compute prob(msg is spam | msg contains word).
***************
*** 560,570 ****
                  prob = MAX_SPAMPROB
  
! 
! ##            if prob != 0.5:
! ##                confbias = 0.01 / (record.hamcount + record.spamcount)
! ##                if prob > 0.5:
! ##                    prob = max(0.5, prob - confbias)
! ##                else:
! ##                    prob = min(0.5, prob + confbias)
  
              if record.spamprob != prob:
--- 561,581 ----
                  prob = MAX_SPAMPROB
  
!             if fiddle:
!                 # Suppose two clues have spamprob 0.99.  Which one is better?
!                 # One reasonable guess is that it's the one derived from the
!                 # most data.  This code fiddles non-0.5 probabilities by
!                 # shrinking their distance to 0.5, but shrinking less the
!                 # more evidence went into computing them.  Note that if this
!                 # proves to work, it should allow getting rid of the
!                 # "cancelling evidence" complications in spamprob()
!                 # (two probs exactly the same distance from 0.5 are far
!                 # less common after this transformation; instead, spamprob()
!                 # will pick up on the clues with the most evidence backing
!                 # them up).
!                 dist = prob - 0.5
!                 if dist:
!                     sum = float(record.hamcount + record.spamcount)
!                     dist *= sum / (sum + 1.0)
!                     prob = 0.5 + dist
  
              if record.spamprob != prob:
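
The adjustment added to classifier.py above can be tried in isolation.
The following is a minimal standalone sketch of the same arithmetic; the
function name shrink_toward_half and its parameters are hypothetical and
do not appear in the check-in, which applies the change inline to each
word record.

```python
def shrink_toward_half(prob, hamcount, spamcount):
    """Pull prob toward 0.5, pulling less as the evidence count grows.

    Hypothetical helper mirroring the inline code in the diff: the
    distance from 0.5 is scaled by n / (n + 1), where n is the number
    of messages (ham plus spam) the word was seen in.
    """
    dist = prob - 0.5
    if dist:
        n = float(hamcount + spamcount)
        dist *= n / (n + 1.0)
    return 0.5 + dist


# Two clues with the same raw spamprob but different amounts of evidence.
weak = shrink_toward_half(0.99, hamcount=0, spamcount=1)
strong = shrink_toward_half(0.99, hamcount=9, spamcount=90)
print(weak, strong)
```

With one supporting message the 0.99 clue drops to roughly 0.745, while
with 99 messages it stays near 0.9851, so the better-supported clue ends
up farther from 0.5, which is the tie-breaking effect the code comment
describes.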