[Spambayes-checkins] spambayes Options.py,1.35,1.36 classifier.py,1.22,1.23

Fri, 27 Sep 2002 20:41:12 -0700

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv6007

Modified Files:
	Options.py classifier.py 
Log Message:
Gary Robinson has changed the formula he uses to adjust the Graham
probabilities since we first implemented it.  The new formula computes
exactly the same values as the old one, but it looks a little different
and is easier to understand.  As a result,

     robinson_probability_a

no longer exists, and

     robinson_probability_s

takes its place (the "s" is for "strength").  If you used non-default
values of a and/or x before, x doesn't change, but you should set

     robinson_probability_s

to robinson_probability_a / robinson_probability_x.

For example, before this checkin, the defaults were a=0.225 and x=0.5.
Now 'a' is gone, and s defaults to 0.225/0.5 = 0.45.  Computed results
are identical.
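
As a quick sanity check on the claim that the two forms agree, here is a
minimal sketch.  The function names (adjust_old, adjust_new, s_from_a)
are made up for illustration and do not appear in classifier.py; only
the two formulas and the default values come from this checkin.

```python
def adjust_old(p, n, a=0.225, x=0.5):
    """Pre-checkin form: f(w) = (a + n*p(w)) / (a/x + n)."""
    return (a + n * p) / (a / x + n)


def adjust_new(p, n, s=0.45, x=0.5):
    """Post-checkin form: f(w) = (s*x + n*p(w)) / (s + n)."""
    return (s * x + n * p) / (s + n)


def s_from_a(a, x):
    """Convert an old robinson_probability_a setting to the new s (hypothetical helper)."""
    return a / x


def max_difference(a, x):
    """Largest gap between the two forms over a small grid of p and n."""
    s = s_from_a(a, x)
    worst = 0.0
    for n in (1, 2, 5, 50, 1000):
        for p in (0.0, 0.01, 0.5, 0.99, 1.0):
            gap = abs(adjust_old(p, n, a, x) - adjust_new(p, n, s, x))
            worst = max(worst, gap)
    return worst


if __name__ == "__main__":
    # Old defaults, then an arbitrary non-default pair of settings.
    print(max_difference(0.225, 0.5))
    print(max_difference(0.3, 0.4))
```

Both printed differences should be zero up to floating-point rounding,
since substituting s = a/x into the new form gives back the old one.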

Sorry for the hassle, but Gary's webpage does a very nice job of
explaining this formula, and I really don't want to reword it all for
this project -- keeping an obvious connection between our implementation
and Gary's explanation is worth the disruption.

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** Options.py	27 Sep 2002 22:29:56 -0000	1.35
--- Options.py	28 Sep 2002 03:41:10 -0000	1.36
***************
*** 179,194 ****
  # seen before.  Nobody has reported an improvement via moving it away
  # from 1/2.
! # "a" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting.  At a=0, the counting estimates
  # are believed 100%, even to the extent of assigning certainty (0 or 1)
  # to a word that's appeared in only ham or only spam.  This is a disaster.
! # As "a" tends toward infintity, all probabilities tend toward "x".  All
! # reports were that a value near 0.2 worked best, so this doesn't seem to
  # be corpus-dependent.
! # XXX Gary Robinson has since renamed "a" to "s", and redone his formulas
! # XXX to make it a measure of belief strength rather than "a number" from
! # XXX 0 to infinity.  We haven't caught up to that yet.
! robinson_probability_a: 0.225
  robinson_probability_x: 0.5

  # When scoring a message, ignore all words with
--- 179,194 ----
  # seen before.  Nobody has reported an improvement via moving it away
  # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting.  At s=0, the counting estimates
  # are believed 100%, even to the extent of assigning certainty (0 or 1)
  # to a word that's appeared in only ham or only spam.  This is a disaster.
! # As s tends toward infintity, all probabilities tend toward x.  All
! # reports were that a value near 0.4 worked best, so this doesn't seem to
  # be corpus-dependent.
! # NOTE:  Gary Robinson previously used a different formula involving 'a'
! # and 'x'.  The 'x' here is the same as before.  The 's' here is the old
! # 'a' divided by 'x'.
  robinson_probability_x: 0.5
+ robinson_probability_s: 0.45

  # When scoring a message, ignore all words with
***************
*** 254,259 ****
                    },
      'Classifier': {'max_discriminators': int_cracker,
-                    'robinson_probability_a': float_cracker,
                     'robinson_probability_x': float_cracker,
                     'robinson_minimum_prob_strength': float_cracker,

--- 254,259 ----
                    },
      'Classifier': {'max_discriminators': int_cracker,
                     'robinson_probability_x': float_cracker,
+                    'robinson_probability_s': float_cracker,
                     'robinson_minimum_prob_strength': float_cracker,

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** classifier.py	27 Sep 2002 22:29:56 -0000	1.22
--- classifier.py	28 Sep 2002 03:41:10 -0000	1.23
***************
*** 228,234 ****
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
!         A = options.robinson_probability_a
!         X = options.robinson_probability_x
!         AoverX = A/X
          for word, record in self.wordinfo.iteritems():
              # Compute prob(msg is spam | msg contains word).
--- 228,233 ----
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
!         S = options.robinson_probability_s
!         StimesX = S * options.robinson_probability_x
          for word, record in self.wordinfo.iteritems():
              # Compute prob(msg is spam | msg contains word).
***************
*** 248,257 ****
              # Now do Robinson's Bayesian adjustment.
              #
!             #         a + (n * p(w))
!             # f(w) = ---------------
!             #          (a / x) + n

              n = hamcount + spamcount
!             prob = (A + n * prob) / (AoverX + n)

              if record.spamprob != prob:
--- 247,256 ----
              # Now do Robinson's Bayesian adjustment.
              #
!             #         s*x + n*p(w)
!             # f(w) = --------------
!             #           s + n

              n = hamcount + spamcount
!             prob = (StimesX + n * prob) / (S + n)

              if record.spamprob != prob:
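
The comments in Options.py describe how s trades off the prior x against
the counting estimate.  A rough standalone sketch of that behavior
follows; robinson_adjust is a hypothetical name, not a function in
classifier.py, and the sample values of p and n are invented.

```python
def robinson_adjust(p, n, s=0.45, x=0.5):
    """f(w) = (s*x + n*p(w)) / (s + n), as in the new classifier.py comment."""
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # A word seen once, only in spam: the raw counting estimate is p = 1.0.
    p, n = 1.0, 1

    # s = 0: the counting estimate is believed completely (certainty).
    print(robinson_adjust(p, n, s=0.0))

    # Default s: the single sighting is pulled back toward x.
    print(robinson_adjust(p, n))

    # Very large s: the result is essentially x, whatever the counts say.
    print(robinson_adjust(p, n, s=1e9))

    # More evidence (larger n) moves the result back toward p.
    print(robinson_adjust(p, 100))
```

Note that s = 0 combined with n = 0 would divide by zero, so the sketch
only uses s = 0 with a word that has been seen at least once.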