[Spambayes-checkins] spambayes/spambayes CostCounter.py, 1.3, 1.4 ImapUI.py, 1.12, 1.13 Options.py, 1.53, 1.54 ProxyUI.py, 1.10, 1.11

Sun May 25 18:00:56 EDT 2003

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv7707/spambayes

Modified Files:
	CostCounter.py ImapUI.py Options.py ProxyUI.py 
Log Message:
Update CostCounter to new options style.

Expose experimental ham/spam imbalance option to
pop3proxy/imapfilter users, and update doc to be easier to
understand (thanks to Paul Moore).

Add a couple of notes to incremental.txt.
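
For readers unfamiliar with the "new options style" named in the log message: the diff below replaces attribute access (options.ham_cutoff) with indexing by a (section, name) pair (options["Categorization", "ham_cutoff"]). The following is only a minimal sketch of how such tuple-keyed access can work. The TupleKeyedOptions class and the starting values are hypothetical stand-ins, not the real Options.py implementation, and the sketch is written for Python 3 whereas the code in the diff is Python 2.

```python
# Hypothetical stand-in for the real options object. Only the
# options[section, name] indexing style is taken from the diff.
class TupleKeyedOptions:
    def __init__(self, defaults):
        # defaults maps (section, name) tuples to values
        self._values = dict(defaults)

    def __getitem__(self, key):
        # options["A", "b"] arrives here as the tuple ("A", "b")
        section, name = key
        return self._values[(section, name)]

    def __setitem__(self, key, value):
        section, name = key
        if (section, name) not in self._values:
            # Reject misspelled names instead of silently adding them
            raise KeyError("unknown option %s:%s" % (section, name))
        self._values[(section, name)] = value


# Illustrative starting values only
options = TupleKeyedOptions({
    ("Categorization", "ham_cutoff"): 0.2,
    ("Categorization", "spam_cutoff"): 0.9,
})

# Same assignments as the test code at the end of the CostCounter.py diff
options["Categorization", "spam_cutoff"] = 0.7
options["Categorization", "ham_cutoff"] = 0.4
print(options["Categorization", "ham_cutoff"])
```

One consequence of this style, at least in the sketch, is that a misspelled option name fails at the point of use rather than creating a new attribute unnoticed.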

Index: CostCounter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/CostCounter.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** CostCounter.py	29 Jan 2003 03:23:34 -0000	1.3
--- CostCounter.py	26 May 2003 00:00:52 -0000	1.4
***************
*** 73,79 ****
          self._total += 1
          self._spam += 1
!         if scr < options.ham_cutoff:
              self._fn += 1
!         elif scr < options.spam_cutoff:
              self._unsure += 1
              self._unsurespam += 1
--- 73,79 ----
          self._total += 1
          self._spam += 1
!         if scr < options["Categorization", "ham_cutoff"]:
              self._fn += 1
!         elif scr < options["Categorization", "spam_cutoff"]:
              self._unsure += 1
              self._unsurespam += 1
***************
*** 84,90 ****
          self._total += 1
          self._ham += 1
!         if scr > options.spam_cutoff:
              self._fp += 1
!         elif scr > options.ham_cutoff:
              self._unsure += 1
              self._unsureham += 1
--- 84,90 ----
          self._total += 1
          self._ham += 1
!         if scr > options["Categorization", "spam_cutoff"]:
              self._fp += 1
!         elif scr > options["Categorization", "ham_cutoff"]:
              self._unsure += 1
              self._unsureham += 1
***************
*** 118,156 ****
      name = "Standard Cost"
      def spam(self, scr):
!         if scr < options.ham_cutoff:
!             self.total += options.best_cutoff_fn_weight
!         elif scr < options.spam_cutoff:
!             self.total += options.best_cutoff_unsure_weight

      def ham(self, scr):
!         if scr > options.spam_cutoff:
!             self.total += options.best_cutoff_fp_weight
!         elif scr > options.ham_cutoff:
!             self.total += options.best_cutoff_unsure_weight

  class FlexCostCounter(CostCounter):
      name = "Flex Cost"
      def _lambda(self, scr):
!         if scr < options.ham_cutoff:
              return 0
!         elif scr > options.spam_cutoff:
              return 1
          else:
!             return (scr - options.ham_cutoff) / (
!                       options.spam_cutoff - options.ham_cutoff)

      def spam(self, scr):
!         self.total += (1 - self._lambda(scr)) * options.best_cutoff_fn_weight

      def ham(self, scr):
!         self.total += self._lambda(scr) * options.best_cutoff_fp_weight

  class Flex2CostCounter(FlexCostCounter):
      name = "Flex**2 Cost"
      def spam(self, scr):
!         self.total += (1 - self._lambda(scr))**2 * options.best_cutoff_fn_weight

      def ham(self, scr):
!         self.total += self._lambda(scr)**2 * options.best_cutoff_fp_weight

  def default():
--- 118,161 ----
      name = "Standard Cost"
      def spam(self, scr):
!         if scr < options["Categorization", "ham_cutoff"]:
!             self.total += options["TestDriver", "best_cutoff_fn_weight"]
!         elif scr < options["Categorization", "spam_cutoff"]:
!             self.total += options["TestDriver", "best_cutoff_unsure_weight"]

      def ham(self, scr):
!         if scr > options["Categorization", "spam_cutoff"]:
!             self.total += options["TestDriver", "best_cutoff_fp_weight"]
!         elif scr > options["Categorization", "ham_cutoff"]:
!             self.total += options["TestDriver", "best_cutoff_unsure_weight"]

  class FlexCostCounter(CostCounter):
      name = "Flex Cost"
      def _lambda(self, scr):
!         if scr < options["Categorization", "ham_cutoff"]:
              return 0
!         elif scr > options["Categorization", "spam_cutoff"]:
              return 1
          else:
!             return (scr - options["Categorization", "ham_cutoff"]) / (
!                       options["Categorization", "spam_cutoff"] \
!                       - options["Categorization", "ham_cutoff"])

      def spam(self, scr):
!         self.total += (1 - self._lambda(scr)) * options["TestDriver",
!                                                         "best_cutoff_fn_weight"]

      def ham(self, scr):
!         self.total += self._lambda(scr) * options["TestDriver",
!                                                   "best_cutoff_fp_weight"]

  class Flex2CostCounter(FlexCostCounter):
      name = "Flex**2 Cost"
      def spam(self, scr):
!         self.total += (1 - self._lambda(scr))**2 * options["TestDriver",
!                                                            "best_cutoff_fn_weight"]

      def ham(self, scr):
!         self.total += self._lambda(scr)**2 * options["TestDriver",
!                                                      "best_cutoff_fp_weight"]

  def default():
***************
*** 182,186 ****
      cc.ham(0.5)
      cc.spam(0.5)
!     options.spam_cutoff=0.7
!     options.ham_cutoff=0.4
      print cc
--- 187,191 ----
      cc.ham(0.5)
      cc.spam(0.5)
!     options["Categorization", "spam_cutoff"]=0.7
!     options["Categorization", "ham_cutoff"]=0.4
      print cc
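
The Flex cost counters in the hunk above scale the penalty by how far a score sits between the two cutoffs, instead of charging a flat weight per zone. A small standalone sketch of that interpolation follows, written for Python 3. The plain dict standing in for the options object, the function names, and the cutoff and weight values are assumptions chosen for illustration; only the arithmetic mirrors FlexCostCounter.

```python
# A plain dict indexed by (section, name) stands in for the options
# object; all values here are illustrative.
options = {
    ("Categorization", "ham_cutoff"): 0.4,
    ("Categorization", "spam_cutoff"): 0.7,
    ("TestDriver", "best_cutoff_fp_weight"): 10.0,
    ("TestDriver", "best_cutoff_fn_weight"): 1.0,
}


def flex_lambda(scr):
    """Map a score to 0 below ham_cutoff, 1 above spam_cutoff,
    and linearly in between."""
    ham_cutoff = options["Categorization", "ham_cutoff"]
    spam_cutoff = options["Categorization", "spam_cutoff"]
    if scr < ham_cutoff:
        return 0.0
    elif scr > spam_cutoff:
        return 1.0
    return (scr - ham_cutoff) / (spam_cutoff - ham_cutoff)


def flex_spam_cost(scr):
    # A spam scoring low looks like ham, so it costs close to a full
    # false negative; a spam scoring high costs nothing.
    return (1 - flex_lambda(scr)) * options["TestDriver", "best_cutoff_fn_weight"]


def flex_ham_cost(scr):
    # A ham scoring high looks like spam, so it costs close to a full
    # false positive; a ham scoring low costs nothing.
    return flex_lambda(scr) * options["TestDriver", "best_cutoff_fp_weight"]


for score in (0.1, 0.55, 0.95):
    print(score, flex_spam_cost(score), flex_ham_cost(score))
```

Flex2CostCounter differs only in squaring the lambda-derived factor, which makes scores near the "correct" cutoff cheaper still.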

Index: ImapUI.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImapUI.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** ImapUI.py	22 May 2003 05:21:16 -0000	1.12
--- ImapUI.py	26 May 2003 00:00:52 -0000	1.13
***************
*** 79,82 ****
--- 79,83 ----
      ('Categorization',      'ham_cutoff'),
      ('Categorization',      'spam_cutoff'),
+     ('Classifier',          'experimental_ham_spam_imbalance_adjustment'),
  )

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.53
retrieving revision 1.54
diff -C2 -d -r1.53 -r1.54
*** Options.py	14 May 2003 00:18:19 -0000	1.53
--- Options.py	26 May 2003 00:00:53 -0000	1.54
***************
*** 556,572 ****
       BOOLEAN, RESTORE),

!     ("experimental_ham_spam_imbalance_adjustment", "Correct for imbalanced ham/spam ratio", False,
!      """If the # of ham and spam in training data are out of balance, the
!      spamprob guesses can get stronger in the direction of the category
!      with more training msgs.  In one sense this must be so, since the more
!      data we have of one flavor, the more we know about that flavor.  But
!      that allows the accidental appearance of a strong word of that flavor
!      in a msg of the other flavor much more power than an accident in the
!      other direction.  Enable experimental_ham_spam_imbalance_adjustment if
!      you have more ham than spam training data (or more spam than ham), and
!      the Bayesian probability adjustment won't 'believe' raw counts more
!      than min(# ham trained on, # spam trained on) justifies.  I *expect*
!      this option will go away (and become the default), but people *with*
!      strong imbalance need to test it first.""",
       BOOLEAN, RESTORE),
    ),
--- 556,582 ----
       BOOLEAN, RESTORE),

!     # If the # of ham and spam in training data are out of balance, the
!     # spamprob guesses can get stronger in the direction of the category
!     # with more training msgs.  In one sense this must be so, since the more
!     # data we have of one flavor, the more we know about that flavor.  But
!     # that allows the accidental appearance of a strong word of that flavor
!     # in a msg of the other flavor much more power than an accident in the
!     # other direction.  Enable experimental_ham_spam_imbalance_adjustment if
!     # you have more ham than spam training data (or more spam than ham), and
!     # the Bayesian probability adjustment won't 'believe' raw counts more
!     # than min(# ham trained on, # spam trained on) justifies.  I *expect*
!     # this option will go away (and become the default), but people *with*
!     # strong imbalance need to test it first.
!     
!     ("experimental_ham_spam_imbalance_adjustment", "Compensate for unequal numbers of spam and ham", False,
!      """If your training database has significantly (3 times) more ham than
!      spam, or vice versa, you may start seeing an increase in incorrect
!      classifications (messages put in the wrong category, not just marked
!      as unsure). If so, this option allows you to compensate for this, at
!      the cost of increasing the number of messages classified as "unsure".
! 
!      Note that the effect is subtle, and you should experiment with both
!      settings to choose the option that suits you best. You do not have
!      to retrain your database if you change this option.""",
       BOOLEAN, RESTORE),
    ),
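
The comment retained above says that, with the option enabled, the Bayesian adjustment should not "believe" raw counts more than min(# ham trained on, # spam trained on) justifies. The classifier code itself is not part of this commit, so what follows is only a sketch of one way such a cap could be applied, written for Python 3. The function name, its parameters, and the default values for unknown_word_strength and unknown_word_prob are assumptions made for illustration.

```python
def adjusted_spamprob(hamcount, spamcount, nham, nspam,
                      imbalance_adjustment=False,
                      unknown_word_strength=0.45,  # assumed prior weight
                      unknown_word_prob=0.5):      # assumed prior guess
    """Estimate a word's spam probability from training counts.

    hamcount/spamcount: messages of each kind containing the word.
    nham/nspam: total messages of each kind trained on (both > 0).
    """
    if hamcount + spamcount == 0:
        # Never-seen word: fall back to the prior
        return unknown_word_prob

    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    raw = spamratio / (hamratio + spamratio)

    if imbalance_adjustment:
        # Scale down evidence from the over-represented category so it
        # carries no more weight than the smaller category could supply.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # n is the effective number of observations backing the raw estimate
    n = hamcount * spam2ham + spamcount * ham2spam
    s = unknown_word_strength
    return (s * unknown_word_prob + n * raw) / (s + n)


# Ten times more ham than spam; the word appeared in 30 ham, 0 spam.
plain = adjusted_spamprob(30, 0, nham=3000, nspam=300)
capped = adjusted_spamprob(30, 0, nham=3000, nspam=300,
                           imbalance_adjustment=True)
print(plain, capped)
```

In this sketch the capped estimate stays closer to the prior than the plain one. That fits the new help text, which says the option trades fewer outright misclassifications for more messages marked "unsure", and because the cap is applied when scoring rather than when training, it is also consistent with the note that no retraining is needed.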

Index: ProxyUI.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ProxyUI.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** ProxyUI.py	22 May 2003 05:21:16 -0000	1.10
--- ProxyUI.py	26 May 2003 00:00:53 -0000	1.11
***************
*** 104,107 ****
--- 104,108 ----
      ('Categorization',      'ham_cutoff'),
      ('Categorization',      'spam_cutoff'),
+     ('Classifier',          'experimental_ham_spam_imbalance_adjustment'),
  )