[Spambayes-checkins] spambayes/spambayes Options.py, 1.86, 1.87 classifier.py, 1.10, 1.11

Sat Dec 13 23:12:34 EST 2003

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv20174/spambayes

Modified Files:
	Options.py classifier.py 
Log Message:
Removed support code for the defunct
experimental_ham_spam_imbalance_adjustment option, and fiddled some docs
accordingly.  Options.py still knows about it, as do various UI components
building on Options.py, to avoid breaking anything that's using it.
Unsure how to get rid of it completely.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.86
retrieving revision 1.87
diff -C2 -d -r1.86 -r1.87
*** Options.py	14 Dec 2003 01:34:42 -0000	1.86
--- Options.py	14 Dec 2003 04:12:32 -0000	1.87
***************
*** 409,413 ****
      # than min(# ham trained on, # spam trained on) justifies.  I *expect*
      # this option will go away (and become the default), but people *with*
!     # strong imbalance need to test it first.
  
      ("experimental_ham_spam_imbalance_adjustment", "Compensate for unequal numbers of spam and ham", False,
--- 409,416 ----
      # than min(# ham trained on, # spam trained on) justifies.  I *expect*
      # this option will go away (and become the default), but people *with*
!     # strong imbalance need to test it first.\
!     # LATER:  this option sucked, creating more problems than it solved.
!     # XXX The code in classifier.py is gone now.  How can we get rid of
!     # XXX the option gracefully?
  
      ("experimental_ham_spam_imbalance_adjustment", "Compensate for unequal numbers of spam and ham", False,

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/classifier.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** classifier.py	15 Sep 2003 01:08:11 -0000	1.10
--- classifier.py	14 Dec 2003 04:12:32 -0000	1.11
***************
*** 248,257 ****
          prob = spamratio / (hamratio + spamratio)
  
-         if options["Classifier", "experimental_ham_spam_imbalance_adjustment"]:
-             spam2ham = min(nspam / nham, 1.0)
-             ham2spam = min(nham / nspam, 1.0)
-         else:
-             spam2ham = ham2spam = 1.0
- 
          S = options["Classifier", "unknown_word_strength"]
          StimesX = S * options["Classifier", "unknown_word_prob"]
--- 248,251 ----
***************
*** 274,296 ****
          # less so the larger n is, or the smaller s is.
  
!         # Experimental:
!         # Picking a good value for n is interesting:  how much empirical
!         # evidence do we really have?  If nham == nspam,
!         # hamcount + spamcount makes a lot of sense, and the code here
!         # does that by default.
!         # But if, e.g., nham is much larger than nspam, p(w) can get a
!         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!         # strong ham words (high hamcount) much stronger than strong
!         # spam words (high spamcount), and that makes the accidental
!         # appearance of a strong ham word in spam much more damaging than
!         # the accidental appearance of a strong spam word in ham.
!         # So we don't give hamcount full credit when nham > nspam (or
!         # spamcount when nspam > nham):  instead we knock hamcount down
!         # to what it would have been had nham been equal to nspam.  IOW,
!         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!         # we don't "believe" any count to an extent more than
!         # min(nspam, nham) justifies.
! 
!         n = hamcount * spam2ham  +  spamcount * ham2spam
          prob = (StimesX + n * prob) / (S + n)
  
--- 268,272 ----
          # less so the larger n is, or the smaller s is.
  
!         n = hamcount + spamcount
          prob = (StimesX + n * prob) / (S + n)
  
***************
*** 380,384 ****
          this point.  Introduced to fix bug #797890."""
          pass
!     
      def _getclues(self, wordstream):
          mindist = options["Classifier", "minimum_prob_strength"]
--- 356,360 ----
          this point.  Introduced to fix bug #797890."""
          pass
! 
      def _getclues(self, wordstream):
          mindist = options["Classifier", "minimum_prob_strength"]
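
For readers comparing the two revisions, the following is a minimal
standalone sketch of the per-word probability calculation touched by the
classifier.py hunks above, with the removed adjustment kept behind a flag.
It is not the SpamBayes code itself: the function name word_prob, the
adjust_imbalance flag, the zero-count guard, and the default values for
s and x (standing in for unknown_word_strength and unknown_word_prob) are
assumptions made for illustration.

```python
def word_prob(hamcount, spamcount, nham, nspam,
              s=0.45, x=0.5, adjust_imbalance=False):
    """Sketch of the adjusted word probability from the diff above.

    hamcount/spamcount: messages of each kind containing the word.
    nham/nspam: total messages of each kind trained on (must be > 0).
    s, x: illustrative stand-ins for unknown_word_strength and
          unknown_word_prob (assumed values, not read from Options.py).
    adjust_imbalance: True mimics the removed r1.10 behaviour,
                      False mimics r1.11.
    """
    if hamcount == 0 and spamcount == 0:
        # Guard added for this sketch only: no evidence, fall back to x.
        return x

    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)

    if adjust_imbalance:
        # r1.10: scale each count down to what min(nham, nspam) justifies.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        # r1.11: every count gets full credit.
        spam2ham = ham2spam = 1.0

    n = hamcount * spam2ham + spamcount * ham2spam
    return (s * x + n * prob) / (s + n)


if __name__ == "__main__":
    # A ham-only word with ten times more ham than spam trained.
    print(word_prob(10, 0, nham=1000, nspam=100))
    print(word_prob(10, 0, nham=1000, nspam=100, adjust_imbalance=True))
```

With balanced training data (nham == nspam) both settings give the same
result; they differ only when one class outnumbers the other, in which
case the old adjustment shrinks n and pulls the estimate back toward x.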