[Python-checkins] python/nondist/sandbox/spambayes classifier.py,1.8,1.9

tim_one@users.sourceforge.net
Wed, 04 Sep 2002 17:02:31 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv764

Modified Files:
	classifier.py 
Log Message:
A now-rare pure win: spamprob() now works harder to find more evidence when
competing 0.01 and 0.99 clues appear (a sketch of the idea follows the
results below).  Before in the left column, after in the right:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.075  0.025  won
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   1 times
tied 19 times
lost  0 times

total unique fp went from 9 to 7

false negative percentages
    0.909  0.764  won
    0.800  0.691  won
    1.091  0.981  won
    1.381  1.309  won
    1.491  1.418  won
    1.055  0.873  won
    0.945  0.800  won
    1.236  1.163  won
    1.564  1.491  won
    1.200  1.200  tied
    1.454  1.381  won
    1.599  1.454  won
    1.236  1.164  won
    0.800  0.655  won
    0.836  0.655  won
    1.236  1.163  won
    1.236  1.200  won
    1.055  0.982  won
    1.127  0.982  won
    1.381  1.236  won

won  19 times
tied  1 times
lost  0 times

total unique fn went from 284 to 260
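
For anyone skimming, here's a minimal standalone sketch of the cancellation
idea; the helper name cancel_extremes and the module-level constants are
illustrative only, not classifier.py's actual API:

    MIN_SPAMPROB = 0.01
    MAX_SPAMPROB = 0.99
    MAX_DISCRIMINATORS = 16

    def cancel_extremes(probs):
        # With Graham-style combining, P = prod(p_i), Q = prod(1 - p_i), and
        # the final score is P / (P + Q).  A (0.01, 0.99) pair multiplies
        # both P and Q by 0.01 * 0.99, so it leaves the score unchanged --
        # no evidence at all.  Pair off the extremes and keep only the
        # surplus of whichever flavor is more numerous.
        mins = [p for p in probs if p == MIN_SPAMPROB]
        maxs = [p for p in probs if p == MAX_SPAMPROB]
        rest = [p for p in probs if MIN_SPAMPROB < p < MAX_SPAMPROB]
        if len(mins) < len(maxs):
            shorter, longer = mins, maxs
        else:
            shorter, longer = maxs, mins
        tokeep = min(len(longer) - len(shorter), MAX_DISCRIMINATORS)
        return longer[:tokeep] + rest

    # e.g. cancel_extremes([0.01, 0.99, 0.99, 0.6]) -> [0.99, 0.6]

In the real change below, the surviving extreme clues are pushed into the
existing nbest heap via heapreplace() rather than returned as a list, and the
clues that cancel still get their killcount bumped and, when evidence is
requested, still appear in the returned clue list.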


Index: classifier.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/classifier.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** classifier.py	2 Sep 2002 20:57:46 -0000	1.8
--- classifier.py	5 Sep 2002 00:02:28 -0000	1.9
***************
*** 8,11 ****
--- 8,12 ----
  import time
  from heapq import heapreplace
+ from sets import Set
  
  # The count of each word in ham is artificially boosted by a factor of
***************
*** 63,66 ****
--- 64,87 ----
  # even gets away from that oddity (8 of each allows for graceful ties,
  # which favor ham).
+ #
+ # XXX That should be revisited.  Stripping HTML tags from plain text msgs
+ # XXX later addressed some of the same problem cases.  The best value for
+ # XXX MAX_DISCRIMINATORS remains unknown, but increasing it a lot is known
+ # XXX to hurt.
+ #
+ # A twist:  When staring at failures, it wasn't unusual to see the top
+ # discriminators *all* have values of MIN_SPAMPROB and MAX_SPAMPROB.  The
+ # math is such that one MIN_SPAMPROB exactly cancels out one MAX_SPAMPROB,
+ # yielding no info at all.  Then whichever flavor of clue happened to reach
+ # MAX_DISCRIMINATORS//2 + 1 occurrences first determined the final outcome,
+ # based on almost no real evidence.
+ #
+ # So spamprob() was changed to save lists of *all* MIN_SPAMPROB and
+ # MAX_SPAMPROB clues.  If their counts are equal, they're all ignored.
+ # Else the flavor with the smaller number of instances "cancels out" the
+ # same number of instances of the other flavor, and the remaining instances
+ # of the other flavor are fed into the probability computation.  This change
+ # was a pure win, lowering the false negative rate consistently, and it even
+ # managed to tickle a couple rare false positives into "not spam" territory.
  MAX_DISCRIMINATORS = 16
  
***************
*** 133,139 ****
          """
  
-         if self.DEBUG:
-             print "spamprob(%r)" % wordstream
- 
          # A priority queue to remember the MAX_DISCRIMINATORS best
          # probabilities, where "best" means largest distance from 0.5.
--- 154,157 ----
***************
*** 142,159 ****
          smallest_best = -1.0
  
          # Counting a unique word multiple times hurts, although counting one
          # at most two times had some benefit when UNKNOWN_SPAMPROB was 0.2.
          # When that got boosted to 0.5, counting more than once became
          # counterproductive.
!         unique_words = {}
! 
!         wordinfoget = self.wordinfo.get
!         now = time.time()
! 
!         for word in wordstream:
!             if word in unique_words:
!                 continue
!             unique_words[word] = 1
! 
              record = wordinfoget(word)
              if record is None:
--- 160,172 ----
          smallest_best = -1.0
  
+         wordinfoget = self.wordinfo.get
+         now = time.time()
+         mins = []   # all words w/ prob MIN_SPAMPROB
+         maxs = []   # all words w/ prob MAX_SPAMPROB
          # Counting a unique word multiple times hurts, although counting one
          # at most two times had some benefit when UNKNOWN_SPAMPROB was 0.2.
          # When that got boosted to 0.5, counting more than once became
          # counterproductive.
!         for word in Set(wordstream):
              record = wordinfoget(word)
              if record is None:
***************
*** 164,168 ****
  
              distance = abs(prob - 0.5)
!             if distance > smallest_best:
                  # Subtle:  we didn't use ">" instead of ">=" just to save
                  # calls to heapreplace().  The real intent is that if
--- 177,185 ----
  
              distance = abs(prob - 0.5)
!             if prob == MIN_SPAMPROB:
!                 mins.append((distance, prob, word, record))
!             elif prob == MAX_SPAMPROB:
!                 maxs.append((distance, prob, word, record))
!             elif distance > smallest_best:
                  # Subtle:  we didn't use ">" instead of ">=" just to save
                  # calls to heapreplace().  The real intent is that if
***************
*** 185,190 ****
          # to tend in part to cancel out distortions introduced earlier by
          # HAMBIAS.  Experiments will decide the issue.
!         if evidence:
!             clues = []
          prob_product = inverse_prob_product = 1.0
          for distance, prob, word, record in nbest:
--- 202,225 ----
          # to tend in part to cancel out distortions introduced earlier by
          # HAMBIAS.  Experiments will decide the issue.
!         clues = []
! 
!         # First cancel out competing extreme clues (see comment block at
!         # MAX_DISCRIMINATORS declaration -- this is a twist on Graham).
!         if mins or maxs:
!             if len(mins) < len(maxs):
!                 shorter, longer = mins, maxs
!             else:
!                 shorter, longer = maxs, mins
!             tokeep = min(len(longer) - len(shorter), MAX_DISCRIMINATORS)
!             # They're all good clues, but we're only going to feed the tokeep
!             # initial clues from the longer list into the probability
!             # computation.
!             for dist, prob, word, record in shorter + longer[tokeep:]:
!                 record.killcount += 1
!                 if evidence:
!                     clues.append((word, prob))
!             for x in longer[:tokeep]:
!                 heapreplace(nbest, x)
! 
          prob_product = inverse_prob_product = 1.0
          for distance, prob, word, record in nbest:
***************
*** 195,200 ****
              if evidence:
                  clues.append((word, prob))
-             if self.DEBUG:
-                 print 'nbest P(%r) = %g' % (word, prob)
              prob_product *= prob
              inverse_prob_product *= 1.0 - prob
--- 230,233 ----
***************
*** 298,302 ****
      #
      #    total unique fn went from 292 to 162
! 
      def xspamprob(self, wordstream, evidence=False):
          """Return best-guess probability that wordstream is spam.
--- 331,337 ----
      #
      #    total unique fn went from 292 to 162
!     #
!     # XXX This needs to be updated to incorporate the "cancel out competing
!     # XXX extreme clues" twist.
      def xspamprob(self, wordstream, evidence=False):
          """Return best-guess probability that wordstream is spam.