[Spambayes-checkins] spambayes/spambayes Options.py, 1.92, 1.93 classifier.py, 1.17, 1.18

Tim Peters tim_one at users.sourceforge.net
Wed Dec 17 00:43:44 EST 2003


Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv524/spambayes

Modified Files:
	Options.py classifier.py 
Log Message:
Fulfilling a promise to write useful comments for the x-use_bigrams option.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.92
retrieving revision 1.93
diff -C2 -d -r1.92 -r1.93
*** Options.py	16 Dec 2003 05:06:34 -0000	1.92
--- Options.py	17 Dec 2003 05:43:42 -0000	1.93
***************
*** 429,446 ****
  
      ("x-use_bigrams", "(EXPERIMENTAL) Use mixed uni/bi-grams scheme", False,
!      """Enabling this option means that SpamBayes will generate both
!      unigrams (words) and bigrams (pairs of words). However, Graham's
!      scheme is also used, where, for each clue triplet of two unigrams and
!      the bigram they make, the clue that is strongest (i.e. has a
!      probability furthest from 0.5) is used, and the other two are not.
  
       Note that to really test this option you need to retrain with it on,
       so that your database includes the bigrams - if you subsequently turn
!      it off, these tokens will have no effect.
  
!      Note also that you should probably increase the max_discriminators
!      (Maximum number of extreme words) option if you enable this option;
!      this may need to be doubled or quadrupled to see the benefit from the
!      bigrams.
  
       This option is experimental, and may be removed in a future release.
--- 429,452 ----
  
      ("x-use_bigrams", "(EXPERIMENTAL) Use mixed uni/bi-grams scheme", False,
!      """Generate both unigrams (words) and bigrams (pairs of words).
!      However, extending an idea originally from Gary Robinson, the message
!      is 'tiled' into non-overlapping unigrams and bigrams, approximating
!      the strongest outcome over all possible tilings.
  
       Note that to really test this option you need to retrain with it on,
       so that your database includes the bigrams - if you subsequently turn
!      it off, these tokens will have no effect.  This option will at least
!      double your database size given the same training data, and will
!      probably at least triple it.
  
!      You may also wish to increase the max_discriminators (maximum number
!      of extreme words) option if you enable this one, perhaps doubling or
!      quadrupling it.  It's not yet clear.  Bigrams create many more hapaxes,
!      and that seems to increase the brittleness of minimalist training
!      regimes; increasing max_discriminators may help to soften that effect.
!      OTOH, max_discriminators defaults to 150 in part because that makes it
!      easy to prove that the chi-squared math is immune from numeric
!      problems.  Increase it too much, and insane results will eventually
!      follow (including fatal floating-point exceptions on some boxes).
  
       This option is experimental, and may be removed in a future release.

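Roughly, the tiling the option text describes works like this.  The sketch
below is a minimal standalone approximation, not the code in classifier.py
(which follows); spamprob() is a hypothetical stand-in for the classifier's
per-token probability lookup, and mindist / max_discriminators stand in for
the minimum_prob_strength and max_discriminators options:

    def pick_tiled_clues(tokens, spamprob, mindist=0.1, max_discriminators=150):
        # Build every unigram and adjacent bigram, remembering which token
        # positions each clue covers, and dropping clues too close to 0.5
        # (that's what minimum_prob_strength does in the real code).
        candidates = []
        for i, tok in enumerate(tokens):
            grams = [(tok, (i,))]
            if i:   # there is a preceding token, so a bigram can be formed
                grams.append(("bi:%s %s" % (tokens[i - 1], tok), (i - 1, i)))
            for clue, indices in grams:
                dist = abs(spamprob(clue) - 0.5)
                if dist >= mindist:
                    candidates.append((dist, clue, indices))

        # Strongest clues first (largest distance from 0.5), then greedily
        # keep clues whose token positions don't overlap anything already
        # kept -- that's the "tiling".
        candidates.sort(reverse=True)
        used = {}       # token positions already covered by a kept clue
        clues = []
        for dist, clue, indices in candidates:
            if len(clues) >= max_discriminators:
                break
            if not [i for i in indices if i in used]:
                for i in indices:
                    used[i] = 1
                clues.append(clue)
        return clues

For example, with made-up probabilities

    probs = {"free": 0.9, "money": 0.8, "now": 0.5, "bi:free money": 0.99}
    pick_tiled_clues(["free", "money", "now"], lambda w: probs.get(w, 0.5))

the bigram "bi:free money" outscores both of its unigrams, so it alone
covers positions 0 and 1, and "now" is too close to 0.5 to count as a clue;
the result is ["bi:free money"].
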
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/classifier.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** classifier.py	17 Dec 2003 02:05:33 -0000	1.17
--- classifier.py	17 Dec 2003 05:43:42 -0000	1.18
***************
*** 359,376 ****
          pass
  
      def _getclues(self, wordstream):
          mindist = options["Classifier", "minimum_prob_strength"]
  
          if options["Classifier", "x-use_bigrams"]:
              raw = []
              push = raw.append
              pair = None
!             seen = {pair: 1}
              for i, token in enumerate(wordstream):
!                 if i:
                      pair = "bi:%s %s" % (last_token, token)
                  last_token = token
                  for clue, indices in (token, (i,)), (pair, (i-1, i)):
!                     if clue not in seen:
                          seen[clue] = 1
                          tup = self._worddistanceget(clue)
--- 359,405 ----
          pass
  
+     # Return list of (prob, word, record) triples, sorted by increasing
+     # prob.  "word" is a token from wordstream; "prob" is its spamprob (a
+     # float in 0.0 through 1.0); and "record" is word's associated
+     # WordInfo record if word is in the training database, or None if it's
+     # not.  No more than max_discriminators items are returned; those have
+     # the strongest (farthest from 0.5) spamprobs of all tokens in wordstream.
+     # Tokens with spamprobs less than minimum_prob_strength away from 0.5
+     # aren't returned.
      def _getclues(self, wordstream):
          mindist = options["Classifier", "minimum_prob_strength"]
  
          if options["Classifier", "x-use_bigrams"]:
+             # This scheme mixes single tokens with pairs of adjacent tokens.
+             # wordstream is "tiled" into non-overlapping unigrams and
+             # bigrams.  Non-overlap is important to prevent a single original
+             # token from contributing to more than one spamprob returned
+             # (systematic correlation probably isn't a good thing).
+ 
+             # First fill list raw with
+             #     (distance, prob, word, record), indices
+             # pairs, one for each unigram and bigram in wordstream.
+             # indices is a tuple containing the indices (0-based relative to
+             # the start of wordstream) of the tokens that went into word.
+             # indices is a 1-tuple for an original token, and a 2-tuple for
+             # a synthesized bigram token.  The indices are needed to detect
+             # overlap later.
              raw = []
              push = raw.append
              pair = None
!             # Keep track of which tokens we've already seen.
!             # Don't use a Set here!  This is an innermost loop, so speed is
!             # important here (direct dict fiddling is much quicker than
!             # invoking Python-level Set methods; in Python 2.4 that will
!             # change).
!             seen = {pair: 1} # so the bigram token is skipped on 1st loop trip
              for i, token in enumerate(wordstream):
!                 if i:   # not the 1st loop trip, so there is a preceding token
!                     # This string interpolation must match the one in
!                     # _enhance_wordstream().
                      pair = "bi:%s %s" % (last_token, token)
                  last_token = token
                  for clue, indices in (token, (i,)), (pair, (i-1, i)):
!                     if clue not in seen:    # as always, skip duplicates
                          seen[clue] = 1
                          tup = self._worddistanceget(clue)
***************
*** 378,395 ****
                              push((tup, indices))
  
              raw.sort()
              raw.reverse()
              clues = []
              push = clues.append
              seen = {}
              for tup, indices in raw:
                  overlap = [i for i in indices if i in seen]
!                 if not overlap:
                      for i in indices:
                          seen[i] = 1
                      push(tup)
              clues.reverse()
  
          else:
              clues = []
              push = clues.append
--- 407,431 ----
                              push((tup, indices))
  
+             # Sort raw, strongest to weakest spamprob.
              raw.sort()
              raw.reverse()
+             # Fill clues with the strongest non-overlapping clues.
              clues = []
              push = clues.append
+             # Keep track of which indices have already contributed to a
+             # clue in clues.
              seen = {}
              for tup, indices in raw:
                  overlap = [i for i in indices if i in seen]
!                 if not overlap: # no overlap with anything already in clues
                      for i in indices:
                          seen[i] = 1
                      push(tup)
+             # Leave sorted from smallest to largest spamprob.
              clues.reverse()
  
          else:
+             # The all-unigram scheme just scores the tokens as-is.  A Set()
+             # is used to weed out duplicates at high speed.
              clues = []
              push = clues.append
***************
*** 443,446 ****
--- 479,484 ----
              yield token
              if last:
+                 # This string interpolation must match the one in
+                 # _getclues().
                  yield "bi:%s %s" % (last, token)
              last = token
