[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.7,1.8

tim_one@users.sourceforge.net
Sun, 01 Sep 2002 18:18:19 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv312

Modified Files:
	timtest.py 
Log Message:
Fixed some out-of-date comments.

Made URL clumping lumpier:  now distinguishes among just "first field",
"second field", and "everything else".

Changed tag names for email address fields (semantically neutral).

Added "From:" line tagging.

These add up to an almost pure win.  Before-and-after f-n (false negative)
rates across 20 runs:

1.418   1.236
1.309   1.164
1.636   1.454
1.854   1.599
1.745   1.527
1.418   1.236
1.381   1.163
1.418   1.309
2.109   1.891
1.491   1.418
1.854   1.745
1.890   1.708
1.818   1.491
1.055   0.836
1.164   1.091
1.599   1.309
1.600   1.491
1.127   1.127
1.164   1.309
1.781   1.636

The f-n rate increased in only one run.  The variance appears to have been
reduced too (I didn't bother to compute that, though; a quick check is
sketched below).
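
Here's one way to compute the spread of the two f-n columns, using the same
numbers as the table above (no results claimed here):

    before = [1.418, 1.309, 1.636, 1.854, 1.745, 1.418, 1.381, 1.418, 2.109,
              1.491, 1.854, 1.890, 1.818, 1.055, 1.164, 1.599, 1.600, 1.127,
              1.164, 1.781]
    after  = [1.236, 1.164, 1.454, 1.599, 1.527, 1.236, 1.163, 1.309, 1.891,
              1.418, 1.745, 1.708, 1.491, 0.836, 1.091, 1.309, 1.491, 1.127,
              1.309, 1.636]

    def mean_and_var(xs):
        n = len(xs)
        m = sum(xs) / float(n)
        return m, sum((x - m) ** 2 for x in xs) / (n - 1)   # sample variance

    # print(mean_and_var(before)); print(mean_and_var(after))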

Before-and-after f-p (false positive) rates across 20 runs:

0.000   0.000   
0.000   0.000   
0.075   0.050   
0.000   0.000   
0.025   0.025   
0.050   0.025   
0.075   0.050   
0.025   0.025   
0.025   0.025   
0.025   0.000   
0.100   0.075   
0.050   0.050   
0.025   0.025   
0.000   0.000   
0.075   0.050   
0.025   0.025   
0.025   0.025   
0.000   0.000   
0.075   0.025   
0.100   0.050   

Note that 0.025% is a single message (1/4000 = 0.025%); it's really impossible
to *measure* an improvement in the f-p rate anymore with 4000-msg ham sets.

Across all 20 runs,

the total # of unique f-n fell from 353 to 336
the total # of unique f-p fell from 13 to 8


Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** timtest.py	2 Sep 2002 00:06:34 -0000	1.7
--- timtest.py	2 Sep 2002 01:18:17 -0000	1.8
***************
*** 40,49 ****
                      stack.extend(subpart.get_payload())
  
-             # XXX This comment turned out to be false.  Gave an example to
-             # XXX Barry because it feels like a bug that it's false.  The
-             # XXX code has been changed to worm around it until it's resolved.
-             #     """If only textpart was found, the main walk() will
-             #        eventually add it to text.
-             #     """
              if textpart is not None:
                  text.add(textpart)
--- 40,43 ----
***************
*** 154,161 ****
  # gimmick kicks in, and produces no tokens at all.
  #
! # XXX Try producing character n-grams then under the s-o-w scheme, instead
! # XXX of ignoring the blob.  This was too unattractive before because we
! # XXX weren't decoding base64 or qp.  We're still not decoding uuencoded
! # XXX stuff.  So try this only if there are high-bit characters in the blob.
  #
  # Interesting:  despite that odd example above, the *kinds* of f-p mistakes
--- 148,155 ----
  # gimmick kicks in, and produces no tokens at all.
  #
! # [Later:  we produce character 5-grams then under the s-o-w scheme, instead
! # of ignoring the blob, but only if there are high-bit characters in the blob;
! # e.g., there's no point 5-gramming uuencoded lines, and doing so would
! # bloat the database size.]
  #
  # Interesting:  despite that odd example above, the *kinds* of f-p mistakes
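
A minimal sketch of the 5-gram fallback the [Later] comment above describes
(the helper name and the "5gram:" token prefix are assumptions, not names
taken from timtest.py):

    def crack_high_bit_blob(blob):
        # blob is a text string.  Only 5-gram blobs containing high-bit
        # characters; plain-ASCII blobs such as uuencoded lines would just
        # bloat the database.
        if any(ord(c) >= 128 for c in blob):
            for i in range(len(blob) - 4):
                yield "5gram:" + blob[i:i+5]
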
***************
*** 205,219 ****
              # Don't want to skip embedded email addresses.
              p1, p2 = word.split('@')
!             yield prefix + 'email0:' + p1
              for piece in p2.split('.'):
!                 yield prefix + 'email1:' + piece
  
          else:
              # It's a long string of "normal" chars.  Ignore it.
              # For example, it may be an embedded URL (which we already
!             # tagged), or a uuencoded line.  Curiously, it helps to generate
!             # a token telling roughly how many chars were skipped!  This
!             # rounds up to the nearest multiple of 10.
!             pass#yield prefix + "skipped:" + str((n + 9)//10)
  
  def tokenize(string):
--- 199,213 ----
              # Don't want to skip embedded email addresses.
              p1, p2 = word.split('@')
!             yield prefix + 'email name:' + p1
              for piece in p2.split('.'):
!                 yield prefix + 'email addr:' + piece
  
          else:
              # It's a long string of "normal" chars.  Ignore it.
              # For example, it may be an embedded URL (which we already
!             # tagged), or a uuencoded line.
!             # XXX There appears to be some value in generating a token
!             # XXX indicating roughly how many chars were skipped.
!             pass
  
  def tokenize(string):
***************
*** 230,235 ****
      # XXX The headers in my spam and ham corpora are so different (they came
      # XXX from different sources) that if I include them the classifier's
!     # XXX job is trivial.  But the Subject lines are OK, so use them.
  
      # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
      # especially significant in this context.
--- 224,231 ----
      # XXX The headers in my spam and ham corpora are so different (they came
      # XXX from different sources) that if I include them the classifier's
!     # XXX job is trivial.  Only some "safe" header lines are included here,
!     # XXX where "safe" is specific to my sorry <wink> corpora.
  
+     # Subject:
      # Don't ignore case in Subject lines; e.g., 'free' versus 'FREE' is
      # especially significant in this context.
***************
*** 240,243 ****
--- 236,246 ----
                  yield t
  
+     # From:
+     subj = msg.get('From', None)
+     if subj:
+         for w in subj.lower().split():
+             for t in tokenize_word(w, 'from:'):
+                 yield t
+ 
      # Find, decode (base64, qp), and tokenize the textual parts of the body.
      for part in textparts(msg):
***************
*** 267,271 ****
                  guts = guts[:-1]
              for i, piece in enumerate(guts.split('/')):
!                 prefix = "%s%d:" % (proto, i)
                  for chunk in urlsep_re.split(piece):
                      yield prefix + chunk
--- 270,274 ----
                  guts = guts[:-1]
              for i, piece in enumerate(guts.split('/')):
!                 prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
                  for chunk in urlsep_re.split(piece):
                      yield prefix + chunk
***************
*** 288,295 ****
          guts = f.read()
          f.close()
- #        # Skip the headers.
- #        i = guts.find('\n\n')
- #        if i >= 0:
- #            guts = guts[i+2:]
          self.guts = guts
  
--- 291,294 ----
***************
*** 352,355 ****
--- 351,355 ----
                  for clue in clues:
                      print "prob(%r) = %g" % clue
+                 print
                  print e.guts
  
***************
*** 364,367 ****
--- 364,368 ----
                  for clue in clues:
                      print "prob(%r) = %g" % clue
+                 print
                  print e.guts[:1000]