[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.10,1.11

tim_one@users.sourceforge.net
Mon, 02 Sep 2002 09:18:56 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20710

Modified Files:
	timtest.py 
Log Message:
tokenize_word():  dropped the prefix from the signature; it's faster
to let the caller do it, and this also repaired a bug in one place where
it was being used (well, a *conceptual* bug anyway, in that the code
didn't do what I intended there).  This changes the stats in an
insignificant way.  The f-p stats didn't change.  The f-n stats shifted
by one message in a few cases:

false negative percentages
    1.091  1.091  tied
    0.945  0.945  tied
    1.200  1.236  lost
    1.454  1.454  tied
    1.491  1.491  tied
    1.091  1.091  tied
    1.091  1.127  lost
    1.236  1.236  tied
    1.636  1.636  tied
    1.382  1.345  won
    1.636  1.672  lost
    1.599  1.599  tied
    1.236  1.236  tied
    0.836  0.836  tied
    1.018  1.018  tied
    1.236  1.236  tied
    1.273  1.273  tied
    1.055  1.055  tied
    1.091  1.091  tied
    1.527  1.527  tied

won   1 times
tied 16 times
lost  3 times

total unique unchanged
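The refactoring described above can be sketched as follows — a minimal,
modern-Python rendering of the pattern (range instead of xrange), not the
full timtest.py logic; tokenize_subject is a hypothetical caller shown only
to illustrate where the prefix now gets attached:

```python
def tokenize_word(word, _len=len):
    """Yield raw tokens for a word; callers attach any context prefix."""
    n = _len(word)
    if 3 <= n <= 12:
        # Short-enough words pass through whole, unprefixed.
        yield word
    elif n > 2:
        if word.count('@') == 1:
            # Don't skip embedded email addresses.
            p1, p2 = word.split('@')
            yield 'email name:' + p1
            for piece in p2.split('.'):
                yield 'email addr:' + piece
        else:
            # Break long words into 5-grams (the real code reserves this
            # for words containing high-bit characters).
            for i in range(n - 4):
                yield "5g:" + word[i : i+5]

def tokenize_subject(subject):
    """Hypothetical caller: the prefix lives here, not in tokenize_word."""
    for w in subject.split():
        for t in tokenize_word(w):
            yield 'subject:' + t
```

Keeping tokenize_word prefix-free avoids a string concatenation per token
on the common path and lets each caller decide its own prefix.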


Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** timtest.py	2 Sep 2002 09:30:44 -0000	1.10
--- timtest.py	2 Sep 2002 16:18:54 -0000	1.11
***************
*** 182,190 ****
  subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
  
! def tokenize_word(word, prefix='', _len=len):
      n = _len(word)
  
      if 3 <= n <= 12:
!         yield prefix + word
  
      elif n > 2:
--- 182,190 ----
  subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
  
! def tokenize_word(word, _len=len):
      n = _len(word)
  
      if 3 <= n <= 12:
!         yield word
  
      elif n > 2:
***************
*** 195,208 ****
          # XXX generate enough bad 5-grams to dominate the final score.
          if has_highbit_char(word):
-             prefix += "5g:"
              for i in xrange(n-4):
!                 yield prefix + word[i : i+5]
  
          elif word.count('@') == 1:
              # Don't want to skip embedded email addresses.
              p1, p2 = word.split('@')
!             yield prefix + 'email name:' + p1
              for piece in p2.split('.'):
!                 yield prefix + 'email addr:' + piece
  
          else:
--- 195,207 ----
          # XXX generate enough bad 5-grams to dominate the final score.
          if has_highbit_char(word):
              for i in xrange(n-4):
!                 yield "5g:" + word[i : i+5]
  
          elif word.count('@') == 1:
              # Don't want to skip embedded email addresses.
              p1, p2 = word.split('@')
!             yield 'email name:' + p1
              for piece in p2.split('.'):
!                 yield 'email addr:' + piece
  
          else:
***************
*** 239,250 ****
      subj = msg.get('Subject', '')
      for w in subject_word_re.findall(subj):
!         for t in tokenize_word(w, 'subject:'):
!             yield t
  
      # From:
      subj = msg.get('From', '')
      for w in subj.lower().split():
!         for t in tokenize_word(w, 'from:'):
!             yield t
  
      # Find, decode (base64, qp), and tokenize the textual parts of the body.
--- 238,249 ----
      subj = msg.get('Subject', '')
      for w in subject_word_re.findall(subj):
!         for t in tokenize_word(w):
!             yield 'subject:' + t
  
      # From:
      subj = msg.get('From', '')
      for w in subj.lower().split():
!         for t in tokenize_word(w):
!             yield 'from:' + t
  
      # Find, decode (base64, qp), and tokenize the textual parts of the body.