[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.11,1.12

tim_one@users.sourceforge.net
Mon, 02 Sep 2002 12:23:43 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12998

Modified Files:
	timtest.py 
Log Message:
Made "skip:" tokens shorter.

Added a surprising treatment of Organization headers, with a tiny f-n
benefit for a tiny cost.  No change in f-p stats.

false negative percentages (before / after)
    1.091  0.945  won
    0.945  0.836  won
    1.236  1.200  won
    1.454  1.418  won
    1.491  1.455  won
    1.091  1.091  tied
    1.127  1.091  won
    1.236  1.236  tied
    1.636  1.564  won
    1.345  1.236  won
    1.672  1.563  won
    1.599  1.563  won
    1.236  1.236  tied
    0.836  0.836  tied
    1.018  0.873  won
    1.236  1.236  tied
    1.273  1.273  tied
    1.055  1.018  won
    1.091  1.091  tied
    1.527  1.490  won

won  13 times
tied  7 times
lost  0 times

total unique fn went from 302 to 292
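
The Organization-header trick above can be sketched in isolation. This is a minimal illustration, not the checked-in code: tokenizing the header's value made no measurable difference, but emitting one token when the header is simply absent gave the tiny f-n win, at the cost of a single token in the whole database.

```python
# Minimal sketch (not the checked-in timtest.py code) of the
# Organization-header treatment: note only the header's absence.
import email

def org_tokens(msg_text):
    msg = email.message_from_string(msg_text)
    # email header lookup is case-insensitive; one token across the
    # whole database records "no Organization header at all".
    if msg.get('organization', None) is None:
        yield "bool:noorg"

print(list(org_tokens("Subject: hi\n\nbody")))          # ['bool:noorg']
print(list(org_tokens("Organization: ACME\n\nbody")))   # []
```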


Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timtest.py	2 Sep 2002 16:18:54 -0000	1.11
--- timtest.py	2 Sep 2002 19:23:40 -0000	1.12
***************
*** 214,218 ****
              # XXX Figure out why, and/or see if some other way of summarizing
              # XXX this info has greater benefit.
!             yield "skipped:%c %d" % (word[0], n // 10 * 10)
  
  def tokenize(string):
--- 214,218 ----
              # XXX Figure out why, and/or see if some other way of summarizing
              # XXX this info has greater benefit.
!             yield "skip:%c %d" % (word[0], n // 10 * 10)
  
  def tokenize(string):
***************
*** 236,240 ****
      # especially significant in this context.  Experiment showed a small
      # but real benefit to keeping case intact in this specific context.
!     subj = msg.get('Subject', '')
      for w in subject_word_re.findall(subj):
          for t in tokenize_word(w):
--- 236,240 ----
      # especially significant in this context.  Experiment showed a small
      # but real benefit to keeping case intact in this specific context.
!     subj = msg.get('subject', '')
      for w in subject_word_re.findall(subj):
          for t in tokenize_word(w):
***************
*** 242,249 ****
  
      # From:
!     subj = msg.get('From', '')
!     for w in subj.lower().split():
!         for t in tokenize_word(w):
!             yield 'from:' + t
  
      # Find, decode (base64, qp), and tokenize the textual parts of the body.
--- 242,259 ----
  
      # From:
!     for field in ('from',):
!         prefix = field + ':'
!         subj = msg.get(field, '')
!         for w in subj.lower().split():
!             for t in tokenize_word(w):
!                 yield prefix + t
! 
!     # Organization:
!     # Oddly enough, tokenizing this doesn't make any difference to results.
!     # However, noting its mere absence is strong enough to give a tiny
!     # improvement in the f-n rate, and since recording that requires only
!     # one token across the whole database, the cost is also tiny.
!     if msg.get('organization', None) is None:
!         yield "bool:noorg"
  
      # Find, decode (base64, qp), and tokenize the textual parts of the body.
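
The first hunk's shorter "skip:" token works as in this standalone sketch (not the checked-in code): an over-long word is summarized by its first character plus its length rounded down to a multiple of 10, so that many distinct long words collapse into a few tokens.

```python
# Minimal sketch (not the checked-in timtest.py code) of the shortened
# "skip:" token: first character of the word, plus its length binned
# down to the nearest multiple of 10.
def skip_token(word):
    n = len(word)
    return "skip:%c %d" % (word[0], n // 10 * 10)

print(skip_token("supercalifragilisticexpialidocious"))  # skip:s 30
```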