[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.9,1.10

tim_one@users.sourceforge.net
Mon, 02 Sep 2002 02:30:46 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12897

Modified Files:
	timtest.py 
Log Message:
Don't ask me why this helps -- I don't really know!  When skipping "long
words", generating a token with a brief hint about what and how much got
skipped makes a definite improvement in the f-n rate, and doesn't affect
the f-p rate at all.  Since experiment said it's a winner, I'm checking
it in.  Before (left column) and after (right column):

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.075  0.075  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   0 times
tied 20 times
lost  0 times

total unique fp went from 8 to 8

false negative percentages
    1.236  1.091  won
    1.164  0.945  won
    1.454  1.200  won
    1.599  1.454  won
    1.527  1.491  won
    1.236  1.091  won
    1.163  1.091  won
    1.309  1.236  won
    1.891  1.636  won
    1.418  1.382  won
    1.745  1.636  won
    1.708  1.599  won
    1.491  1.236  won
    0.836  0.836  tied
    1.091  1.018  won
    1.309  1.236  won
    1.491  1.273  won
    1.127  1.055  won
    1.309  1.091  won
    1.636  1.527  won

won  19 times
tied  1 times
lost  0 times

total unique fn went from 336 to 302
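The won/tied/lost tallies above come from comparing each pair of error
rates across the 20 runs.  The actual reporting script isn't shown in this
message; a minimal sketch of that comparison, with an illustrative
`compare` helper, would be:

```python
def compare(before, after, eps=1e-9):
    """Tally paired error rates: 'won' means the rate dropped,
    'lost' means it rose, 'tied' means it was unchanged."""
    won = tied = lost = 0
    for b, a in zip(before, after):
        if abs(b - a) < eps:
            tied += 1
        elif a < b:
            won += 1
        else:
            lost += 1
    return won, tied, lost

# Two of the f-n rows from the tables above: one improvement, one tie.
print(compare([1.236, 0.836], [1.091, 0.836]))  # -> (1, 1, 0)
```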


Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** timtest.py	2 Sep 2002 07:55:25 -0000	1.9
--- timtest.py	2 Sep 2002 09:30:44 -0000	1.10
***************
*** 210,216 ****
              # For example, it may be an embedded URL (which we already
              # tagged), or a uuencoded line.
!             # XXX There appears to be some value in generating a token
!             # XXX indicating roughly how many chars were skipped.
!             pass
  
  def tokenize(string):
--- 210,219 ----
              # For example, it may be an embedded URL (which we already
              # tagged), or a uuencoded line.
!             # There's value in generating a token indicating roughly how
!             # many chars were skipped.  This has real benefit for the f-n
!             # rate, but is neutral for the f-p rate.  I don't know why!
!             # XXX Figure out why, and/or see if some other way of summarizing
!             # XXX this info has greater benefit.
!             yield "skipped:%c %d" % (word[0], n // 10 * 10)
  
  def tokenize(string):
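
The committed change can be sketched as a standalone snippet: instead of
silently dropping "long words", the tokenizer yields a summary token built
from the word's first character and its length rounded down to a multiple
of 10, exactly as in the `yield` line of the diff.  The surrounding names
here (`long_word_tokens`, `MAX_WORD_SIZE`) are illustrative, not taken
from timtest.py:

```python
MAX_WORD_SIZE = 12  # illustrative threshold for "too long to keep"

def long_word_tokens(words):
    """Yield words as-is, but replace over-long ones with a summary
    token recording the leading char and a coarse length bucket."""
    for word in words:
        n = len(word)
        if n <= MAX_WORD_SIZE:
            yield word
        else:
            # Brief hint about what and how much got skipped.
            yield "skipped:%c %d" % (word[0], n // 10 * 10)

print(list(long_word_tokens(["spam", "x" * 25])))
# -> ['spam', 'skipped:x 20']
```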