[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.9,1.10
tim_one@users.sourceforge.net
Mon, 02 Sep 2002 02:30:46 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12897
Modified Files:
timtest.py
Log Message:
Don't ask me why this helps -- I don't really know! When skipping "long
words", generating a token with a brief hint about what and how much got
skipped makes a definite improvement in the f-n rate, and doesn't affect
the f-p rate at all. Since experiment said it's a winner, I'm checking
it in. Before (left column) and after (right column):
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.025 0.025 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.075 0.075 tied
0.050 0.050 tied
0.025 0.025 tied
0.000 0.000 tied
0.050 0.050 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 0 times
tied 20 times
lost 0 times
total unique fp went from 8 to 8
false negative percentages
1.236 1.091 won
1.164 0.945 won
1.454 1.200 won
1.599 1.454 won
1.527 1.491 won
1.236 1.091 won
1.163 1.091 won
1.309 1.236 won
1.891 1.636 won
1.418 1.382 won
1.745 1.636 won
1.708 1.599 won
1.491 1.236 won
0.836 0.836 tied
1.091 1.018 won
1.309 1.236 won
1.491 1.273 won
1.127 1.055 won
1.309 1.091 won
1.636 1.527 won
won 19 times
tied 1 times
lost 0 times
total unique fn went from 336 to 302
Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** timtest.py 2 Sep 2002 07:55:25 -0000 1.9
--- timtest.py 2 Sep 2002 09:30:44 -0000 1.10
***************
*** 210,216 ****
# For example, it may be an embedded URL (which we already
# tagged), or a uuencoded line.
! # XXX There appears to be some value in generating a token
! # XXX indicating roughly how many chars were skipped.
! pass
def tokenize(string):
--- 210,219 ----
# For example, it may be an embedded URL (which we already
# tagged), or a uuencoded line.
! # There's value in generating a token indicating roughly how
! # many chars were skipped. This has real benefit for the f-n
! # rate, but is neutral for the f-p rate. I don't know why!
! # XXX Figure out why, and/or see if some other way of summarizing
! # XXX this info has greater benefit.
! yield "skipped:%c %d" % (word[0], n // 10 * 10)
def tokenize(string):
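
For illustration, here is a minimal, self-contained sketch of the idea in the diff above: instead of silently dropping over-long words, yield a summary token built from the word's first character and its length rounded down to a multiple of 10. The `MAX_WORD_LEN` cutoff and the whitespace-split tokenization are assumptions for this sketch; the real cutoff and tokenizer structure live elsewhere in timtest.py.

```python
MAX_WORD_LEN = 12  # assumed cutoff for this sketch; the real value is defined elsewhere

def tokenize(string):
    """Yield tokens, summarizing skipped long words as in the checked-in change."""
    for word in string.split():
        n = len(word)
        if n <= MAX_WORD_LEN:
            yield word.lower()
        else:
            # Emit a brief hint about what got skipped and roughly how
            # much: first char plus length bucketed to a multiple of 10.
            yield "skipped:%c %d" % (word[0], n // 10 * 10)
```

For example, a 34-character word starting with "s" produces the token `skipped:s 30`, so all long words of similar shape collapse into a small set of shared tokens the classifier can learn from.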