[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.11,1.12
tim_one@users.sourceforge.net
Mon, 02 Sep 2002 12:23:43 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv12998
Modified Files:
timtest.py
Log Message:
Made "skip:" tokens shorter.
Added a surprising treatment of Organization headers, with a tiny f-n
benefit for a tiny cost. No change in f-p stats.
false negative percentages (rev 1.11 vs rev 1.12)
1.091 0.945 won
0.945 0.836 won
1.236 1.200 won
1.454 1.418 won
1.491 1.455 won
1.091 1.091 tied
1.127 1.091 won
1.236 1.236 tied
1.636 1.564 won
1.345 1.236 won
1.672 1.563 won
1.599 1.563 won
1.236 1.236 tied
0.836 0.836 tied
1.018 0.873 won
1.236 1.236 tied
1.273 1.273 tied
1.055 1.018 won
1.091 1.091 tied
1.527 1.490 won
won 13 times
tied 7 times
lost 0 times
total unique fn went from 302 to 292
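For reference, the "skip:" change in the diff below only shortens the token prefix ("skipped:" becomes "skip:"); the summary itself is unchanged. A minimal sketch of what such a token looks like (the word here is just an example, not from the corpus):

```python
# For an over-long word, the tokenizer emits a summary token recording
# the word's first character and its length rounded down to a multiple
# of 10, instead of the word itself.
word = "supercalifragilisticexpialidocious"
n = len(word)  # 34 characters
token = "skip:%c %d" % (word[0], n // 10 * 10)
print(token)  # skip:s 30
```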
Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** timtest.py 2 Sep 2002 16:18:54 -0000 1.11
--- timtest.py 2 Sep 2002 19:23:40 -0000 1.12
***************
*** 214,218 ****
# XXX Figure out why, and/or see if some other way of summarizing
# XXX this info has greater benefit.
! yield "skipped:%c %d" % (word[0], n // 10 * 10)
def tokenize(string):
--- 214,218 ----
# XXX Figure out why, and/or see if some other way of summarizing
# XXX this info has greater benefit.
! yield "skip:%c %d" % (word[0], n // 10 * 10)
def tokenize(string):
***************
*** 236,240 ****
# especially significant in this context. Experiment showed a small
# but real benefit to keeping case intact in this specific context.
! subj = msg.get('Subject', '')
for w in subject_word_re.findall(subj):
for t in tokenize_word(w):
--- 236,240 ----
# especially significant in this context. Experiment showed a small
# but real benefit to keeping case intact in this specific context.
! subj = msg.get('subject', '')
for w in subject_word_re.findall(subj):
for t in tokenize_word(w):
***************
*** 242,249 ****
# From:
! subj = msg.get('From', '')
! for w in subj.lower().split():
! for t in tokenize_word(w):
! yield 'from:' + t
# Find, decode (base64, qp), and tokenize the textual parts of the body.
--- 242,259 ----
# From:
! for field in ('from',):
! prefix = field + ':'
! subj = msg.get(field, '')
! for w in subj.lower().split():
! for t in tokenize_word(w):
! yield prefix + t
!
! # Organization:
! # Oddly enough, tokenizing this doesn't make any difference to results.
! # However, noting its mere absence is strong enough to give a tiny
! # improvement in the f-n rate, and since recording that requires only
! # one token across the whole database, the cost is also tiny.
! if msg.get('organization', None) is None:
! yield "bool:noorg"
# Find, decode (base64, qp), and tokenize the textual parts of the body.
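For readers following along without applying the patch, here is a minimal self-contained sketch of the header handling after this revision. `tokenize_headers` is an illustrative name, not the actual timtest.py API, and the real code further splits From: words via tokenize_word; the Organization: logic matches the hunk above.

```python
import email

def tokenize_headers(msg):
    # Sketch of the post-1.12 header tokenization (illustrative only).
    # `msg` is an email.message.Message instance.

    # From: lowercase the field and emit one 'from:' token per word.
    # (Written as a loop over a field tuple, as in the patch, so more
    # header fields can be added the same way later.)
    for field in ('from',):
        prefix = field + ':'
        text = msg.get(field, '')
        for w in text.lower().split():
            yield prefix + w

    # Organization: tokenizing its value doesn't change results, but
    # noting its mere absence helps the f-n rate a little, and costs
    # only one token across the whole database.
    if msg.get('organization', None) is None:
        yield "bool:noorg"

raw = "From: Tim Peters <tim@example.com>\nSubject: test\n\nbody\n"
msg = email.message_from_string(raw)
print(list(tokenize_headers(msg)))
# ['from:tim', 'from:peters', 'from:<tim@example.com>', 'bool:noorg']
```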