[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.10,1.11
tim_one@users.sourceforge.net
Mon, 02 Sep 2002 09:18:56 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv20710
Modified Files:
timtest.py
Log Message:
tokenize_word(): dropped the prefix argument from the signature; it's
faster to let the caller prepend the prefix, and this also repaired a bug
in one place the function was being used (well, a *conceptual* bug anyway,
in that the code didn't do what I intended there). This changes the stats
in an insignificant way. The false-positive (f-p) stats didn't change. The
false-negative (f-n) stats shifted by one message in a few cases:
false negative percentages
    1.091  1.091  tied
    0.945  0.945  tied
    1.200  1.236  lost
    1.454  1.454  tied
    1.491  1.491  tied
    1.091  1.091  tied
    1.091  1.127  lost
    1.236  1.236  tied
    1.636  1.636  tied
    1.382  1.345  won
    1.636  1.672  lost
    1.599  1.599  tied
    1.236  1.236  tied
    0.836  0.836  tied
    1.018  1.018  tied
    1.236  1.236  tied
    1.273  1.273  tied
    1.055  1.055  tied
    1.091  1.091  tied
    1.527  1.527  tied
won   1 times
tied 16 times
lost  3 times
total unique unchanged
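The won/tied/lost tallies can be re-derived from the rate pairs listed in the log message. The sketch below is not the script that produced those numbers. It assumes the first column is the run before this change, the second is the run after it, and a lower false-negative rate counts as a win. The names `pairs` and `tally` are made up for illustration.

```python
# Rate pairs copied from the false negative table in the log message.
# Assumption: (before, after) order, lower is better.
pairs = [
    (1.091, 1.091), (0.945, 0.945), (1.200, 1.236), (1.454, 1.454),
    (1.491, 1.491), (1.091, 1.091), (1.091, 1.127), (1.236, 1.236),
    (1.636, 1.636), (1.382, 1.345), (1.636, 1.672), (1.599, 1.599),
    (1.236, 1.236), (0.836, 0.836), (1.018, 1.018), (1.236, 1.236),
    (1.273, 1.273), (1.055, 1.055), (1.091, 1.091), (1.527, 1.527),
]


def tally(pairs):
    """Count how often the second run won, tied, or lost against the first."""
    won = tied = lost = 0
    for before, after in pairs:
        if after < before:
            won += 1      # fewer false negatives after the change
        elif after > before:
            lost += 1     # more false negatives after the change
        else:
            tied += 1
    return won, tied, lost


print("won %d, tied %d, lost %d" % tally(pairs))
```

Under those assumptions the counts match the summary lines in the log message.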
Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** timtest.py 2 Sep 2002 09:30:44 -0000 1.10
--- timtest.py 2 Sep 2002 16:18:54 -0000 1.11
***************
*** 182,190 ****
  subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
! def tokenize_word(word, prefix='', _len=len):
      n = _len(word)
      if 3 <= n <= 12:
!         yield prefix + word
      elif n > 2:
--- 182,190 ----
  subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
! def tokenize_word(word, _len=len):
      n = _len(word)
      if 3 <= n <= 12:
!         yield word
      elif n > 2:
***************
*** 195,208 ****
          # XXX generate enough bad 5-grams to dominate the final score.
          if has_highbit_char(word):
-             prefix += "5g:"
              for i in xrange(n-4):
!                 yield prefix + word[i : i+5]
          elif word.count('@') == 1:
              # Don't want to skip embedded email addresses.
              p1, p2 = word.split('@')
!             yield prefix + 'email name:' + p1
              for piece in p2.split('.'):
!                 yield prefix + 'email addr:' + piece
          else:
--- 195,207 ----
          # XXX generate enough bad 5-grams to dominate the final score.
          if has_highbit_char(word):
              for i in xrange(n-4):
!                 yield "5g:" + word[i : i+5]
          elif word.count('@') == 1:
              # Don't want to skip embedded email addresses.
              p1, p2 = word.split('@')
!             yield 'email name:' + p1
              for piece in p2.split('.'):
!                 yield 'email addr:' + piece
          else:
***************
*** 239,250 ****
      subj = msg.get('Subject', '')
      for w in subject_word_re.findall(subj):
!         for t in tokenize_word(w, 'subject:'):
!             yield t
      # From:
      subj = msg.get('From', '')
      for w in subj.lower().split():
!         for t in tokenize_word(w, 'from:'):
!             yield t
      # Find, decode (base64, qp), and tokenize the textual parts of the body.
--- 238,249 ----
      subj = msg.get('Subject', '')
      for w in subject_word_re.findall(subj):
!         for t in tokenize_word(w):
!             yield 'subject:' + t
      # From:
      subj = msg.get('From', '')
      for w in subj.lower().split():
!         for t in tokenize_word(w):
!             yield 'from:' + t
      # Find, decode (base64, qp), and tokenize the textual parts of the body.
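The caller-side prefixing introduced by this diff can be tried in isolation with the sketch below. It is a Python 3 approximation, not the checked-in code: `xrange` becomes `range`, `has_highbit_char` is a hypothetical stand-in for a helper the diff calls but does not show, `prefixed` is an invented wrapper that mirrors the Subject and From loops, and the lines the diff elides (including the final `else` branch) are left out rather than guessed.

```python
def has_highbit_char(word):
    # Hypothetical stand-in: the real helper is not shown in the diff.
    return any(ord(ch) >= 128 for ch in word)


def tokenize_word(word, _len=len):
    """Yield tokens for one word; the caller adds any header prefix."""
    n = _len(word)
    if 3 <= n <= 12:
        yield word
    elif n > 2:
        # Lines elided by the diff are omitted in this sketch.
        if has_highbit_char(word):
            for i in range(n - 4):  # xrange in the Python 2 original
                yield "5g:" + word[i:i + 5]
        elif word.count('@') == 1:
            # Don't want to skip embedded email addresses.
            p1, p2 = word.split('@')
            yield 'email name:' + p1
            for piece in p2.split('.'):
                yield 'email addr:' + piece
        # The original's else branch is not visible in the diff; omitted.


def prefixed(words, prefix):
    # Invented wrapper: the caller prepends the prefix, as in revision 1.11.
    for w in words:
        for t in tokenize_word(w):
            yield prefix + t


for token in prefixed("someone@example.com hello".split(), 'from:'):
    print(token)
```

Because the prefix is applied once per yielded token in the caller, `tokenize_word` no longer has to concatenate it on every branch.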