[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.16,1.17
tim_one@users.sourceforge.net
Wed, 04 Sep 2002 21:32:24 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1336
Modified Files:
timtest.py
Log Message:
Added note about word length.
Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** timtest.py 5 Sep 2002 03:48:28 -0000 1.16
--- timtest.py 5 Sep 2002 04:32:22 -0000 1.17
***************
*** 436,443 ****
n = _len(word)
if 3 <= n <= 12:
yield word
! elif n > 2:
# A long word.
--- 436,450 ----
n = _len(word)
+ # XXX How big should "a word" be?
+ # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
+ # XXX on f-p rate, and did a little better or worse than 12 across
+ # XXX runs -- overall, no significant difference. It's only "common
+ # XXX sense" so far driving the exclusion of lengths 1 and 2.
+
+ # Make sure this range matches in tokenize().
if 3 <= n <= 12:
yield word
! elif n >= 3:
# A long word.
***************
*** 555,558 ****
--- 562,566 ----
for w in text.split():
n = len(w)
+ # Make sure this range matches in tokenize_word().
if 3 <= n <= 12:
yield w
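
The two "Make sure this range matches" comments flag that the same 3..12 length test is duplicated in tokenize_word() and tokenize(). One way to keep them from drifting apart is to put the bounds in one place, sketched below. This is a minimal illustration, not the code in timtest.py: the names MIN_WORD_LEN, MAX_WORD_LEN and is_plain_word are hypothetical, and the long-word branch yields a placeholder token because the diff does not show what the original does there.

```python
# Hypothetical constants; timtest.py hard-codes 3 and 12 in both places.
MIN_WORD_LEN = 3
MAX_WORD_LEN = 12


def is_plain_word(word):
    """Return True when the word's length is in the range yielded unchanged."""
    return MIN_WORD_LEN <= len(word) <= MAX_WORD_LEN


def tokenize_word(word):
    # Sketch only: the real tokenize_word() is not fully shown in the diff.
    if is_plain_word(word):
        yield word
    elif len(word) >= MIN_WORD_LEN:
        # A long word. Placeholder token, NOT the original long-word handling.
        yield "long-word:%d" % len(word)


def tokenize(text):
    # Uses the same helper, so the range cannot drift from tokenize_word().
    for w in text.split():
        if is_plain_word(w):
            yield w
```

With the bounds shared, an experiment such as the boost from 12 to 13 described in the XXX note becomes a one-line change to MAX_WORD_LEN.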