[Python-checkins] python/nondist/sandbox/spambayes timtest.py,1.16,1.17

tim_one@users.sourceforge.net
Wed, 04 Sep 2002 21:32:24 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv1336

Modified Files:
	timtest.py 
Log Message:
Added note about word length.


Index: timtest.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/timtest.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** timtest.py	5 Sep 2002 03:48:28 -0000	1.16
--- timtest.py	5 Sep 2002 04:32:22 -0000	1.17
***************
*** 436,443 ****
      n = _len(word)
  
      if 3 <= n <= 12:
          yield word
  
!     elif n > 2:
          # A long word.
  
--- 436,450 ----
      n = _len(word)
  
+     # XXX How big should "a word" be?
+     # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
+     # XXX on f-p rate, and did a little better or worse than 12 across
+     # XXX runs -- overall, no significant difference.  It's only "common
+     # XXX sense" so far driving the exclusion of lengths 1 and 2.
+ 
+     # Make sure this range matches in tokenize().
      if 3 <= n <= 12:
          yield word
  
!     elif n >= 3:
          # A long word.
  
***************
*** 555,558 ****
--- 562,566 ----
          for w in text.split():
              n = len(w)
+             # Make sure this range matches in tokenize_word().
              if 3 <= n <= 12:
                  yield w
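
Editorial sketch, not part of the commit: both new comments ask that the 3..12 range be kept in sync by hand between tokenize_word() and tokenize(). One possible way to make a mismatch impossible is to route both checks through shared constants. The names MIN_WORD_LEN, MAX_WORD_LEN and is_plain_word are hypothetical and do not appear in timtest.py, and the "long:" token is only a placeholder for whatever the real long-word branch does.

```python
# Hypothetical constants; timtest.py hard-codes 3 and 12 in two places.
MIN_WORD_LEN = 3
MAX_WORD_LEN = 12


def is_plain_word(n):
    """Return True if a word of length n should be yielded unchanged."""
    return MIN_WORD_LEN <= n <= MAX_WORD_LEN


def tokenize_word(word):
    n = len(word)
    if is_plain_word(n):
        yield word
    elif n >= MIN_WORD_LEN:
        # A long word.  Placeholder only: the real long-word handling
        # is not shown in this diff.
        yield "long:%d" % n


def tokenize(text):
    for w in text.split():
        # Same predicate as tokenize_word(), so the ranges cannot drift.
        if is_plain_word(len(w)):
            yield w
```

With this arrangement, an experiment such as the 13-character test run mentioned in the log comment would only need MAX_WORD_LEN changed in one place.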