[Spambayes-checkins] spambayes tokenizer.py,1.1,1.2 timtoken.py,1.8,NONE
Tim Peters
tim_one@users.sourceforge.net
Sat, 07 Sep 2002 11:38:13 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19837
Modified Files:
tokenizer.py
Removed Files:
timtoken.py
Log Message:
Removed timtoken.py from the project. tokenizer.py is essentially a
copy of it, but was made from a somewhat out-of-date version of
timtoken.py. The differences were all in comments; I found them and
folded them back into tokenizer.py.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** tokenizer.py 7 Sep 2002 16:14:09 -0000 1.1
--- tokenizer.py 7 Sep 2002 18:38:10 -0000 1.2
***************
*** 352,355 ****
--- 352,375 ----
# XXX not to strip HTML from HTML-only msgs should be revisited.
+ ##############################################################################
+ # How big should "a word" be?
+ #
+ # As I write this, words shorter than 3 chars are ignored completely, and
+ # words longer than 12 are special-cased, replaced with a summary "I skipped
+ # about so-and-so many chars starting with such-and-such a letter" token.
+ # This makes sense for English if most of the info is in "regular size"
+ # words.
+ #
+ # A test run boosting the upper bound to 13 had no effect on the f-p rate,
+ # and did a little better or a little worse than 12 depending on the run --
+ # overall, no significant difference.  The database size is smaller at 12,
+ # so there's nothing in favor of 13.  A test with the upper bound at 11
+ # showed a slight but consistent bad effect on the f-n rate (lost 12 times,
+ # won once, tied 7 times).
+ #
+ # A test with no lower bound showed a significant increase in the f-n rate.
+ # Curious, but not worth digging into.  Boosting the lower bound to 4 is a
+ # worse idea: the f-p and f-n rates both suffered significantly.  I didn't
+ # try a lower bound of 2.
+
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
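
[The diff context cuts url_re off after its first line. As a
self-contained sketch of the same idea -- the pattern and the token
spellings below are illustrative assumptions, not the actual regex or
tokens in tokenizer.py -- coarse protocol/host tokens could be pulled
out like so:

    import re

    # Hedged sketch: an assumed pattern, not the full url_re from
    # tokenizer.py (the diff context above cuts that regex off).
    url_sketch_re = re.compile(r"""
        (https? | ftp)      # capture the protocol
        ://
        ([^\s/]+)           # capture the host
    """, re.VERBOSE)

    def url_tokens(text):
        # Yield coarse proto:/host: tokens for each URL found.
        for proto, host in url_sketch_re.findall(text):
            yield "proto:" + proto
            yield "host:" + host

    # Example:
    #   list(url_tokens("see http://example.com/page"))
    #   -> ['proto:http', 'host:example.com']
]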
***************
*** 383,392 ****
n = _len(word)
- # XXX How big should "a word" be?
- # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
- # XXX on f-p rate, and did a little better or worse than 12 across
- # XXX runs -- overall, no significant difference. It's only "common
- # XXX sense" so far driving the exclusion of lengths 1 and 2.
-
# Make sure this range matches in tokenize().
if 3 <= n <= 12:
--- 403,406 ----
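
[To make the 3 <= n <= 12 policy concrete, here is a minimal sketch of
the length handling described in the comment block above; the exact
skip-token spelling is an assumption, not necessarily what tokenize()
emits:

    def tokenize_word_sketch(word):
        # Yield tokens for one word under the 3 <= n <= 12 policy;
        # this range must match the one in tokenize().
        n = len(word)
        if 3 <= n <= 12:
            yield word
        elif n > 12:
            # Replace a too-long word with a summary token: roughly
            # "I skipped about so-and-so many chars starting with
            # such-and-such a letter".  Bucketing the length to a
            # multiple of 10 keeps the database small; the token
            # spelling here is an assumption.
            yield "skip:%s %d" % (word[0], n // 10 * 10)
        # Words shorter than 3 chars are ignored completely.

    # Example:
    #   list(tokenize_word_sketch("supercalifragilistic")) -> ['skip:s 20']
    #   list(tokenize_word_sketch("to")) -> []
]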
***************
*** 449,453 ****
#
# A bug in this code prevented Content-Transfer-Encoding from getting
! # picked up. Fixing that bug showed that it didn't helpe, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it);
--- 463,467 ----
#
# A bug in this code prevented Content-Transfer-Encoding from getting
! # picked up. Fixing that bug showed that it didn't help, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it);
***************
*** 567,571 ****
def tokenize_headers(self, msg):
# Special tagging of header lines.
!
# XXX TODO Neil Schemenauer has gotten a good start on this
# XXX (pvt email). The headers in my spam and ham corpora are
--- 581,585 ----
def tokenize_headers(self, msg):
# Special tagging of header lines.
!
# XXX TODO Neil Schemenauer has gotten a good start on this
# XXX (pvt email). The headers in my spam and ham corpora are
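
[For readers new to the code, "special tagging of header lines" means
header-derived tokens are kept distinct from body tokens. A minimal
sketch of the idea -- the header set and the name:word token spelling
are assumptions for illustration, not Neil's approach:

    def tokenize_headers_sketch(msg, headers=("subject", "from")):
        # Yield header-tagged tokens such as 'subject:free'.
        for name in headers:
            value = msg.get(name, "")
            for word in value.lower().split():
                # The header-name prefix keeps, e.g., "free" in a
                # Subject line distinct from "free" in the body.
                yield "%s:%s" % (name, word)

    # Example, using the stdlib email package:
    #   import email
    #   msg = email.message_from_string("Subject: Get rich FREE\n\nbody")
    #   list(tokenize_headers_sketch(msg))
    #   -> ['subject:get', 'subject:rich', 'subject:free']
]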
--- timtoken.py DELETED ---