[Spambayes-checkins] spambayes tokenizer.py,1.1,1.2 timtoken.py,1.8,NONE
Tim Peters
tim_one@users.sourceforge.net
Sat, 07 Sep 2002 11:38:13 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv19837
Modified Files:
tokenizer.py
Removed Files:
timtoken.py
Log Message:
Removed timtoken.py from the project. tokenizer.py is essentially a
copy of it, but was made from a somewhat out-of-date version of
timtoken.py. The differences were all in comments; I found them and
folded them back into tokenizer.py.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** tokenizer.py 7 Sep 2002 16:14:09 -0000 1.1
--- tokenizer.py 7 Sep 2002 18:38:10 -0000 1.2
***************
*** 352,355 ****
--- 352,375 ----
# XXX not to strip HTML from HTML-only msgs should be revisited.
+ ##############################################################################
+ # How big should "a word" be?
+ #
+ # As I write this, words shorter than 3 chars are ignored completely, and
+ # words longer than 12 are special-cased, replaced with a summary "I skipped
+ # about so-and-so many chars starting with such-and-such a letter" token.
+ # This makes sense for English if most of the info is in "regular size"
+ # words.
+ #
+ # A test run boosting the upper bound to 13 had no effect on the f-p rate,
+ # and did a little better or a little worse than 12 depending on the run --
+ # overall, no significant difference.  The database size is smaller at 12,
+ # so there's nothing in favor of 13.  A test with the upper bound at 11
+ # showed a slight but consistent bad effect on the f-n rate (lost 12 times,
+ # won once, tied 7 times).
+ #
+ # A test with no lower bound showed a significant increase in the f-n rate.
+ # Curious, but not worth digging into.  Boosting the lower bound to 4 is a
+ # worse idea: the f-p and f-n rates both suffered significantly.  I didn't
+ # try a lower bound of 2.
+
url_re = re.compile(r"""
(https? | ftp) # capture the protocol
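
[The diff context cuts url_re off after its first line. As a
self-contained sketch of the same idea -- the pattern and the token
spellings below are illustrative assumptions, not the actual regex or
tokens in tokenizer.py -- coarse protocol/host tokens could be pulled
out like so:

    import re

    # Hedged sketch: an assumed pattern, not the full url_re from
    # tokenizer.py (the diff context above cuts that regex off).
    url_sketch_re = re.compile(r"""
        (https? | ftp)      # capture the protocol
        ://
        ([^\s/]+)           # capture the host
    """, re.VERBOSE)

    def url_tokens(text):
        # Yield coarse proto:/host: tokens for each URL found.
        for proto, host in url_sketch_re.findall(text):
            yield "proto:" + proto
            yield "host:" + host

    # Example:
    #   list(url_tokens("see http://example.com/page"))
    #   -> ['proto:http', 'host:example.com']
]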
***************
*** 383,392 ****
n = _len(word)
- # XXX How big should "a word" be?
- # XXX I expect 12 is fine -- a test run boosting to 13 had no effect
- # XXX on f-p rate, and did a little better or worse than 12 across
- # XXX runs -- overall, no significant difference. It's only "common
- # XXX sense" so far driving the exclusion of lengths 1 and 2.
-
# Make sure this range matches in tokenize().
if 3 <= n <= 12:
--- 403,406 ----
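
[To make the 3 <= n <= 12 policy concrete, here is a minimal sketch of
the length handling described in the comment block above; the exact
skip-token spelling is an assumption, not necessarily what tokenize()
emits:

    def tokenize_word_sketch(word):
        # Yield tokens for one word under the 3 <= n <= 12 policy;
        # this range must match the one in tokenize().
        n = len(word)
        if 3 <= n <= 12:
            yield word
        elif n > 12:
            # Replace a too-long word with a summary token: roughly
            # "I skipped about so-and-so many chars starting with
            # such-and-such a letter".  Bucketing the length to a
            # multiple of 10 keeps the database small; the token
            # spelling here is an assumption.
            yield "skip:%s %d" % (word[0], n // 10 * 10)
        # Words shorter than 3 chars are ignored completely.

    # Example:
    #   list(tokenize_word_sketch("supercalifragilistic")) -> ['skip:s 20']
    #   list(tokenize_word_sketch("to")) -> []
]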
***************
*** 449,453 ****
#
# A bug in this code prevented Content-Transfer-Encoding from getting
! # picked up. Fixing that bug showed that it didn't helpe, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it);
--- 463,467 ----
#
# A bug in this code prevented Content-Transfer-Encoding from getting
! # picked up. Fixing that bug showed that it didn't help, so the corrected
# code is disabled now (left column without Content-Transfer-Encoding,
# right column with it);
***************
*** 567,571 ****
def tokenize_headers(self, msg):
# Special tagging of header lines.
!
# XXX TODO Neil Schemenauer has gotten a good start on this
# XXX (pvt email). The headers in my spam and ham corpora are
--- 581,585 ----
def tokenize_headers(self, msg):
# Special tagging of header lines.
!
# XXX TODO Neil Schemenauer has gotten a good start on this
# XXX (pvt email). The headers in my spam and ham corpora are
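
[For readers new to the code, "special tagging of header lines" means
header-derived tokens are kept distinct from body tokens. A minimal
sketch of the idea -- the header set and the name:word token spelling
are assumptions for illustration, not Neil's approach:

    def tokenize_headers_sketch(msg, headers=("subject", "from")):
        # Yield header-tagged tokens such as 'subject:free'.
        for name in headers:
            value = msg.get(name, "")
            for word in value.lower().split():
                # The header-name prefix keeps, e.g., "free" in a
                # Subject line distinct from "free" in the body.
                yield "%s:%s" % (name, word)

    # Example, using the stdlib email package:
    #   import email
    #   msg = email.message_from_string("Subject: Get rich FREE\n\nbody")
    #   list(tokenize_headers_sketch(msg))
    #   -> ['subject:get', 'subject:rich', 'subject:free']
]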
--- timtoken.py DELETED ---