[Spambayes-checkins] spambayes timtest.py,1.13,1.14
tokenizer.py,1.8,1.9
Tim Peters
tim_one@users.sourceforge.net
Sun, 08 Sep 2002 16:48:52 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10431
Modified Files:
timtest.py tokenizer.py
Log Message:
Tried treating src= params specially. It made no difference, so the
code is left in but commented out. As part of this, refactored the
"file name" parsing into a helper (crack_filename), and that change
stays in.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** timtest.py 8 Sep 2002 21:08:16 -0000 1.13
--- timtest.py 8 Sep 2002 23:48:50 -0000 1.14
***************
*** 141,147 ****
self.trained_spam_hist = Hist(self.nbuckets)
! #f = file('w.pik', 'wb')
! #pickle.dump(self.classifier, f, 1)
! #f.close()
#import sys
#sys.exit(0)
--- 141,147 ----
self.trained_spam_hist = Hist(self.nbuckets)
! f = file('w.pik', 'wb')
! pickle.dump(self.classifier, f, 1)
! f.close()
#import sys
#sys.exit(0)
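[The re-enabled lines above persist the trained classifier via pickle
protocol 1. A minimal standalone sketch of the same round-trip, using a
stand-in dict in place of self.classifier and open() in place of the
old file() builtin:]

```python
import pickle

# Stand-in for the trained classifier (assumption: any picklable
# object behaves the same as self.classifier for this purpose).
classifier = {"spam": 42, "ham": 7}

# Dump with protocol 1, as in the checkin.
with open("w.pik", "wb") as f:
    pickle.dump(classifier, f, 1)

# Load it back to confirm the round-trip.
with open("w.pik", "rb") as f:
    restored = pickle.load(f)

print(restored == classifier)  # True
```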
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** tokenizer.py 8 Sep 2002 21:29:05 -0000 1.8
--- tokenizer.py 8 Sep 2002 23:48:50 -0000 1.9
***************
*** 558,561 ****
--- 558,587 ----
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
+ # Anthony Baxter reported goodness from cracking src params.
+ # Finding a src= thingie is complicated if we insist it appear in an
+ # img or iframe tag, so this approximates reality with a fast and
+ # non-stack-blowing simple regexp.
+ src_re = re.compile(r"""
+ \s
+ src=['"]
+ (?!https?:) # we suck out http thingies via a different gimmick
+ ([^'"]{1,128}) # capture the guts, but don't go wild
+ ['"]
+ """, re.VERBOSE)
+
+ fname_sep_re = re.compile(r'[/\\:]')
+
+ def crack_filename(fname):
+ yield "fname:" + fname
+ components = fname_sep_re.split(fname)
+ morethan1 = len(components) > 1
+ for component in components:
+ if morethan1:
+ yield "fname comp:" + component
+ pieces = urlsep_re.split(component)
+ if len(pieces) > 1:
+ for piece in pieces:
+ yield "fname piece:" + piece
+
def tokenize_word(word, _len=len):
n = _len(word)
***************
*** 701,707 ****
fname = msg.get_filename()
if fname is not None:
! for x in fname.lower().split('/'):
! for y in x.split('.'):
! yield 'filename:' + y
if 0: # disabled; see comment before function
--- 727,732 ----
fname = msg.get_filename()
if fname is not None:
! for x in crack_filename(fname):
! yield 'filename:' + x
if 0: # disabled; see comment before function
***************
*** 874,877 ****
--- 899,913 ----
for chunk in urlsep_re.split(piece):
yield prefix + chunk
+
+ # Anthony Baxter reported goodness from tokenizing src= params.
+ # XXX This made no difference in my tests: both error rates
+ # XXX across 20 runs were identical before and after. I suspect
+ # XXX this is because Anthony got most good out of the http
+ # XXX thingies in <img src="http://bozo.bozo.com">, but we
+ # XXX picked those up in the last step (in src params and
+ # XXX everywhere else). So this code is commented out.
+ ## for fname in src_re.findall(text):
+ ## for x in crack_filename(fname):
+ ## yield "src:" + x
# Remove HTML/XML tags if it's a plain text message.
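[For reference, the new helpers can be exercised standalone. This
sketch assumes a urlsep_re that splits on common URL punctuation; the
real pattern is defined elsewhere in tokenizer.py, so the exact tokens
below are illustrative:]

```python
import re

# Assumed stand-in for tokenizer.py's urlsep_re (defined elsewhere in
# the module); splits on URL-ish punctuation.
urlsep_re = re.compile(r"[;?:@&=+,$.]")

fname_sep_re = re.compile(r'[/\\:]')

def crack_filename(fname):
    # Emit the whole name, then path components, then punctuation-split
    # pieces of each component, mirroring the checked-in helper.
    yield "fname:" + fname
    components = fname_sep_re.split(fname)
    morethan1 = len(components) > 1
    for component in components:
        if morethan1:
            yield "fname comp:" + component
        pieces = urlsep_re.split(component)
        if len(pieces) > 1:
            for piece in pieces:
                yield "fname piece:" + piece

# src_re as checked in: grabs non-http src= params with a fast,
# non-stack-blowing regexp rather than a full tag parse.
src_re = re.compile(r"""
    \s
    src=['"]
    (?!https?:)          # http thingies are sucked out via a different gimmick
    ([^'"]{1,128})       # capture the guts, but don't go wild
    ['"]
""", re.VERBOSE)

html = '<img src="images/spam.gif"> <img src="http://bozo.bozo.com/x.gif">'
# The http:// src is skipped by the lookahead; only the bare path matches.
for fname in src_re.findall(html):
    print(list(crack_filename(fname)))
```

Splitting "images/spam.gif" yields the whole name, each path component,
and the dot-split pieces "spam" and "gif" as separate tokens.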