[Spambayes-checkins] spambayes timtest.py,1.13,1.14
tokenizer.py,1.8,1.9
Tim Peters
tim_one@users.sourceforge.net
Sun, 08 Sep 2002 16:48:52 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10431
Modified Files:
timtest.py tokenizer.py
Log Message:
Tried treating src= params specially. It made no difference, so the
code is left in but commented out. As part of this, refactored the
"file name" parsing into a helper (crack_filename), and that change
stays in.
Index: timtest.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/timtest.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** timtest.py 8 Sep 2002 21:08:16 -0000 1.13
--- timtest.py 8 Sep 2002 23:48:50 -0000 1.14
***************
*** 141,147 ****
self.trained_spam_hist = Hist(self.nbuckets)
! #f = file('w.pik', 'wb')
! #pickle.dump(self.classifier, f, 1)
! #f.close()
#import sys
#sys.exit(0)
--- 141,147 ----
self.trained_spam_hist = Hist(self.nbuckets)
! f = file('w.pik', 'wb')
! pickle.dump(self.classifier, f, 1)
! f.close()
#import sys
#sys.exit(0)
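[The re-enabled lines above persist the trained classifier via pickle
protocol 1. A minimal standalone sketch of the same round-trip, using a
stand-in dict in place of self.classifier and open() in place of the
old file() builtin:]

```python
import pickle

# Stand-in for the trained classifier (assumption: any picklable
# object behaves the same as self.classifier for this purpose).
classifier = {"spam": 42, "ham": 7}

# Dump with protocol 1, as in the checkin.
with open("w.pik", "wb") as f:
    pickle.dump(classifier, f, 1)

# Load it back to confirm the round-trip.
with open("w.pik", "rb") as f:
    restored = pickle.load(f)

print(restored == classifier)  # True
```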
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** tokenizer.py 8 Sep 2002 21:29:05 -0000 1.8
--- tokenizer.py 8 Sep 2002 23:48:50 -0000 1.9
***************
*** 558,561 ****
--- 558,587 ----
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
+ # Anthony Baxter reported goodness from cracking src params.
+ # Finding a src= thingie is complicated if we insist it appear in an
+ # img or iframe tag, so this approximates reality with a fast and
+ # non-stack-blowing simple regexp.
+ src_re = re.compile(r"""
+ \s
+ src=['"]
+ (?!https?:) # we suck out http thingies via a different gimmick
+ ([^'"]{1,128}) # capture the guts, but don't go wild
+ ['"]
+ """, re.VERBOSE)
+
+ fname_sep_re = re.compile(r'[/\\:]')
+
+ def crack_filename(fname):
+ yield "fname:" + fname
+ components = fname_sep_re.split(fname)
+ morethan1 = len(components) > 1
+ for component in components:
+ if morethan1:
+ yield "fname comp:" + component
+ pieces = urlsep_re.split(component)
+ if len(pieces) > 1:
+ for piece in pieces:
+ yield "fname piece:" + piece
+
def tokenize_word(word, _len=len):
n = _len(word)
***************
*** 701,707 ****
fname = msg.get_filename()
if fname is not None:
! for x in fname.lower().split('/'):
! for y in x.split('.'):
! yield 'filename:' + y
if 0: # disabled; see comment before function
--- 727,732 ----
fname = msg.get_filename()
if fname is not None:
! for x in crack_filename(fname):
! yield 'filename:' + x
if 0: # disabled; see comment before function
***************
*** 874,877 ****
--- 899,913 ----
for chunk in urlsep_re.split(piece):
yield prefix + chunk
+
+ # Anthony Baxter reported goodness from tokenizing src= params.
+ # XXX This made no difference in my tests: both error rates
+ # XXX across 20 runs were identical before and after. I suspect
+ # XXX this is because Anthony got most good out of the http
+ # XXX thingies in <img src="http://bozo.bozo.com">, but we
+ # XXX picked those up in the last step (in src params and
+ # XXX everywhere else). So this code is commented out.
+ ## for fname in src_re.findall(text):
+ ## for x in crack_filename(fname):
+ ## yield "src:" + x
# Remove HTML/XML tags if it's a plain text message.
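[For reference, the new helpers can be exercised standalone. This
sketch assumes a urlsep_re that splits on common URL punctuation; the
real pattern is defined elsewhere in tokenizer.py, so the exact tokens
below are illustrative:]

```python
import re

# Assumed stand-in for tokenizer.py's urlsep_re (defined elsewhere in
# the module); splits on URL-ish punctuation.
urlsep_re = re.compile(r"[;?:@&=+,$.]")

fname_sep_re = re.compile(r'[/\\:]')

def crack_filename(fname):
    # Emit the whole name, then path components, then punctuation-split
    # pieces of each component, mirroring the checked-in helper.
    yield "fname:" + fname
    components = fname_sep_re.split(fname)
    morethan1 = len(components) > 1
    for component in components:
        if morethan1:
            yield "fname comp:" + component
        pieces = urlsep_re.split(component)
        if len(pieces) > 1:
            for piece in pieces:
                yield "fname piece:" + piece

# src_re as checked in: grabs non-http src= params with a fast,
# non-stack-blowing regexp rather than a full tag parse.
src_re = re.compile(r"""
    \s
    src=['"]
    (?!https?:)          # http thingies are sucked out via a different gimmick
    ([^'"]{1,128})       # capture the guts, but don't go wild
    ['"]
""", re.VERBOSE)

html = '<img src="images/spam.gif"> <img src="http://bozo.bozo.com/x.gif">'
# The http:// src is skipped by the lookahead; only the bare path matches.
for fname in src_re.findall(html):
    print(list(crack_filename(fname)))
```

Splitting "images/spam.gif" yields the whole name, each path component,
and the dot-split pieces "spam" and "gif" as separate tokens.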