[Spambayes-checkins] spambayes tokenizer.py,1.26,1.27
Tim Peters
tim_one@users.sourceforge.net
Thu, 19 Sep 2002 23:03:14 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26473
Modified Files:
tokenizer.py
Log Message:
Removed the code in support of tokenizing src= thingies. It was all
commented out because it made no difference when enabled. Note that
we pick up all http:// thingies regardless of their context anyway.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** tokenizer.py 20 Sep 2002 06:00:06 -0000 1.26
--- tokenizer.py 20 Sep 2002 06:03:12 -0000 1.27
***************
*** 578,593 ****
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
- # Anthony Baxter reported goodness from cracking src params.
- # Finding a src= thingie is complicated if we insist it appear in an
- # img or iframe tag, so this approximates reality with a fast and
- # non-stack-blowing simple regexp.
- src_re = re.compile(r"""
- \s
- src=['"]
- (?!https?:) # we suck out http thingies via a different gimmick
- ([^'"]{1,128}) # capture the guts, but don't go wild
- ['"]
- """, re.VERBOSE)
-
fname_sep_re = re.compile(r'[/\\:]')
--- 578,581 ----
***************
*** 1012,1026 ****
for t in tokens:
yield t
-
- # Anthony Baxter reported goodness from tokenizing src= params.
- # XXX This made no difference in my tests: both error rates
- # XXX across 20 runs were identical before and after. I suspect
- # XXX this is because Anthony got most good out of the http
- # XXX thingies in <img src="http://bozo.bozo.com">, but we
- # XXX picked those up in the last step (in src params and
- # XXX everywhere else). So this code is commented out.
- ## for fname in src_re.findall(text):
- ## for x in crack_filename(fname):
- ## yield "src:" + x
# Remove HTML/XML tags.
--- 1000,1003 ----
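
For reference, the deleted `src_re` pattern can be reconstructed verbatim from the removed lines above and exercised in isolation. This is a minimal sketch (the sample HTML snippet is invented for illustration) showing why `src=` values that start with `http:` were deliberately excluded: those URLs are already extracted by the separate http:// gimmick, so the pattern only captured non-URL src values.

```python
import re

# Copied from the lines deleted in this checkin (formerly near line 578).
src_re = re.compile(r"""
    \s
    src=['"]
    (?!https?:)      # http thingies are sucked out via a different gimmick
    ([^'"]{1,128})   # capture the guts, but don't go wild
    ['"]
""", re.VERBOSE)

# Hypothetical sample input: one non-URL src, one http:// src.
html = '<img src="cid:part1.bozo"> <iframe src="http://bozo.bozo.com">'
print(src_re.findall(html))   # → ['cid:part1.bozo']  (the http:// src is skipped)
```

Note the negative lookahead `(?!https?:)` sits just inside the opening quote, so the whole match fails fast on URL-valued src attributes without any backtracking into the capture group.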