[Spambayes-checkins] spambayes tokenizer.py,1.26,1.27
Tim Peters
tim_one@users.sourceforge.net
Thu, 19 Sep 2002 23:03:14 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv26473
Modified Files:
tokenizer.py
Log Message:
Removed the code in support of tokenizing src= thingies. It was all
commented out because it made no difference when enabled. Note that
we pick up all http:// thingies regardless of their context anyway.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** tokenizer.py 20 Sep 2002 06:00:06 -0000 1.26
--- tokenizer.py 20 Sep 2002 06:03:12 -0000 1.27
***************
*** 578,593 ****
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
- # Anthony Baxter reported goodness from cracking src params.
- # Finding a src= thingie is complicated if we insist it appear in an
- # img or iframe tag, so this approximates reality with a fast and
- # non-stack-blowing simple regexp.
- src_re = re.compile(r"""
- \s
- src=['"]
- (?!https?:) # we suck out http thingies via a different gimmick
- ([^'"]{1,128}) # capture the guts, but don't go wild
- ['"]
- """, re.VERBOSE)
-
fname_sep_re = re.compile(r'[/\\:]')
--- 578,581 ----
***************
*** 1012,1026 ****
for t in tokens:
yield t
-
- # Anthony Baxter reported goodness from tokenizing src= params.
- # XXX This made no difference in my tests: both error rates
- # XXX across 20 runs were identical before and after. I suspect
- # XXX this is because Anthony got most good out of the http
- # XXX thingies in <img src="http://bozo.bozo.com">, but we
- # XXX picked those up in the last step (in src params and
- # XXX everywhere else). So this code is commented out.
- ## for fname in src_re.findall(text):
- ## for x in crack_filename(fname):
- ## yield "src:" + x
# Remove HTML/XML tags.
--- 1000,1003 ----
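
For reference, the deleted `src_re` pattern can be reconstructed verbatim from the removed lines above and exercised in isolation. This is a minimal sketch (the sample HTML snippet is invented for illustration) showing why `src=` values that start with `http:` were deliberately excluded: those URLs are already extracted by the separate http:// gimmick, so the pattern only captured non-URL src values.

```python
import re

# Copied from the lines deleted in this checkin (formerly near line 578).
src_re = re.compile(r"""
    \s
    src=['"]
    (?!https?:)      # http thingies are sucked out via a different gimmick
    ([^'"]{1,128})   # capture the guts, but don't go wild
    ['"]
""", re.VERBOSE)

# Hypothetical sample input: one non-URL src, one http:// src.
html = '<img src="cid:part1.bozo"> <iframe src="http://bozo.bozo.com">'
print(src_re.findall(html))   # → ['cid:part1.bozo']  (the http:// src is skipped)
```

Note the negative lookahead `(?!https?:)` sits just inside the opening quote, so the whole match fails fast on URL-valued src attributes without any backtracking into the capture group.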