[Spambayes-checkins] spambayes tokenizer.py,1.39,1.40
Tim Peters
tim_one@users.sourceforge.net
Thu, 26 Sep 2002 18:28:46 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30138
Modified Files:
tokenizer.py
Log Message:
Beefed up HTML stripping: Accepts more kinds of <style> openers.
Strips "long" HTML comments. Doubled the maximum gut-length of other
kinds of tags. Folded these all into one regexp.
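
For readers who want to try the folded regexp outside the tokenizer, here is a minimal sketch. The pattern is copied from revision 1.40 in the diff below. The helper name strip_html and the sample strings are made up for illustration and are not part of tokenizer.py. The samples are lowercase because the pattern is compiled without re.IGNORECASE.

```python
import re

# Pattern copied from revision 1.40 of tokenizer.py (see the diff below).
html_re = re.compile(r"""
    <
    (?![\s<>])         # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
    (?:
        # style sheets can be very long
        style\b        # maybe it's <style>, or maybe <style type=...>, etc.
        .{0,2048}?
        </style
    |   # so can comments
        !--
        .{0,2048}?
        --
    |   # guessing that other tags are usually "short"
        [^>]{0,256}    # search for the end '>', but don't run wild
    )
    >
""", re.VERBOSE | re.DOTALL)


def strip_html(text):
    """Hypothetical helper: replace each match with a single space."""
    return html_re.sub(' ', text)


# Made-up sample inputs, one per case named in the log message.
samples = [
    '<style type="text/css">\nbody { color: red }\n</style>hello',  # style opener with attributes
    'a<!-- long\ncomment with > inside -->b',                       # "long" comment
    '<b>bold</b>',                                                  # ordinary short tags
    'a < b',                                                        # not a tag: left alone
    '<' + 'x' * 200 + '>',                                          # guts under the 256 limit
    '<' + 'x' * 300 + '>',                                          # guts over the limit: left alone
]

for sample in samples:
    print(repr(strip_html(sample)))
```

The alternatives are tried in order, so a <style> block or comment whose body exceeds 2048 characters falls through to the last branch, which can then strip only the opening tag.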
This redeemed one of my marginal false positives under the f(w) scheme,
leaving me with 2 fp (out of 20,000) and 18 fn (out of 14,000). It
reduced both the ham and the spam mean scores a little, increased the
ham score variance, and decreased the spam score variance.
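
As a quick check on scale, the counts above can be turned into rates. This sketch assumes the 20,000 messages are the ham set and the 14,000 are the spam set, which follows from the usual meaning of fp and fn but is not spelled out in the message. The variable names are illustrative.

```python
# Counts taken from the log message above.
false_positives, ham_total = 2, 20000    # ham wrongly scored as spam (assumed ham set size)
false_negatives, spam_total = 18, 14000  # spam wrongly scored as ham (assumed spam set size)

fp_rate = false_positives / ham_total
fn_rate = false_negatives / spam_total

print(f"fp rate: {fp_rate:.4%}")
print(f"fn rate: {fn_rate:.4%}")
```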
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** tokenizer.py 27 Sep 2002 00:08:13 -0000 1.39
--- tokenizer.py 27 Sep 2002 01:28:43 -0000 1.40
***************
*** 578,588 ****
html_re = re.compile(r"""
<
! [^\s<>] # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
! [^>]{0,128} # search for the end '>', but don't run wild
>
! """, re.VERBOSE)
!
! # An equally cheap-ass gimmick to strip style sheets
! stylesheet_re = re.compile(r"<style>.{0,2000}?</style>", re.DOTALL)
received_host_re = re.compile(r'from (\S+)\s')
--- 578,596 ----
html_re = re.compile(r"""
<
! (?![\s<>]) # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
! (?:
! # style sheets can be very long
! style\b # maybe it's <style>, or maybe <style type=...>, etc.
! .{0,2048}?
! </style
! | # so can comments
! !--
! .{0,2048}?
! --
! | # guessing that other tags are usually "short"
! [^>]{0,256} # search for the end '>', but don't run wild
! )
>
! """, re.VERBOSE | re.DOTALL)
received_host_re = re.compile(r'from (\S+)\s')
***************
*** 1047,1051 ****
if (part.get_content_type() == "text/plain" or
not options.retain_pure_html_tags):
- text = stylesheet_re.sub(' ', text)
text = html_re.sub(' ', text)
--- 1055,1058 ----