[Spambayes-checkins] spambayes tokenizer.py,1.39,1.40

Thu, 26 Sep 2002 18:28:46 -0700

Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30138

Modified Files:
	tokenizer.py 
Log Message:
Beefed up HTML stripping:  Accepts more kinds of <style> openers (e.g.,
<style type=...>).  Strips "long" HTML comments (up to 2048 characters
of body).  Doubled the maximum gut-length of other kinds of tags, from
128 to 256 characters.  Folded these all into one regexp.

This redeemed one of my marginal false positives under the f(w) scheme,
leaving me with 2 false positives (out of 20,000 ham) and 18 false
negatives (out of 14,000 spam).  It reduced both the ham and the spam
mean scores a little, increased the ham score variance, and decreased
the spam score variance.

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** tokenizer.py	27 Sep 2002 00:08:13 -0000	1.39
--- tokenizer.py	27 Sep 2002 01:28:43 -0000	1.40
***************
*** 578,588 ****
  html_re = re.compile(r"""
      <
!     [^\s<>]     # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
!     [^>]{0,128} # search for the end '>', but don't run wild
      >
! """, re.VERBOSE)
! 
! # An equally cheap-ass gimmick to strip style sheets
! stylesheet_re = re.compile(r"<style>.{0,2000}?</style>", re.DOTALL)

  received_host_re = re.compile(r'from (\S+)\s')
--- 578,596 ----
  html_re = re.compile(r"""
      <
!     (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
!     (?:
!         # style sheets can be very long
!         style\b     # maybe it's <style>, or maybe <style type=...>, etc.
!         .{0,2048}?
!         </style
!     |   # so can comments
!         !--
!         .{0,2048}?
!         --
!     |   # guessing that other tags are usually "short"
!         [^>]{0,256} # search for the end '>', but don't run wild
!     )
      >
! """, re.VERBOSE | re.DOTALL)

  received_host_re = re.compile(r'from (\S+)\s')
***************
*** 1047,1051 ****
              if (part.get_content_type() == "text/plain" or
                      not options.retain_pure_html_tags):
-                 text = stylesheet_re.sub(' ', text)
                  text = html_re.sub(' ', text)

--- 1055,1058 ----
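
To see what the folded regexp changes in practice, here is a minimal sketch in modern Python 3 (the original code targeted the Python of 2002). The patterns are copied from the diff above. The helper names strip_old and strip_new and the three sample strings are hypothetical, made up for this illustration, and the rest of the tokenizer is not reproduced.

```python
import re

# Revision 1.39: a short-tag pattern plus a separate stylesheet pattern.
old_html_re = re.compile(r"""
    <
    [^\s<>]     # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
    [^>]{0,128} # search for the end '>', but don't run wild
    >
""", re.VERBOSE)
old_stylesheet_re = re.compile(r"<style>.{0,2000}?</style>", re.DOTALL)

# Revision 1.40: style sheets, comments and short tags in one pattern.
new_html_re = re.compile(r"""
    <
    (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
    (?:
        # style sheets can be very long
        style\b     # maybe it's <style>, or maybe <style type=...>, etc.
        .{0,2048}?
        </style
    |   # so can comments
        !--
        .{0,2048}?
        --
    |   # guessing that other tags are usually "short"
        [^>]{0,256} # search for the end '>', but don't run wild
    )
    >
""", re.VERBOSE | re.DOTALL)


def strip_old(text):
    """Hypothetical helper: the two-step substitution used before 1.40."""
    text = old_stylesheet_re.sub(' ', text)
    return old_html_re.sub(' ', text)


def strip_new(text):
    """Hypothetical helper: the single substitution used from 1.40 on."""
    return new_html_re.sub(' ', text)


# Made-up inputs exercising each branch of the new alternation.
styled = '<style type="text/css">p {color: red}</style>hi'
comment = 'before<!-- ' + 'a > b ' * 100 + '-->after'
long_tag = '<a href="' + 'x' * 200 + '">link</a>'

if __name__ == "__main__":
    for sample in (styled, comment, long_tag):
        print(repr(strip_old(sample)[:60]), "->", repr(strip_new(sample)[:60]))
```

On these inputs, strip_new(styled) gives ' hi', while strip_old(styled) strips only the two tags and leaves the CSS body in the text, because the old stylesheet pattern required a bare <style> opener. Likewise, strip_new(comment) gives 'before after', whereas the old pattern stops at the first '>' inside the comment. The 209-character opening tag in long_tag exceeds the old 128-character limit but fits within the new 256-character one.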