HTML filtering

Stuart D. Gathman stuart at bmsi.com
Wed May 1 16:04:05 EDT 2002


On Wed, 01 May 2002 15:06:36 -0400, Stuart D. Gathman wrote:

> I need to filter HTML to remove certain constructs (e.g. <script ...>
> ... </script>).  I am trying to use the batteries.  The htmllib module
> helps with the parsing, but it seems like a lot of work to create a
> formatter that passes everything (except script) through in HTML syntax
> - especially trying to preserve original syntax.  Am I missing
> something?

Here is an attempt at making a "pass through" HTML filter.  It changes the
case of end tags (e.g. "</A>" -> "</a>").  Is there a way to fix that?

import sys
import sgmllib

class HTMLFilter(sgmllib.SGMLParser):
  """Parse HTML and pass through all constructs unchanged.  It is intended for
  derived classes to implement exceptional processing for selected cases."""

  def handle_comment(self,comment):
    sys.stdout.write("<!--%s-->" % comment)

  def unknown_starttag(self,tag,attr):
    sys.stdout.write(self.get_starttag_text())
#    sys.stdout.write("<%s" % tag)
#    for (key,val) in attr:
#      sys.stdout.write(' %s="%s"' % (key,val))
#    sys.stdout.write('>')

  def handle_data(self,data):
    sys.stdout.write(data)

  def handle_entityref(self,ref):
    sys.stdout.write("&%s;" % ref)

  def handle_charref(self,ref):
    sys.stdout.write("&#%s;" % ref)
      
  def unknown_endtag(self,tag):
    sys.stdout.write("</%s>" % tag)
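
For readers on Python 3, where sgmllib and htmllib are no longer in the
standard library, the same pass-through idea can be sketched with
html.parser.HTMLParser.  This is only an illustration under stated
assumptions, not the code from this post: the class names PassThroughFilter
and ScriptStripper, the strip_scripts helper, and the choice to write to a
file-like object instead of sys.stdout are all hypothetical.  Note that this
parser also lowercases end tag names, so the "</A>" -> "</a>" change remains.

```python
import io
from html.parser import HTMLParser


class PassThroughFilter(HTMLParser):
    """Re-emit parsed HTML to `out`; subclasses override handlers to drop things."""

    def __init__(self, out):
        # convert_charrefs=False so that entity and character references reach
        # handle_entityref/handle_charref instead of being decoded into data.
        super().__init__(convert_charrefs=False)
        self.out = out

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the input.
        self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are written once, unchanged.
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser hands over a lowercased name, so "</A>" becomes "</a>".
        self.out.write("</%s>" % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_comment(self, data):
        self.out.write("<!--%s-->" % data)

    def handle_entityref(self, name):
        self.out.write("&%s;" % name)

    def handle_charref(self, name):
        self.out.write("&#%s;" % name)

    def handle_decl(self, decl):
        self.out.write("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.write("<?%s>" % data)


class ScriptStripper(PassThroughFilter):
    """Hypothetical subclass: suppress <script> elements and their contents."""

    def __init__(self, out):
        super().__init__(out)
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            super().handle_starttag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Drop a self-closing <script/> without entering "in script" state.
        if tag != "script":
            super().handle_startendtag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            super().handle_endtag(tag)

    def handle_data(self, data):
        if not self.in_script:
            super().handle_data(data)


def strip_scripts(html):
    """Return `html` with script elements removed (illustrative helper)."""
    out = io.StringIO()
    parser = ScriptStripper(out)
    parser.feed(html)
    parser.close()
    return out.getvalue()
```

Calling strip_scripts('<A HREF="x">link</A><script>alert(1)</script>') should
return '<A HREF="x">link</a>': the start tag keeps its original spelling, the
script element is gone, and the end tag is still lowercased.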



-- 
	      Stuart D. Gathman <stuart at bmsi.com>
Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.


