HTML filtering
Stuart D. Gathman
stuart at bmsi.com
Wed May 1 16:04:05 EDT 2002
On Wed, 01 May 2002 15:06:36 -0400, Stuart D. Gathman wrote:
> I need to filter HTML to remove certain constructs (e.g. <script ...>
> ... </script>). I am trying to use the batteries. The htmllib module
> helps with the parsing, but it seems like a lot of work to create a
> formatter that passes everything (except script) through in HTML syntax
> - especially trying to preserve original syntax. Am I missing
> something?
Here is an attempt at making a "pass through" HTML filter. It changes the
case of end tags (e.g. "</A>" -> "</a>"). Is there a way to fix that?

import sys
import sgmllib

class HTMLFilter(sgmllib.SGMLParser):
    """Parse HTML and pass through all constructs unchanged.  It is intended for
    derived classes to implement exceptional processing for selected cases."""

    def handle_comment(self, comment):
        sys.stdout.write("<!--%s-->" % comment)

    def unknown_starttag(self, tag, attr):
        # Echo the start tag exactly as it appeared in the input.
        sys.stdout.write(self.get_starttag_text())
        # Alternative that rebuilds the tag from its parsed parts:
        # sys.stdout.write("<%s" % tag)
        # for (key, val) in attr:
        #     sys.stdout.write(' %s="%s"' % (key, val))
        # sys.stdout.write('>')

    def handle_data(self, data):
        sys.stdout.write(data)

    def handle_entityref(self, ref):
        sys.stdout.write("&%s;" % ref)

    def handle_charref(self, ref):
        sys.stdout.write("&#%s;" % ref)

    def unknown_endtag(self, tag):
        # The parser hands over the tag name already lowercased,
        # so the original case of the end tag is lost here.
        sys.stdout.write("</%s>" % tag)

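For anyone reading this on Python 3, where sgmllib is no longer in the
standard library, here is a rough sketch of the same pass-through idea using
html.parser, extended to drop <script> elements as in the original question.
The names ScriptStrippingFilter and filter_html are made up for this
illustration, and the script-skipping logic is only one possible way to do
it. End tags still come out lowercased, so this does not solve the case
problem above.

```python
import io
from html.parser import HTMLParser


class ScriptStrippingFilter(HTMLParser):
    """Echo HTML to `out`, omitting <script> elements (illustrative sketch)."""

    def __init__(self, out):
        # convert_charrefs=False keeps entity and character references
        # as separate events so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = out
        self.skipping = False  # True while inside <script> ... </script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.skipping = True
        elif not self.skipping:
            # Original text of the start tag, attributes and case included.
            self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; overridden so that no separate
        # end tag is emitted for them.
        if tag != "script" and not self.skipping:
            self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.skipping = False
        elif not self.skipping:
            # `tag` arrives lowercased; the original case is not available.
            self.out.write("</%s>" % tag)

    def handle_data(self, data):
        if not self.skipping:
            self.out.write(data)

    def handle_entityref(self, name):
        if not self.skipping:
            self.out.write("&%s;" % name)

    def handle_charref(self, name):
        if not self.skipping:
            self.out.write("&#%s;" % name)

    def handle_comment(self, data):
        if not self.skipping:
            self.out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        # Declarations such as <!DOCTYPE html>
        self.out.write("<!%s>" % decl)


def filter_html(text):
    """Return `text` with <script> elements removed (hypothetical helper)."""
    out = io.StringIO()
    parser = ScriptStrippingFilter(out)
    parser.feed(text)
    parser.close()
    return out.getvalue()


if __name__ == "__main__":
    print(filter_html('<p class="x">Hi &amp; bye<script>alert(1)</script></p>'))
```

Writing to a file-like object passed in by the caller, rather than directly
to sys.stdout, just makes the filter easier to test; the handler methods
otherwise mirror the ones in the class above.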
--
Stuart D. Gathman <stuart at bmsi.com>
Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.