Simple allowing of HTML elements/attributes?

Alan Kennedy alanmk at
Thu Feb 12 14:13:52 EST 2004

[Alan Kennedy]
> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
> SAX to filter out the stuff you don't want. Here is some code that
> does the latter. This should be nice and fast, and use a lot less
> memory than object-model based approaches.

Unfortunately, in my haste to post a demonstration of a technique
earlier on, I posted running code that is both buggy and *INSECURE*.
The following are the problems with it:

1. A bug in making up the attribute string results in loss of
permitted attributes.

2. The failure to escape character data (i.e. map '<' to '&lt;' and
'>' to '&gt;') as it is written out gives rise to the possibility of a
code injection attack. It's easy to circumvent the check for malicious
code: I'll leave it to y'all to figure out how.

3. I have a feeling that the failure to escape the attribute values
also opens the possibility of a code injection attack. I'm not
certain: it depends on the browser environment in which the final HTML
is rendered.
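
To make points 2 and 3 concrete, here is a small sketch of what the two
xml.sax.saxutils helpers do to hostile input. The two input strings are
made up for illustration; they are assumptions, not taken from any real
attack.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical hostile inputs, invented for this illustration.
hostile_text = '<script>alert("gotcha")</script>'
hostile_attr = 'x" onmouseover="alert(1)'

# escape() maps '&', '<' and '>' to entities, so markup smuggled in as
# character data is written out as plain text rather than as tags.
safe_text = escape(hostile_text)

# quoteattr() escapes the value and wraps it in quotes, picking the quote
# character (or falling back to &quot;) so that the value cannot close
# the attribute early.
safe_attr = quoteattr(hostile_attr)

print(safe_text)
print("<p class=%s>" % safe_attr)
```

Because the attribute value here contains double quotes but no single
quotes, quoteattr() wraps it in single quotes, so the embedded
onmouseover text stays inside the class value.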

Anyway, here's some updated code that closes the SECURITY HOLES in the
earlier-posted version :-(

import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO

permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

  def __init__(self):
    self.outbuf = StringIO.StringIO()

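  def startElement(self, elemname, attrs):
    # Emit the start tag only for permitted elements, keeping only the
    # permitted attributes; quoteattr() escapes and quotes each value.
    if elemname in permittedElements:
      attrstr = ""
      for a in attrs.keys():
        if a in permittedAttrs:
          attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
            quoteattr(attrs[a])))
      self.outbuf.write("<%s%s>" % (elemname, attrstr))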

  def endElement(self, elemname):
    if elemname in permittedElements:
      self.outbuf.write("</%s>" % (elemname,))

  def characters(self, s):
    self.outbuf.write("%s" % (escape(s),))

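# Line endings below were lost from this copy; they are completed minimally
# (wrapper elements, closing tags) so the document is well-formed XML.
testdoc = """
<html><body>
    <p class="1" id="2">This paragraph contains <b>only</b> permitted
    elements.</p>
    <p>This paragraph contains <i>italic</i> text.</p>
    <img src=""/>
    <p>This paragraph contains
    <script src="blackhat.js"/>a potential script attack.</p>
</body></html>
"""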

if __name__ == "__main__":
  parser = xml.sax.make_parser()
  mycleaner = cleaner()
  parser.setContentHandler(mycleaner)
  parser.setFeature(xml.sax.handler.feature_namespaces, 0)
  parser.parse(StringIO.StringIO(testdoc))
  print mycleaner.outbuf.getvalue()
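
The code above targets Python 2 (cStringIO, the print statement). For
Python 3, the same technique might be sketched roughly as follows. The
names Cleaner and clean() and the sample input are assumptions made for
this sketch, not part of the original code; xml.sax.parseString() is used
as a shortcut for creating the parser and attaching the handler.

```python
import io
import xml.sax
import xml.sax.handler
from xml.sax.saxutils import escape, quoteattr

PERMITTED_ELEMENTS = {'html', 'body', 'b', 'i', 'p'}
PERMITTED_ATTRS = {'class', 'id'}


class Cleaner(xml.sax.handler.ContentHandler):
    """Copy permitted elements and attributes to a buffer, escaping text."""

    def __init__(self):
        super().__init__()
        self.outbuf = io.StringIO()

    def startElement(self, name, attrs):
        # Tags of non-permitted elements are silently dropped.
        if name in PERMITTED_ELEMENTS:
            attrstr = "".join(
                " %s=%s" % (a, quoteattr(attrs[a]))
                for a in attrs.keys() if a in PERMITTED_ATTRS)
            self.outbuf.write("<%s%s>" % (name, attrstr))

    def endElement(self, name):
        if name in PERMITTED_ELEMENTS:
            self.outbuf.write("</%s>" % name)

    def characters(self, content):
        # All character data is kept, but escaped on the way out.
        self.outbuf.write(escape(content))


def clean(xml_text):
    """Hypothetical helper: parse well-formed XML, return filtered markup."""
    handler = Cleaner()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.outbuf.getvalue()


# Sample input invented for this sketch.
sample = ('<p class="1" onclick="evil()">Hi '
          '<script src="x.js"/>there &lt;b&gt;</p>')
result = clean(sample)
print(result)
```

With this sample, the onclick attribute and the script element are
dropped, while the text is kept and written out with '<' and '>'
escaped. Note that the text content of a non-permitted element is still
emitted (escaped); only its tags are removed.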


