Simple allowing of HTML elements/attributes?

Alan Kennedy alanmk at hotmail.com
Thu Feb 12 14:13:52 EST 2004


[Alan Kennedy]
> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
> SAX to filter out the stuff you don't want. Here is some code that
> does the latter. This should be nice and fast, and use a lot less
> memory than object-model based approaches.

Unfortunately, in my haste to post a demonstration of a technique
earlier on, I posted running code that is both buggy and *INSECURE*.
The following are problems with it

1. A bug in making up the attribute string results in loss of
permitted attributes.

2. The failure to escape character data (i.e. map '<' to '<' and
'>' to '>') as it is written out gives rise to the possibility of a
code injection attack. It's easy to circumvent the check for malicious
code: I'll leave to y'all to figure out how.

3. I have a feeling that the failure to escape the attribute values
also opens the possibility of a code injection attack. I'm not
certain: it depends on the browser environment in which the final HTML
is rendered.

Anyway, here's some updated code that closes the SECURITY HOLES in the
earlier-posted version :-(

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO

permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

  def __init__(self):
    xml.sax.handler.ContentHandler.__init__(self)
    self.outbuf = StringIO.StringIO()

  def startElement(self, elemname, attrs):
    if elemname in permittedElements:
      attrstr = ""
      for a in attrs.keys():
        if a in permittedAttrs:
          attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
quoteattr(attrs[a])))
      self.outbuf.write("<%s%s>" % (elemname, attrstr))

  def endElement(self, elemname):
    if elemname in permittedElements:
      self.outbuf.write("</%s>" % (elemname,))

  def characters(self, s):
    self.outbuf.write("%s" % (escape(s),))

testdoc = """
<html>
  <body>
    <p class="1" id="2">This paragraph contains <b>only</b> permitted
elements.</p>
    <p>This paragraph contains <i 
    onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
    <img src="http://www.blackhat.com/session_hijack.gif"/>
    <p>This paragraph contains
    <script src="blackhat.js"/>a potential script
    attack</p>
  </body>
</html>
"""

if __name__ == "__main__":
  parser = xml.sax.make_parser()
  mycleaner = cleaner()
  parser.setContentHandler(mycleaner)
  parser.setFeature(xml.sax.handler.feature_namespaces, 0)
  parser.feed(testdoc)
  print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

regards,

-- 
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/contact/alan



More information about the Python-list mailing list