Simple allowing of HTML elements/attributes?
Alan Kennedy
alanmk at hotmail.com
Thu Feb 12 14:13:52 EST 2004
[Alan Kennedy]
> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
> SAX to filter out the stuff you don't want. Here is some code that
> does the latter. This should be nice and fast, and use a lot less
> memory than object-model based approaches.
Unfortunately, in my haste to post a demonstration of a technique
earlier on, I posted running code that is both buggy and *INSECURE*.
The following are problems with it
1. A bug in making up the attribute string results in loss of
permitted attributes.
2. The failure to escape character data (i.e. map '<' to '<' and
'>' to '>') as it is written out gives rise to the possibility of a
code injection attack. It's easy to circumvent the check for malicious
code: I'll leave to y'all to figure out how.
3. I have a feeling that the failure to escape the attribute values
also opens the possibility of a code injection attack. I'm not
certain: it depends on the browser environment in which the final HTML
is rendered.
Anyway, here's some updated code that closes the SECURITY HOLES in the
earlier-posted version :-(
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO
permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]
class cleaner(xml.sax.handler.ContentHandler):
def __init__(self):
xml.sax.handler.ContentHandler.__init__(self)
self.outbuf = StringIO.StringIO()
def startElement(self, elemname, attrs):
if elemname in permittedElements:
attrstr = ""
for a in attrs.keys():
if a in permittedAttrs:
attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
quoteattr(attrs[a])))
self.outbuf.write("<%s%s>" % (elemname, attrstr))
def endElement(self, elemname):
if elemname in permittedElements:
self.outbuf.write("</%s>" % (elemname,))
def characters(self, s):
self.outbuf.write("%s" % (escape(s),))
testdoc = """
<html>
<body>
<p class="1" id="2">This paragraph contains <b>only</b> permitted
elements.</p>
<p>This paragraph contains <i
onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
<img src="http://www.blackhat.com/session_hijack.gif"/>
<p>This paragraph contains
<script src="blackhat.js"/>a potential script
attack</p>
</body>
</html>
"""
if __name__ == "__main__":
parser = xml.sax.make_parser()
mycleaner = cleaner()
parser.setContentHandler(mycleaner)
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
parser.feed(testdoc)
print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
regards,
--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
More information about the Python-list
mailing list