Simple allowing of HTML elements/attributes?

Thu Feb 12 08:00:30 EST 2004

[Leif K-Brooks]
>>> I'm writing a site with mod_python which will have, among other
>>> things, forums. I want to allow users to use some HTML (<em>,
>>> <strong>, <p>, etc.) on the forums, but I don't want to allow bad
>>> elements and attributes (onclick, <script>, etc.). I would also like
>>> to do basic validation (no overlapping elements like
>>> <strong><em>foo</strong></em>, no missing end tags). I'm not asking
>>> anyone to write a script for me, but does anyone have general ideas
>>> about how to do this quickly on an active forum?

"Quickly" being an important consideration for you, I'm presuming.

[David M. Cooke]
>> You could require valid XML, and use a validating XML parser to
>> check conformance. You'd have to make sure the output is correctly
>> quoted (for instance, check that HTML tags in a CDATA block get quoted).

Hmmm, I'd imagine that the average forum user isn't going to know what
well-formed XML is. Also, validating-XML support is one of the areas
where Python is lacking. Lastly, wrapping HTML tags in a CDATA block
won't deliver much benefit: you still have to send that HTML to the
browser, which will probably render the contents of the CDATA block
anyway.
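
That said, a plain well-formedness check (not DTD validation) is cheap,
and it does catch the overlapping elements and missing end tags that
Leif mentions. A minimal sketch, assuming each post is checked as a
fragment; the function name is_well_formed and the dummy <post> wrapper
are my own inventions for illustration:

```python
import xml.parsers.expat

def is_well_formed(fragment):
    """Return True if the fragment parses as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # Wrap in a dummy root element so that a post with several
        # top-level elements (or bare text) is still one document.
        parser.Parse("<post>%s</post>" % fragment, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True

print(is_well_formed("<strong><em>foo</em></strong>"))  # properly nested
print(is_well_formed("<strong><em>foo</strong></em>"))  # overlapping
print(is_well_formed("<p>no end tag"))                  # missing end tag
```

This only tells you whether the markup is well-formed; it says nothing
about which elements or attributes are acceptable.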

[Graham Fawcett]
> You could use Tidy (or tidylib) to convert error-ridden input into
> valid HTML or XHTML, and then grab the BODY contents via an XML
> parser, as David suggested. I imagine that the library version of tidy
> is quick enough to meet your needs.

This is a good idea. Tidy is a reliable way to get easily processable
XML from badly-formed HTML. There are multiple ways to run Tidy from
Python: use a wrapper library (MAL's mxTidy, or uTidylib), use the
command line executable and pipes, or in Jython use JTidy.

http://sourceforge.net/projects/jtidy
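
For the command-line-and-pipes route, something like the following
might do. It is only a sketch: it assumes a "tidy" executable is on the
PATH, and the function names tidy_command and tidy_to_xhtml are made up
for this example. The flags are ordinary HTML Tidy options, but check
them against the version you have installed.

```python
import subprocess

def tidy_command(executable="tidy"):
    # The executable name is an assumption about the local install.
    return [executable,
            "-quiet",                     # suppress nonessential output
            "-asxhtml",                   # emit well-formed XHTML
            "-utf8",                      # read and write UTF-8
            "--numeric-entities", "yes",  # numeric refs instead of &nbsp; etc.
            "--force-output", "yes"]      # write output even after errors

def tidy_to_xhtml(html, executable="tidy"):
    """Pipe an HTML string through Tidy and return what it writes."""
    proc = subprocess.run(tidy_command(executable),
                          input=html.encode("utf-8"),
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE)
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if proc.returncode > 1 and not proc.stdout:
        raise ValueError("tidy failed: %s"
                         % proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8")
```

Calling tidy_to_xhtml("<p>unclosed <b>bold") should hand back a complete
XHTML document that an XML parser will accept. Starting a process per
post has a cost, so on a busy forum a library binding is probably the
better bet.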

[Graham Fawcett]
> Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
> XHTML. (Not something I'm familiar with, but someone must have done
> this before.)

However, this is not a good idea. XSLT requires an object model of the
whole document in memory, meaning that you're going to use a lot of CPU
time and memory. In extreme cases, e.g. where some black-hat attempts
to upload a 20 MB HTML file, you're opening yourself up to a
Denial-of-Service attack when your server tries to build a [D]OM of
that document.
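
Whatever parser you end up with, one inexpensive guard against the
oversized-upload case is to cap the input as it is read, before most of
it ever reaches the parser. A rough sketch, assuming an incremental SAX
parser; MAX_BYTES, PostTooLarge and feed_limited are names I made up,
and the limit itself is an arbitrary figure:

```python
import io
import xml.sax

MAX_BYTES = 64 * 1024   # hypothetical per-post limit; tune to taste

class PostTooLarge(Exception):
    pass

def feed_limited(parser, stream, max_bytes=MAX_BYTES, chunk_size=4096):
    """Feed a byte stream to an incremental SAX parser in chunks,
    refusing to continue once max_bytes has been exceeded."""
    seen = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        seen += len(chunk)
        if seen > max_bytes:
            raise PostTooLarge("post exceeds %d bytes" % max_bytes)
        parser.feed(chunk)
    parser.close()
    return seen

parser = xml.sax.make_parser()
print(feed_limited(parser, io.BytesIO(b"<p>hello</p>")))
```

Because the parser is fed a chunk at a time, an over-long post is
rejected after at most max_bytes have been looked at, rather than after
the whole thing has been read in.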

The optimal solution, IMHO, is to tidy the HTML into XML, and then use
SAX to filter out the stuff you don't want. Here is some code that
does the latter. This should be nice and fast, and use a lot less
memory than object-model based approaches.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import io
import xml.sax
import xml.sax.handler
from xml.sax.saxutils import escape, quoteattr

# Whitelists: anything not named here is dropped from the output.
permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id']

class cleaner(xml.sax.handler.ContentHandler):

  def __init__(self):
    xml.sax.handler.ContentHandler.__init__(self)
    self.outbuf = io.StringIO()

  def startElement(self, elemname, attrs):
    if elemname in permittedElements:
      attrstr = ""
      for a in attrs.keys():
        if a in permittedAttrs:
          # Append (rather than overwrite) each permitted attribute,
          # with a leading space and a safely quoted value.
          attrstr += " %s=%s" % (a, quoteattr(attrs[a]))
      self.outbuf.write("<%s%s>" % (elemname, attrstr))

  def endElement(self, elemname):
    if elemname in permittedElements:
      self.outbuf.write("</%s>" % (elemname,))

  def characters(self, s):
    # Re-escape text, so that &lt;script&gt; in the input cannot come
    # out as a live <script> tag. Note that the text inside a dropped
    # element is still passed through; only its tags are removed.
    self.outbuf.write(escape(s))

testdoc = """
<html>
  <body>
    <p>This paragraph contains <b>only</b> permitted elements.</p>
    <p>This paragraph contains <i
    onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
    <img src="http://www.blackhat.com/session_hijack.gif"/>
    <p>This paragraph contains
    <a href="http://www.jscript-attack.com/">a potential script
    attack</a></p>
  </body>
</html>
"""

if __name__ == "__main__":
  parser = xml.sax.make_parser()
  mycleaner = cleaner()
  parser.setContentHandler(mycleaner)
  parser.setFeature(xml.sax.handler.feature_namespaces, 0)
  parser.feed(testdoc)
  parser.close()  # tell the parser that the document is complete
  print(mycleaner.outbuf.getvalue())
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
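
One thing to be aware of: the handler above removes the tags of elements
it doesn't permit, but still passes their text through. That is what you
want for a link, but probably not for a <script> or <style> element. A
possible variation, purely as a sketch, keeps a depth counter and
discards everything inside such elements. The names strictcleaner,
dropContents and clean, and the dummy <post> wrapper, are hypothetical
and only there for illustration:

```python
import io
import xml.sax
import xml.sax.handler
from xml.sax.saxutils import escape

permitted = {'b', 'i', 'p'}
dropContents = {'script', 'style'}  # hypothetical: text inside is discarded

class strictcleaner(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = io.StringIO()
        self.skipdepth = 0  # > 0 while inside a dropContents element

    def startElement(self, elemname, attrs):
        if self.skipdepth or elemname in dropContents:
            self.skipdepth += 1
        elif elemname in permitted:
            self.outbuf.write("<%s>" % elemname)

    def endElement(self, elemname):
        if self.skipdepth:
            self.skipdepth -= 1
        elif elemname in permitted:
            self.outbuf.write("</%s>" % elemname)

    def characters(self, s):
        if not self.skipdepth:
            self.outbuf.write(escape(s))

def clean(fragment):
    """Parse a (tidied) fragment and return the filtered markup."""
    handler = strictcleaner()
    doc = "<post>%s</post>" % fragment
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.outbuf.getvalue()

print(clean("<p>Hi <script>alert(1)</script><b>there</b></p>"))
```

Attribute handling is left out here to keep the sketch short; it would
work the same way as in the code above.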

Tidying the HTML to XML is left as an exercise for the reader ;-)

HTH,

-- 
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/contact/alan