Simple allowing of HTML elements/attributes?
Robert Brewer
fumanchu at amor.org
Thu Feb 12 18:58:09 EST 2004
Alan Kennedy wrote:
> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
> SAX to filter out the stuff you don't want. Here is some code that
> does the latter. This should be nice and fast, and use a lot less
> memory than object-model based approaches.
>
> #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> import xml.sax
> import cStringIO as StringIO
>
> permittedElements = ['html', 'body', 'b', 'i', 'p']
> permittedAttrs = ['class', 'id']
>
> class cleaner(xml.sax.handler.ContentHandler):
>
>     def __init__(self):
>         xml.sax.handler.ContentHandler.__init__(self)
>         self.outbuf = StringIO.StringIO()
>
>     def startElement(self, elemname, attrs):
>         if elemname in permittedElements:
>             attrstr = ""
>             for a in attrs.keys():
>                 if a in permittedAttrs:
>                     # Accumulate each permitted attribute, with a
>                     # leading space to separate it from the tag name.
>                     attrstr += " %s='%s'" % (a, attrs[a])
>             self.outbuf.write("<%s%s>" % (elemname, attrstr))

Very interesting, Alan! I rolled my own solution to this the other day,
relying more on regexes. Yours might be more usable.

One issue: the code, as written, turns well-formed XHTML empty tags like
<br /> into <br></br>. Any recommendations, besides the brute-force approach
of keeping a list of allowed empty tags, for dealing with this?
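
One non-list idea, offered only as a rough sketch and not as anything Alan proposed: hold back the closing ">" of each start tag until the next SAX event arrives. If that event is the matching endElement, write " />"; otherwise write ">" and carry on. The names DeferredCloseCleaner, PERMITTED_ELEMENTS, PERMITTED_ATTRS and clean are hypothetical, 'br' is added to the whitelist purely for the demonstration, and the sketch targets current Python (io.StringIO, not cStringIO).

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape, quoteattr

# Hypothetical whitelists; 'br' is included only to demonstrate empty tags.
PERMITTED_ELEMENTS = {'html', 'body', 'b', 'i', 'p', 'br'}
PERMITTED_ATTRS = {'class', 'id'}


class DeferredCloseCleaner(ContentHandler):
    """Whitelist filter that writes <x /> for elements with no content."""

    def __init__(self):
        super().__init__()
        self.outbuf = io.StringIO()
        self.pending = False  # True while a start tag still lacks its '>'

    def _flush(self):
        # Something follows the start tag, so the element is not empty.
        if self.pending:
            self.outbuf.write(">")
            self.pending = False

    def startElement(self, name, attrs):
        self._flush()
        if name in PERMITTED_ELEMENTS:
            attrstr = "".join(
                " %s=%s" % (a, quoteattr(attrs[a]))
                for a in attrs.keys() if a in PERMITTED_ATTRS)
            # Leave the tag open until we know whether it has content.
            self.outbuf.write("<%s%s" % (name, attrstr))
            self.pending = True

    def characters(self, content):
        if content:
            self._flush()
            self.outbuf.write(escape(content))

    def endElement(self, name):
        if name not in PERMITTED_ELEMENTS:
            return
        if self.pending:
            # No event arrived since the start tag: emit the empty form.
            self.outbuf.write(" />")
            self.pending = False
        else:
            self.outbuf.write("</%s>" % name)


def clean(xml_bytes):
    """Parse well-formed XML bytes and return the filtered markup."""
    handler = DeferredCloseCleaner()
    xml.sax.parseString(xml_bytes, handler)
    return handler.outbuf.getvalue()
```

With this, clean(b"<p>a<br />b</p>") keeps the <br /> as written. The trade-off is that SAX reports <br /> and <br></br> identically, so every element that happens to be empty (an empty <p></p>, say) also comes out in the short form, which some HTML browsers handle badly. A list of allowed empty tags may still be needed for those.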
Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org