Simple allowing of HTML elements/attributes?

Robert Brewer fumanchu at amor.org
Thu Feb 12 18:58:09 EST 2004


Alan Kennedy wrote:
> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
> SAX to filter out the stuff you don't want. Here is some code that
> does the latter. This should be nice and fast, and use a lot less
> memory than object-model based approaches.
> 
> #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> import xml.sax
> import cStringIO as StringIO
> 
> permittedElements = ['html', 'body', 'b', 'i', 'p']
> permittedAttrs = ['class', 'id', ]
> 
> class cleaner(xml.sax.handler.ContentHandler):
> 
>   def __init__(self):
>     xml.sax.handler.ContentHandler.__init__(self)
>     self.outbuf = StringIO.StringIO()
> 
>   def startElement(self, elemname, attrs):
>     if elemname in permittedElements:
>       attrstr = ""
>       for a in attrs.keys():
>         if a in permittedAttrs:
>           attrstr = "%s %s='%s'" % (attrstr, a, attrs[a])
>       self.outbuf.write("<%s%s>" % (elemname, attrstr))

Very interesting, Alan! I rolled my own solution to this the other day,
relying more on regexes. Your SAX approach might be more usable.

One issue: the handler, as written, turns well-formed XHTML empty-element
tags like <br /> into <br></br>, because SAX reports the same start/end
events for both forms. Any recommendations, besides the brute-force
approach of keeping a list of allowed empty tags, for dealing with this?
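
One possibility, sketched here only as an untested idea rather than a
recommendation: hold back the closing ">" of each start tag until the next
SAX event arrives. If that event is the matching endElement, nothing came
in between and the tag can be written as <tag />. The names
EmptyTagCleaner, pending, and clean are made up for illustration, 'br' is
added to the permitted list as an assumption, and the sketch uses Python 3's
io.StringIO in place of cStringIO.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape, quoteattr

# Assumption: 'br' added to the whitelist from the quoted code.
PERMITTED_ELEMENTS = {'html', 'body', 'b', 'i', 'p', 'br'}
PERMITTED_ATTRS = {'class', 'id'}


class EmptyTagCleaner(ContentHandler):
    """Whitelist filter that writes childless elements as <tag />."""

    def __init__(self):
        super().__init__()
        self.outbuf = io.StringIO()
        # Start tag text that has not yet been closed with '>' or ' />'.
        self.pending = None

    def _flush(self):
        # Something follows the pending start tag, so close it normally.
        if self.pending is not None:
            self.outbuf.write(self.pending + ">")
            self.pending = None

    def startElement(self, name, attrs):
        self._flush()
        if name in PERMITTED_ELEMENTS:
            kept = "".join(" %s=%s" % (a, quoteattr(attrs[a]))
                           for a in attrs.keys() if a in PERMITTED_ATTRS)
            self.pending = "<%s%s" % (name, kept)

    def characters(self, content):
        self._flush()
        self.outbuf.write(escape(content))

    def endElement(self, name):
        if name not in PERMITTED_ELEMENTS:
            return
        if self.pending is not None:
            # No text or children since the start tag: empty-element form.
            self.outbuf.write(self.pending + " />")
            self.pending = None
        else:
            self.outbuf.write("</%s>" % name)


def clean(xhtml):
    """Parse a well-formed XHTML string and return the filtered markup."""
    handler = EmptyTagCleaner()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.outbuf.getvalue()


print(clean('<p class="x" onclick="evil()">Hi<br /><u>there</u></p>'))
# prints: <p class="x">Hi<br />there</p>
```

The catch is that this collapses every childless element, so <b></b> comes
out as <b />, which is fine as XML but not for browsers parsing the result
as HTML. For that case a list of empty tags may still be needed. Note also
that text inside a dropped element (the <u> above) is kept.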


Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org



