HTML cleaner?

Leif K-Brooks eurleif at ecritters.biz
Mon Apr 25 06:23:43 EDT 2005


Ivan Voras wrote:
> Is there an HTML clean/tidy library or module written in pure Python? I 
> found mxTidy, but it's an interface to a command-line tool.
> 
> What I'm searching for is something that will accept a list of allowed 
> tags and/or attributes and strip the rest from an HTML string.

Here's a module I wrote to do something along the lines of what you 
want: <http://ecritters.biz/limithtml.py>. Unfortunately, it requires 
the HTML to be relatively well-formed (e.g. it doesn't like things like 
"<i><b>foo</i></b>"), so I feed the HTML into uTidyLib (another 
interface to HTML Tidy) first. I'm not sure why you don't want to use 
Tidy, but if you do change your mind, you should be able to use my 
module alongside Tidy to limit the HTML elements and attributes which 
will be accepted.
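
To give a rough idea of the approach, here is a minimal sketch of a 
whitelist filter. It is not the code from limithtml.py; it assumes 
Python 3 and the standard library's html.parser, and the names ALLOWED, 
WhitelistFilter and limit_html are made up for illustration. Like my 
module, it does not repair badly nested markup, so running the input 
through Tidy first is still a good idea.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> set of attribute names to keep.
ALLOWED = {
    "a": {"href"},
    "b": set(),
    "i": set(),
    "p": set(),
}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep all text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed:
            return  # drop the tag itself; its text content is still kept
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in self.allowed[tag]
        )
        self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were already decoded, so escape again.
        self.out.append(escape(data, quote=False))


def limit_html(html, allowed=ALLOWED):
    """Return html with every non-whitelisted tag and attribute removed."""
    parser = WhitelistFilter(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, limit_html('<p onclick="x()">Hi <span>you</span></p>') 
keeps the <p> element but drops the onclick attribute and the <span> 
tags. Note that this sketch only filters by name: it does not check 
attribute values (such as javascript: URLs in href) and it keeps the 
text inside dropped elements, so it should not be treated as a complete 
sanitizer.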


