Re: I'm looking for html cleaner. Example : convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>…

John Nagle nagle at animats.com
Mon Mar 29 20:35:09 EDT 2010


Stéphane Klein wrote:
> Hi,
> 
> I'm working on an HTML cleaner.
> 
> I export OpenOffice.org documents to HTML.
> Next, I would like to clean these exported HTML files:
> 
> * remove comments
> * remove styles
> * remove dispensable tags
> * ...

    Try parsing with html5lib ("http://code.google.com/p/html5lib/"), which
is the closest thing to a good HTML parser available for Python.  It's
basically a reference implementation of the HTML5 parsing algorithm,
including all the handling of bad HTML.

    Once you have a tree, write something that walks it and removes every
empty element whose tag is on a list of tags which do nothing when empty.
Then regenerate HTML from the tree.
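
A minimal sketch of that tree pass, to make the idea concrete. It works on
xml.etree.ElementTree elements, which is one of the tree types html5lib can
build (as far as I know, via
html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)).
To stay runnable with only the standard library, the demo parses a
well-formed snippet with ElementTree itself (Python 3.8+ for
insert_comments). The tag lists and the names clean() and
parse_with_comments() are made up for illustration and are not part of any
library.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag lists; adjust them to the documents being cleaned.
UNWRAP = {"span", "font"}                      # dissolve, keep the content
DROP_WHEN_EMPTY = {"p", "div", "b", "i", "u"}  # do nothing when empty
DROP_ALWAYS = {"style"}                        # removed with their content


def _append_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def clean(elem):
    """Clean the subtree below `elem` in place, working bottom-up."""
    elem.attrib.pop("style", None)  # drop inline styles
    i = 0
    while i < len(elem):
        child = elem[i]
        # Comments and <style> elements go away; text following them stays.
        if child.tag is ET.Comment or child.tag in DROP_ALWAYS:
            _append_text(elem, i, child.tail)
            del elem[i]
            continue
        clean(child)
        # Whitespace-only text counts as content here, so that removing an
        # element never glues two words together.
        is_empty = len(child) == 0 and not child.text
        if child.tag in DROP_WHEN_EMPTY and is_empty:
            _append_text(elem, i, child.tail)
            del elem[i]
            continue
        if child.tag in UNWRAP:
            # Splice the wrapper's text, children and tail into the parent.
            _append_text(elem, i, child.text)
            grandchildren = list(child)
            del elem[i]
            for offset, grandchild in enumerate(grandchildren):
                elem.insert(i + offset, grandchild)
            _append_text(elem, i + len(grandchildren), child.tail)
            i += len(grandchildren)  # they have already been cleaned
            continue
        i += 1


def parse_with_comments(markup):
    """Stand-in parser for the demo: well-formed input only, keeps comments."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    return ET.fromstring(markup, parser=parser)


SAMPLE = (
    '<body><!-- exported --><style>h1 {color: red}</style>'
    '<h1 style="margin: 0"><span><font>my title</font></span></h1>'
    '<p></p><p>Some <span>plain</span> text.</p></body>'
)

root = parse_with_comments(SAMPLE)
clean(root)
result = ET.tostring(root, encoding="unicode", method="html")
print(result)
# <body><h1>my title</h1><p>Some plain text.</p></body>
```

Most of the bookkeeping is there because ElementTree stores the text that
follows an element in that element's "tail", so removing or unwrapping an
element means moving its text onto the previous sibling or the parent. For
real OpenOffice.org output you would swap parse_with_comments() for the
html5lib parse, since the export is unlikely to be well-formed XML.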

    Or just use HTML Tidy: "http://www.w3.org/People/Raggett/tidy/"

					John Nagle
