[XML-SIG] Exception handling with xml.dom.ext.reader.HtmlLib

Alan Kennedy pyxml@xhaus.com
Thu, 26 Apr 2001 10:04:45 +0100


Mark,

I'm currently working on an application that needs to parse HTML, and have
looked at htmllib as a way to do it.

However, htmllib seems to parse only HTML 2.0 and, as you have pointed out,
is not very tolerant of the structural errors that typify a lot of HTML
pages.

One avenue I'm currently investigating is to use Dave Raggett's Tidy program,
which takes a messy HTML file and outputs a cleaned-up version, i.e. tags
rebalanced, attributes quoted, etc. It also has some support for XML and
XHTML. While this support is not complete, it is very good.

You can find tidy at

http://www.w3.org/People/Raggett/tidy/

This program is written in C, so it should be possible to use it directly
from Python. The documentation for Tidy mentions that someone has done a
SWIG interface for it.
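
Short of a SWIG binding, the simplest route is probably to run the tidy
executable as a child process and read its output. Below is a minimal
sketch, assuming a "tidy" binary is on the PATH and accepts the -q, -utf8
and -asxml command-line options. The function names tidy_command and
tidy_to_xhtml are made up for this example and are not part of any
library.

```python
import shutil
import subprocess


def tidy_command(executable="tidy"):
    """Build the argument list for a quiet, UTF-8, XML-output tidy run."""
    # -q      suppress non-essential output
    # -utf8   treat input and output as UTF-8
    # -asxml  write the result as well-formed XML (XHTML)
    return [executable, "-q", "-utf8", "-asxml"]


def tidy_to_xhtml(html, executable="tidy"):
    """Pipe an HTML string through tidy and return the cleaned-up markup.

    Assumes tidy's usual exit codes: 0 = clean, 1 = warnings, 2 = errors.
    """
    if shutil.which(executable) is None:
        raise RuntimeError("tidy executable not found: %s" % executable)
    proc = subprocess.run(
        tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if proc.returncode >= 2:
        # tidy could not recover; pass its diagnostics on to the caller
        raise RuntimeError(proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8")
```

Warnings (exit code 1) are treated as success here, since warnings are
what one expects when feeding it messy real-world HTML.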

There is also a Java version, which could be used from Jython fairly easily.

I had a look at the code to see if it might be feasible to insert some hooks
into it to turn it into a generator of SAX events, but the code is quite
messy, and the printing/output works at a character and buffer level, not an
element and attribute level.
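
An alternative that avoids touching the Tidy source is to let it write
well-formed XHTML and then run an ordinary SAX parser over that output.
Below is a minimal sketch using the standard xml.sax module. The sample
string merely stands in for Tidy output, and EventRecorder and sax_events
are names invented for this example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventRecorder(ContentHandler):
    """Collect SAX callbacks as a flat list of tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only text such as indentation between tags
        text = content.strip()
        if text:
            self.events.append(("text", text))


def sax_events(xhtml):
    """Parse a well-formed XHTML string and return the recorded events."""
    handler = EventRecorder()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.events


# Stand-in for cleaned-up output; real Tidy output would also carry a
# DOCTYPE and namespace declaration.
sample = '<html><body><p class="x">Hello</p></body></html>'
events = sax_events(sample)
```

This only works once the markup is well-formed, which is exactly the job
Tidy would be doing in the first step.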

Alan.