[XML-SIG] PyXML Question

Alan Kennedy pyxml@xhaus.com
Sat, 06 Oct 2001 13:19:56 +0100


"Martin v. Loewis" wrote:

> Note that processing HTML with XML libraries is always risky, as HTML
> documents are not XML documents (unless they comply with XHTML);
> often, they don't even comply with the HTML DTD. In these cases,
> processors can easily get confused.

Although I haven't used the Python version, Dave Raggett's excellent Tidy
program will clean up malformed HTML and turn it into XHTML, which should
then be parsable by XML processors.

Marc-Andre Lemburg has provided a Python interface to HTML Tidy, which is
now part of the eGenix Experimental Package. You can find it here:

http://www.lemburg.com/files/python/index.html

As I remember it, HTML Tidy handles most of the common problems you
would encounter when processing malformed HTML as XML.
For example, I think it will wrap the content of <SCRIPT> elements in
<![CDATA[ ]]> markers so that your XML parser won't choke on [<>&]
characters that might be found in Javascript code.
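
To illustrate why CDATA wrapping matters, here is a minimal sketch using
only the Python standard library. It does not call Tidy at all; the two
sample strings and the helper name parses_as_xml are made up for this
example, and the wrapped string is only an assumption about what
CDATA-wrapped script content would look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical script content, before and after CDATA wrapping.
RAW = "<script>if (a < b && c > d) { run(); }</script>"
WRAPPED = "<script><![CDATA[if (a < b && c > d) { run(); }]]></script>"


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("raw script parses:    ", parses_as_xml(RAW))
    print("wrapped script parses:", parses_as_xml(WRAPPED))
    # Inside CDATA the parser hands back the code as plain character data.
    print(ET.fromstring(WRAPPED).text)
```

The unwrapped version is rejected because the bare < and & are treated as
markup, while the CDATA version comes through with the Javascript intact.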

The only thing missing from the original HTML Tidy was a way to generate
the tidied output as a SAX stream. Instead, you have to put the output into
a file or a string and parse it into whatever XML form you require, using
the standard PyXML parsing tools.
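
A rough sketch of that second step, assuming the tidied XHTML is already
sitting in a string: the code below feeds it to a SAX handler using the
xml.sax module from the standard library. The TIDIED sample, the
TagCollector class and the collect_tags function are all hypothetical
names for illustration, and the sample is hand-written rather than real
Tidy output.

```python
import xml.sax

# Hand-written stand-in for tidied output (assumption, not produced by Tidy).
TIDIED = (
    b"<html><head><title>Example</title></head>"
    b"<body><p>Hello<br /></p></body></html>"
)


class TagCollector(xml.sax.ContentHandler):
    """Record the name of every element as its start tag is reported."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def collect_tags(xhtml_bytes):
    """Parse a byte string as XML and return element names in document order."""
    handler = TagCollector()
    xml.sax.parseString(xhtml_bytes, handler)
    return handler.tags


if __name__ == "__main__":
    print(collect_tags(TIDIED))
```

The same handler could just as well build a DOM or any other structure;
the point is only that the string goes through an ordinary XML parser.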

Alan.