[XML-SIG] Parsing HTML to DOM

Wade Leftwich wade@okaynetwork.com
Sat, 06 Oct 2001 10:07:23 -0400


Alexandre Fayolle wrote:
>the official way of creating a DOM tree is by using a reader class, such
>as the xml.dom.ext.reader.Sax2.Reader class. If what you want is to process HTML,
>you'll want to use xml.dom.ext.reader.HtmlLib.Reader.
>

Because HTMLTidy (http://www.w3.org/People/Raggett/tidy/) is so good at making sense of funky HTML, I use it to produce XHTML, which can then be processed with XML tools.

I made a little Python module that calls the command-line version of HTMLTidy with the appropriate arguments. I'll be happy to share it if anyone wants to see it.
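
For illustration only, here is a minimal sketch of what a wrapper like that could look like. It is not the module mentioned above. It assumes a `tidy` executable is on the PATH and that the `-q`, `-asxml`, `-numeric` and `-utf8` options are the ones wanted; the function names (`build_tidy_command`, `html_to_xhtml`, `parse_xhtml`) are hypothetical, and the standard library's `xml.dom.minidom` stands in for whichever DOM reader you prefer.

```python
import subprocess
from xml.dom import minidom


def build_tidy_command(executable="tidy"):
    """Return the argument list for a quiet HTML-to-XHTML conversion.

    Assumed options: -q (quiet), -asxml (emit well-formed XHTML),
    -numeric (numeric rather than named entities), -utf8 (UTF-8 in and out).
    """
    return [executable, "-q", "-asxml", "-numeric", "-utf8"]


def html_to_xhtml(html, executable="tidy", run=subprocess.run):
    """Pipe an HTML string through tidy and return the XHTML it writes.

    `run` is injectable so the function can be exercised without tidy installed.
    """
    result = run(
        build_tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 when clean, 1 when it only issued warnings,
    # and 2 when it hit errors it could not repair.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")


def parse_xhtml(xhtml):
    """Parse an XHTML string into a DOM document with the standard library."""
    return minidom.parseString(xhtml.encode("utf-8"))
```

With tidy installed, something like `parse_xhtml(html_to_xhtml(open("page.html").read()))` should give a DOM tree that ordinary XML tools can walk.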

Wade Leftwich
Ithaca, NY