[XML-SIG] Exception handling with xml.dom.ext.reader.HtmlLib
Mark Bennett
mbennett@ideaeng.com
Wed, 25 Apr 2001 16:43:06 -0700
I was thrilled to see sample Python scripts for parsing HTML. The
library seems to handle lots of common mistakes, like unbalanced tags,
that most XML parsers will reject by design. By its nature, HTML is
rarely in proper XML format.
But I've hit a couple of snags with the library, and I was wondering
whether anybody had any ideas:
* There are some classes of common HTML mistakes that it doesn't
handle, like unbalanced quotes, for example <font color="red> or
<font color=red">. The second form gives a stack dump.
* When it does crash, it doesn't give you any information about
the source file, such as which line it was looking at. Such info
would be helpful.
* Though I don't know the exact cause, it doesn't handle pages
like http://www.cnn.com.
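
To make the first two points concrete, here is a rough sketch of the kind
of diagnostics I have in mind. It does not use HtmlLib at all: it is a
plain regex pre-scan, and the names find_unbalanced_quotes and
parse_with_context are hypothetical helpers made up for illustration. It
assumes each tag sits on one line and contains no '>' inside an attribute
value, so treat it as a heuristic and not a real parser.

```python
import re

# Heuristic: a "tag" is anything between '<' and the next '>' on one line.
TAG_RE = re.compile(r"<[^<>]*>")


def find_unbalanced_quotes(html_text):
    """Return (line, column, tag) for each tag with an odd number of '"'."""
    suspects = []
    for lineno, line in enumerate(html_text.splitlines(), start=1):
        for match in TAG_RE.finditer(line):
            tag = match.group(0)
            # An odd count of double quotes means one was never closed
            # (or never opened), as in <font color="red> or <font color=red">.
            if tag.count('"') % 2 != 0:
                suspects.append((lineno, match.start() + 1, tag))
    return suspects


def parse_with_context(parse, html_text, source_name):
    """Call parse(html_text); on failure, re-raise with source details.

    `parse` is any callable taking the HTML text (an assumption; plug in
    whatever parser entry point you actually use).
    """
    try:
        return parse(html_text)
    except Exception as exc:
        suspects = find_unbalanced_quotes(html_text)
        where = "; ".join(
            "line %d, column %d: %s" % s for s in suspects
        ) or "no unbalanced quotes found"
        raise ValueError(
            "%s: parse failed (%s); %s" % (source_name, exc, where)
        ) from exc


if __name__ == "__main__":
    sample = '<p>ok</p>\n<font color=red">text</font>\n'
    for lineno, column, tag in find_unbalanced_quotes(sample):
        print("line %d, column %d: %s" % (lineno, column, tag))
```

Wrapping the real parser call in something like parse_with_context would
at least turn a bare stack dump into a message naming the file and the
suspicious tags, even if the parser itself stays as strict as it is now.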
I'm not a parsing expert, but I'd be happy to contribute to any efforts
to make the parser more robust. Processing existing (poorly formed)
HTML is the 800-pound gorilla for lots of XML applications, and this
library already goes a long way.
Thanks,
Mark