[XML-SIG] Exception handling with xml.dom.ext.reader.HtmlLib

Mark Bennett mbennett@ideaeng.com
Wed, 25 Apr 2001 16:43:06 -0700


I was thrilled to see sample Python scripts for parsing HTML.  The 
library seems to handle lots of common mistakes, like unbalanced tags,
that most XML parsers will reject by design.  By its nature, HTML is 
rarely well-formed XML.

But I've hit a couple of snags with the library, and I was wondering 
whether anybody had any ideas:

* There are some classes of common HTML mistakes that it doesn't
  handle, like unbalanced quotes, as in <font color="red> or
  <font color=red">.  The second form gives a stack dump.
* When it does crash, it doesn't give you any information about
  the source file, such as which line it was looking at.  Such info
  would be helpful.
* Though I don't know the exact cause, it doesn't handle pages
  like http://www.cnn.com
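
As a stopgap for the first two points, a pre-check along the following
lines might flag suspect tags, with line numbers, before the parser ever
sees the page.  This is only a rough sketch: find_unbalanced_quotes and
its regex are made up for illustration and are not part of HtmlLib, it
only looks at double quotes, and it will misfire on attribute values
that legitimately contain '>'.

```python
import re

# Heuristic: anything from '<' plus a letter or '/' up to the next '>'.
# Assumption: no attribute value contains a literal '<' or '>'.
TAG_RE = re.compile(r"<[A-Za-z/][^<>]*>")


def find_unbalanced_quotes(html_text):
    """Return (line_number, tag_text) pairs for tags that contain an
    odd number of double quotes.  Line numbers are 1-based."""
    suspects = []
    for match in TAG_RE.finditer(html_text):
        tag = match.group(0)
        if tag.count('"') % 2 == 1:
            # Count the newlines before the tag to recover its line number.
            line_number = html_text.count("\n", 0, match.start()) + 1
            suspects.append((line_number, tag))
    return suspects


if __name__ == "__main__":
    sample = (
        "<html>\n"
        '<font color="red>one</font>\n'
        '<font color=red">two</font>\n'
        '<font color="blue">three</font>\n'
        "</html>"
    )
    for line_number, tag in find_unbalanced_quotes(sample):
        print("line %d: suspicious tag %s" % (line_number, tag))
```

Running something like this first would at least tell you where to look
in the source file when the real parser falls over.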

I'm not a parsing expert, but I'd be happy to contribute to any efforts 
to make the parser more robust.  Processing existing (poorly formed) 
HTML is the 800-pound gorilla for lots of XML applications.  This 
library already goes a long way.

Thanks,
Mark