[XML-SIG] A "tolerant" parser for structure-challenged HTML files

Fri, 20 Jul 2001 17:03:03 +0200

A couple of weeks ago I was faced with the problem of processing a few
web pages which were generated by Microsoft Word (and post-processed
by some other structure-pessimizing program).  Among the various Python
*ML parsers I didn't find any that could retrieve the "intended document
structure" (like most browsers can) and that didn't choke on the input.

Therefore I wrote a "TolerantParser" class, based on sgmllib's parser,
which tries to understand input like, for example,

    <B><FONT SIZE=2><P>- 34/2001 -</B></FONT>

assuming here that </B> implies </P></FONT> and ignoring the following
</FONT>.  Although I didn't deal with every imaginable kind of nonsense,
this worked for me; in a derived class I generate a minidom Document
and add the "accepted" HTML nodes to it.  Then I can use DOM methods to
extract the data that I actually need.
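
To illustrate the idea, here is a minimal sketch of the stack-based
recovery described above.  It is not the code from tweak_html.py: the
names TolerantDomBuilder, parse_tolerant, VOID_TAGS and the "root"
wrapper element are hypothetical, and because sgmllib is only available
in Python 2, the sketch substitutes html.parser.HTMLParser from the
Python 3 standard library.  An end tag closes the innermost open
element of that name together with everything opened after it; an end
tag with no open match is ignored.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Assumption: a small, incomplete set of elements that never take an end tag.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class TolerantDomBuilder(HTMLParser):
    """Hypothetical sketch: build a minidom tree from badly nested HTML."""

    def __init__(self):
        super().__init__()
        self.doc = Document()
        root = self.doc.createElement("root")  # hypothetical wrapper element
        self.doc.appendChild(root)
        self.stack = [root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        node = self.doc.createElement(tag)
        for name, value in attrs:
            node.setAttribute(name, "" if value is None else value)
        self.stack[-1].appendChild(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Search from the innermost open element outwards, never
        # touching the wrapper element at index 0.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                # Closing this element implicitly closes its descendants.
                del self.stack[i:]
                return
        # No open element of that name: drop the stray end tag.

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_tolerant(html):
    """Parse an HTML string and return a minidom Document."""
    builder = TolerantDomBuilder()
    builder.feed(html)
    builder.close()
    return builder.doc
```

With the example line from above, the resulting tree can then be
queried with ordinary DOM methods:

```python
doc = parse_tolerant("<B><FONT SIZE=2><P>- 34/2001 -</B></FONT>")
para = doc.getElementsByTagName("p")[0]
print(para.firstChild.data)  # prints "- 34/2001 -"
```

This sketch only covers the "implied end tag" and "stray end tag"
cases; it does not attempt the other repairs a browser performs.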

Since I haven't seen anyone else doing this so far, I'd like to make
these classes publicly available
(<http://starship.python.net/~lannert/tweak_html.py>) and to solicit
your comments.  If anything like (or, probably, better than) this
exists somewhere, please let me know; I'd also love to hear any
criticism or suggestions.

  Detlef