Parsing complex web pages safely with htmllib.HTMLParser

Thu Jan 24 05:34:01 EST 2002

abulka at netspace.net.au (Andy Bulka) wrote in message news:<13dc97b8.0201232152.66d56faa at posting.google.com>...
> The following snippet of code parses a web page on my disk and prints
> the urls found in it.  It works for everything I've tried but not the
> page I really want
>   http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60029.html
> which lists the weather in my state.  Instead I get an exception
> SGMLParseError: unexpected char in declaration: '<'

This may well be caused by a "script" element in the page. Currently,
the various standard library HTML parsers don't seem to deal with
"script" elements very well, especially when the enclosed code
contains "<" characters. What you can do is preprocess the page text
with a function that wraps the contents of each such element in
"CDATA" notation. Something like this seems to work (at least in
conjunction with the xml.dom.ext.reader interface to these parsers):

  <script ...><![CDATA[
    ...
  ]]></script>
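
A minimal sketch of such a preprocessing function follows, using a
regular expression. The name wrap_scripts_in_cdata and the pattern are
hypothetical, chosen only for illustration, and a regular expression
will not cope with every page (for example, script text that itself
contains "]]>", or a self-closing "script" tag).

```python
import re

# Matches an opening <script ...> tag, the enclosed code, and the closing tag.
# Hypothetical pattern for illustration; real pages may need something sturdier.
SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script\s*>)",
    re.IGNORECASE | re.DOTALL,
)


def wrap_scripts_in_cdata(html):
    """Return html with the body of each script element wrapped in CDATA."""

    def _wrap(match):
        open_tag, body, close_tag = match.groups()
        # Leave elements alone if they already use CDATA notation.
        if "<![CDATA[" in body:
            return match.group(0)
        return open_tag + "<![CDATA[" + body + "]]>" + close_tag

    return SCRIPT_RE.sub(_wrap, html)
```

The page text would be passed through this function before being
handed to parser.feed().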

> import htmllib
> import formatter
> parser=htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
> parser.close()
> print parser.anchorlist

I tend to use sgmllib.SGMLParser and I've been working on a Web page
which describes it in use. I think that the "Dive Into Python"
(http://www.diveintopython.org) site also covers SGMLParser.
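
For readers on Python 3, where sgmllib and htmllib are no longer
available, roughly the same anchor collection can be sketched with
html.parser.HTMLParser instead. Note that this swaps in a different
parser from the ones discussed in this thread. The class name
AnchorCollector is hypothetical, and the anchorlist attribute only
mimics the one htmllib provided.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href value of every "a" element fed to the parser."""

    def __init__(self):
        super().__init__()
        # Named after htmllib's attribute; not provided by html.parser itself.
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)
```

Usage follows the same feed/close pattern as the quoted snippet:
create an AnchorCollector, call feed() with the page text, call
close(), then read anchorlist.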

> MY QUESTION:  Is htmllib.HTMLParser likely to fail here and there, on
> complex or otherwise web pages?  Loading the above page into Frontpage
> and saving it out again does nothing to fix the problem - so it's
> probably ok HTML.  What do I do about this - ask my Government Bureau
> of Meteorology to change the way they do their web pages ?!! Of course
> I can catch the exception, but I REALLY *want* the info on that
> weather page...

You probably won't have much luck asking people to change their pages,
especially if they are dynamic pages, produced by some dodgy
templating language. Another hint: if you still can't make any sense
out of a broken Web page, introduce mxTidy into your "processing
pipeline"...

  http://www.lemburg.com/files/python/mxTidy.html

Of course, what we all really need is for XHTML to come into
widespread use, so that we can consign broken HTML to history.

Paul