Parsing complex web pages safely with htmllib.HTMLParser

Thu Jan 24 06:43:55 EST 2002

abulka at netspace.net.au (Andy Bulka) wrote in message news:<13dc97b8.0201232152.66d56faa at posting.google.com>...
> The following snippet of code parses a web page on my disk and prints
> the urls found in it.  It works for everything I've tried but not the
> page I really want
>   http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60029.html
> which lists the weather in my state.  Instead I get an exception
> SGMLParseError: unexpected char in declaration: '<'
> 
> import htmllib
> import formatter
> parser=htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
> parser.close()
> print parser.anchorlist
> 
> MY QUESTION:  Is htmllib.HTMLParser likely to fail here and there on
> web pages, complex or otherwise?  Loading the above page into Frontpage
> and saving it out again does nothing to fix the problem - so it's
> probably OK HTML.  What do I do about this - ask my Government Bureau
> of Meteorology to change the way they do their web pages?!  Of course
> I can catch the exception, but I REALLY *want* the info on that
> weather page...
> 
> Or is this just a bug in htmllib.HTMLParser ?

Run the page through HTML Tidy to clean up the markup first, and then
parse the tidied output with htmllib.HTMLParser.

Tidy project page: http://tidy.sourceforge.net/
Python interface to tidy: http://www.lemburg.com/files/python/mxTidy.html
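
A minimal sketch of the tidy-then-parse approach, under some assumptions.
htmllib was removed in Python 3, so this uses the standard library's
html.parser.HTMLParser as a stand-in, and it calls the `tidy` command-line
program through subprocess rather than mxTidy. The names AnchorCollector,
tidy_html and extract_links are made up for illustration, and the Tidy
options shown (-q, -utf8, -asxhtml, --force-output) are one reasonable
choice, not the only one. If `tidy` is not installed, the sketch falls back
to parsing the raw page.

```python
import shutil
import subprocess
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags, similar to htmllib's anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def tidy_html(raw_html):
    """Clean up markup with the `tidy` executable when it is on PATH.

    Hypothetical helper: returns the input unchanged if tidy is missing
    or produces no output.
    """
    exe = shutil.which("tidy")
    if exe is None:
        return raw_html
    result = subprocess.run(
        [exe, "-q", "-utf8", "-asxhtml", "--force-output", "yes"],
        input=raw_html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # Tidy exits non-zero on warnings or errors, so the return code is not
    # checked; whatever output it produced is used.
    return result.stdout.decode("utf-8", "replace") or raw_html


def extract_links(raw_html):
    """Tidy the page, then return the list of anchor URLs found in it."""
    parser = AnchorCollector()
    parser.feed(tidy_html(raw_html))
    parser.close()
    return parser.anchorlist
```

For the page saved on disk, something like
extract_links(open('ATROUBLESOMECOMPLEXPAGE.htm').read()) would then
return the list of URLs.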