Parsing complex web pages safely with htmllib.HTMLParser

Andy Bulka abulka at netspace.net.au
Thu Jan 24 00:52:30 EST 2002


The following snippet of code parses a web page on my disk and prints
the URLs found in it.  It works for everything I've tried except the
page I really want,
  http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60029.html
which lists the weather in my state.  Instead I get an exception:
  SGMLParseError: unexpected char in declaration: '<'

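import htmllib
import formatter

# NullFormatter discards rendered output; only the parse side effects matter.
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
parser.close()
# anchorlist collects the href of every <a> tag seen during parsing.
print parser.anchorlist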

MY QUESTION:  Is htmllib.HTMLParser likely to fail here and there on
web pages, complex or otherwise?  Loading the above page into FrontPage
and saving it out again does nothing to fix the problem, so it's
probably OK HTML.  What do I do about this: ask my Government Bureau
of Meteorology to change the way they do their web pages?!  Of course
I can catch the exception, but I REALLY *want* the info on that
weather page...
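
To make the "catch the exception" route concrete, here is a rough sketch
of one way it could still salvage the links: try the strict parser
first, and if it raises, fall back to a loose regular-expression scan
for href attributes.  The names extract_urls, strict_parse and HREF_RE
are made up for illustration, the sample page is invented, and the
regex is only a heuristic that will miss or mangle some links.

```python
import re

# Loose pattern for href="...", href='...' and unquoted href=... attributes.
# This is a heuristic, not a real HTML parser.
HREF_RE = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)


def extract_urls(html, parse=None):
    """Return the URLs found in html.

    parse is a hypothetical callable that takes the page text and returns
    a list of URLs (for example, a wrapper around the htmllib snippet
    above).  If it is missing or raises, fall back to the regex scan.
    """
    if parse is not None:
        try:
            return list(parse(html))
        except Exception:
            # The strict parser choked on the page; salvage what we can.
            pass
    return HREF_RE.findall(html)


def strict_parse(html):
    # Stand-in for a parser that rejects the page, as htmllib did here.
    raise ValueError("unexpected char in declaration: '<'")


# Invented sample with a malformed declaration and two anchors.
page = '<!DOCTYPE <html><a href="a.html">A</a> <A HREF=\'b.html\'>B</A>'
print(extract_urls(page, strict_parse))
```

The fallback trades correctness for robustness: it does not care
whether the markup is well formed, but it also cannot tell an anchor
from an href that merely appears in a comment or a script.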

Or is this just a bug in htmllib.HTMLParser ?

Andy Bulka
Australia
www.atug.com/andypatterns


