Parsing complex web pages safely with htmllib.HTMLParser
Andy Bulka
abulka at netspace.net.au
Thu Jan 24 00:52:30 EST 2002
The following snippet of code parses a web page on my disk and prints
the URLs found in it. It works for everything I've tried except the
page I really want,
http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60029.html
which lists the weather in my state. Instead I get an exception:
SGMLParseError: unexpected char in declaration: '<'
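
import htmllib
import formatter

# NullFormatter discards the rendered output; only the parsing matters here.
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
parser.close()
# anchorlist collects the href of every <a> tag the parser has seen.
print parser.anchorlist
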
MY QUESTION: Is htmllib.HTMLParser likely to fail here and there, on
complex web pages or even ordinary ones? Loading the above page into
Frontpage and saving it out again does nothing to fix the problem, so
it's probably OK HTML. What do I do about this? Ask my Government Bureau
of Meteorology to change the way they do their web pages?! Of course
I can catch the exception, but I REALLY *want* the info on that
weather page...
Or is this just a bug in htmllib.HTMLParser?
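
To make the "catch the exception but still get the info" idea concrete,
here is a minimal sketch of one possible workaround. It is not htmllib and
not a fix for it: it assumes a later Python (3.x), where the standard
library offers html.parser.HTMLParser, and it falls back to a crude regular
expression if the parser raises. The names AnchorCollector, HREF_RE and
extract_links are made up for illustration, and the regex is only a rough
approximation of href syntax.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values of <a> tags, similar in spirit to anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


# Hypothetical fallback pattern: grabs the href of anything that looks like
# an <a ...> tag. It will miss or mangle unusual markup.
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE
)


def extract_links(html_text):
    """Return the hrefs found by the parser, or by the regex if parsing fails."""
    parser = AnchorCollector()
    try:
        parser.feed(html_text)
        parser.close()
        return parser.anchorlist
    except Exception:
        # Assumption: any parser error means the page is malformed, so a
        # rough regex scan is better than returning nothing at all.
        return HREF_RE.findall(html_text)
```

Calling extract_links(open('ATROUBLESOMECOMPLEXPAGE.htm').read()) would
then return a list of URLs either way. Whether the regex fallback finds
everything on a given page is something to check by eye.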
Andy Bulka
Australia
www.atug.com/andypatterns