Parsing complex web pages safely with htmllib.HTMLParser
Joonas Paalasmaa
joonas at olen.to
Thu Jan 24 06:43:55 EST 2002
abulka at netspace.net.au (Andy Bulka) wrote in message news:<13dc97b8.0201232152.66d56faa at posting.google.com>...
> The following snippet of code parses a web page on my disk and prints
> the urls found in it. It works for everything I've tried but not the
> page I really want
> http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60029.html
> which lists the weather in my state. Instead I get an exception
> SGMLParseError: unexpected char in declaration: '<'
>
> import htmllib
> import formatter
> parser=htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
> parser.close()
> print parser.anchorlist
>
> MY QUESTION: Is htmllib.HTMLParser likely to fail here and there on
> web pages, complex or otherwise? Loading the above page into Frontpage
> and saving it out again does nothing to fix the problem - so it's
> probably OK HTML. What do I do about this - ask my Government Bureau
> of Meteorology to change the way they do their web pages ?!! Of course
> I can catch the exception, but I REALLY *want* the info on that
> weather page...
>
> Or is this just a bug in htmllib.HTMLParser ?
Use HTML Tidy to clean up the page and then parse it with HTMLParser.
Tidy project page: http://tidy.sourceforge.net/
Python interface to tidy: http://www.lemburg.com/files/python/mxTidy.html
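
A rough sketch of the "tidy first, then parse" idea follows. It is not the
mxTidy API: it assumes a command-line `tidy` executable may be on the PATH
(and passes the page through unchanged if it is not), and it uses Python 3's
html.parser as a stand-in for htmllib, which was removed in Python 3. The
names tidy_html, AnchorCollector and extract_links are made up for
illustration.

```python
import shutil
import subprocess
from html.parser import HTMLParser


def tidy_html(html):
    """Return html cleaned by the `tidy` executable, or unchanged if unavailable."""
    exe = shutil.which("tidy")  # assumption: HTML Tidy is installed as "tidy"
    if exe is None:
        return html
    # tidy exits with status 1 on warnings and 2 on errors, so the exit
    # status is not treated as a failure; we only check for some output.
    result = subprocess.run(
        [exe, "-q", "--force-output", "yes"],
        input=html, capture_output=True, text=True,
    )
    return result.stdout or html


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags, like htmllib's parser.anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def extract_links(html):
    """Parse html and return the list of link targets found in it."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchorlist
```

With the page text read into a string, a call such as
extract_links(tidy_html(text)) gives the list of URLs. html.parser is lenient
about malformed declarations, so the Tidy step may turn out to be optional
for a page like this one, but that is untested against the page in question.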