Parsing complex web pages safely with htmllib.HTMLParser

Bernard Yue bernie at 3captus.com
Thu Jan 24 02:40:10 EST 2002


Hi Andy,

> MY QUESTION:  Is htmllib.HTMLParser likely to fail here and there, on
> complex or otherwise web pages?  Loading the above page into Frontpage
> and saving it out again does nothing to fix the problem - so it's
> probably OK HTML.  What do I do about this - ask my Government Bureau
> of Meteorology to change the way they do their web pages ?!! Of course
> I can catch the exception, but I REALLY *want* the info on that
> weather page...
> 
> Or is this just a bug in htmllib.HTMLParser ?
> 

I've been using htmllib.HTMLParser for quite a while.  It does its job
very well if you feed it a valid HTML document.  In fact, I've been
using it to parse a lot of web pages, many of them larger and more
complex than the one you are having problems with.  The hard truth is
that when it comes to ill-formed HTML, HTMLParser does not cope as well
as a web browser does (your page does display in Netscape and
Konqueror as well).

The page you are referring to is quite far from being a valid HTML
document.  I tried the page with W3C's HTML validator
(http://validator.w3.org/check/referer), and the result doesn't look
good.

If you open the page in a plain text editor, and you are familiar with
the HTML standard, you will notice quite a lot of errors in it (for
example two <HTML> and </HTML> tags, and a lot of ^M characters, i.e.
carriage returns from a DOS editor).
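
Something along these lines could take care of those two particular
problems before the text reaches the parser.  This is only a sketch:
preclean is a made-up helper name, not part of htmllib, and since the
page has other errors too it may well not be enough on its own.

```python
import re

# Hypothetical helper (not part of htmllib): tidy up the two problems
# mentioned above, stray carriage returns and duplicated <HTML> tags.
_HTML_OPEN = re.compile(r'<html\b[^>]*>', re.IGNORECASE)
_HTML_CLOSE = re.compile(r'</html\s*>', re.IGNORECASE)

def preclean(text):
    # Turn DOS line endings (the ^M characters) into plain newlines.
    text = text.replace('\r\n', '\n').replace('\r', '\n')

    # Keep the first <HTML> tag and drop any later ones.
    first = _HTML_OPEN.search(text)
    if first:
        head = text[:first.end()]
        tail = _HTML_OPEN.sub('', text[first.end():])
        text = head + tail

    # Keep the last </HTML> tag and drop any earlier ones.
    closes = list(_HTML_CLOSE.finditer(text))
    if closes:
        last = closes[-1]
        body = _HTML_CLOSE.sub('', text[:last.start()])
        text = body + text[last.start():]

    return text
```

You would then pass preclean(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
to parser.feed() instead of the raw file contents.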

I tried cleaning up the HTML a bit to get it past the parser.
However, the page contains so many errors that I think it would take
me another half an hour to finish, so I stopped.

Maybe you can try it yourself and suggest the changes to the webmaster.


Bernie


Andy Bulka wrote:
> 
> The following snippet of code parses a web page on my disk and prints
> the urls found in it.  It works for everything I've tried but not the
> page I really want
>   http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60029.html
> which lists the weather in my state.  Instead I get an exception
> SGMLParseError: unexpected char in declaration: '<'
> 
> import htmllib
> import formatter
> parser=htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
> parser.close()
> print parser.anchorlist
> 

> Andy Bulka
> Australia
> www.atug.com/andypatterns

-- 
There are three schools of magic.  One:  State a tautology, then ring
the changes on its corollaries; that's philosophy.  Two:  Record many
facts.  Try to find a pattern.  Then make a wrong guess at the next
fact; that's science.  Three:  Be aware that you live in a malevolent
Universe controlled by Murphy's Law, sometimes offset by Brewster's
Factor; that's engineering.

So far as I can remember, there is not one word in the Gospels in
praise of intelligence.
                -- Bertrand Russell


