HTMLParser cannot parse some web pages?
Gillou
nospam at bigfoot.com
Wed Oct 17 08:25:16 EDT 2001
Paul,
Have a look at this...
http://www.oreilly.com/catalog/pythonsl/chapter/ch05.html
This includes an example that samples the links of a page.
Of course, like most HTML sniffers, it cannot properly handle links whose
targets are the results of JavaScript expressions.
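To illustrate the limitation, here is a minimal sketch. It assumes a current
Python 3 and uses the standard html.parser module rather than the htmllib
parser from the original post. The LinkCollector class, the SAMPLE markup and
the example.com base URL are made up for the illustration. The parser only
sees the text of each href attribute, so a javascript: target comes back as an
unevaluated string and can only be set aside, not followed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for this sketch)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []         # ordinary links, resolved against base_url
        self.script_links = []  # javascript: hrefs; only a browser could evaluate these

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name in lower case and attrs as (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                if value.strip().lower().startswith("javascript:"):
                    self.script_links.append(value)
                else:
                    self.links.append(urljoin(self.base_url, value))


# Made-up markup standing in for a downloaded page.
SAMPLE = """
<html><body>
<a href="courses.html">Courses</a>
<a href="javascript:openWindow('apply.html')">Apply</a>
<a href="http://www.python.org/">Python</a>
</body></html>
"""

collector = LinkCollector("http://example.com/about/index.html")
collector.feed(SAMPLE)
collector.close()

print(collector.links)         # plain links, made absolute
print(collector.script_links)  # links the parser cannot resolve
```

Keeping the two lists separate makes it easy to see how many links on a page
are out of reach for a parser that does not run scripts.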
--Gilles
"Paul Lim" <paullim at starhub.net.sg> wrote in message news:
3BCD77D2.EBFB2D98 at starhub.net.sg...
> Hi,
> I am a newbie in Python. I hope the gurus can advise me on the
> following.
>
> I am trying to extract the links in an HTML file.
> My code is shown below:
>
> The code works fine. But I just want to understand more about this
> HTMLParser module. Apparently, there are some webpages where I cannot
> extract the links.
> But I really don't understand why. An example is
> http://www.admissions.rmit.edu.au/about/index.html
>
> Are there certain limitations in HTMLParser? For example, are there
> certain kinds of web pages it cannot extract links from? If so, which kinds?
>
> Thank you very much for your help.
>
> Sincerely
> Paul
>
> "To extract the links in a page."
>
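> # Imports needed by the code below (Python 2 standard library)
> import urllib
> from htmllib import HTMLParser
> from formatter import NullFormatter
>
> # Open the URL and get a file-like handle to the page
> try:
>     linkHandler = urllib.urlopen(link)
> except IOError:
>     print "Unable to open url!"
>
>
> # Extract the links from the HTML file; they are stored in parser.anchorlist
> try:
>     parser = HTMLParser(NullFormatter())
>     parser.feed(linkHandler.read())
> except:
>     print "Unable to extract!"
>     pass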
>
>
>