HTMLParser cannot parse some web pages?

Gillou nospam at bigfoot.com
Wed Oct 17 08:25:16 EDT 2001


Paul,

Have a look at this...

http://www.oreilly.com/catalog/pythonsl/chapter/ch05.html

This includes an example that samples the links of a page.

Of course, like most HTML sniffers, it cannot properly handle links whose
targets are the results of JavaScript expressions.
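
To illustrate that limitation, here is a minimal sketch. It assumes the
html.parser module from the current standard library rather than the htmllib
parser discussed in this thread, and LinkCollector, extract_links and the
sample markup are hypothetical names and data made up for the illustration.
A parser of this kind only sees href values that are literally present in the
markup: a "javascript:" target comes back as an unevaluated string, and a URL
that exists only inside a script is never reported as a link.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Made-up sample page: one plain link, one scripted link, and one URL that
# only appears inside a script and never as an <a href="..."> attribute.
sample = (
    '<a href="about.html">About</a>'
    '<a href="javascript:openPage(\'x\')">Scripted</a>'
    '<script>location.href = "built.html";</script>'
)
links = extract_links(sample)
print(links)
```

The scripted link is returned as the raw "javascript:..." text, not as the
page it would open, and "built.html" does not appear in the result at all.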

--Gilles

"Paul Lim" <paullim at starhub.net.sg> a écrit dans le message news:
3BCD77D2.EBFB2D98 at starhub.net.sg...
> Hi,
> I am a newbie in Python. I hope the gurus can advise me on the
> following.
>
> I am trying to extract the links in an HTML file.
> My code is shown below:
>
> The code works fine, but I want to understand more about this
> HTMLParser module. Apparently, there are some web pages from which I
> cannot extract the links, and I really don't understand why.
> An example is
> http://www.admissions.rmit.edu.au/about/index.html
>
> Is there some limitation in this HTMLParser? For example, is it unable
> to extract links from certain kinds of web pages? If so, which kinds?
>
> Thank you very much for your help.
>
> Sincerely
> Paul
>
> "To extract the links in a page."
>
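>  import urllib
>  from htmllib import HTMLParser
>  from formatter import NullFormatter
>
>  # Open the URL and get a file-like handle to its contents
>  linkHandler = None
>  try:
>   linkHandler = urllib.urlopen(link)
>  except IOError:
>   print "Unable to open url!"
>
>  # Extract the links from the HTML; they are stored in parser.anchorlist
>  if linkHandler is not None:
>   try:
>    parser = HTMLParser(NullFormatter())
>    parser.feed(linkHandler.read())
>    parser.close()
>   except Exception, e:
>    # Print the error so the cause of a failed extraction is visible
>    print "Unable to extract!", e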