Help: HTMLParser cannot parse some web pages?

Paul Lim paullim at starhub.net.sg
Wed Oct 17 08:21:39 EDT 2001


Hi,
I am a newbie in Python. I hope the gurus can advise me on the
following.

I am trying to extract the links in an HTML file.
My code is shown at the end of this message.

The code generally works fine, but I want to understand more about the
HTMLParser module. There are some web pages from which I cannot
extract the links, and I really don't understand why. An example is
http://www.admissions.rmit.edu.au/about/index.html

Is there a certain limitation in HTMLParser? For example, are there
certain kinds of web pages it cannot extract links from? If so, which kinds?

Thank you very much for your help.

Sincerely
Paul

"To extract the links in a page."

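import sys
import urllib
from htmllib import HTMLParser
from formatter import NullFormatter

# "extractLinks" is a placeholder name; the post did not show the def line
def extractLinks(link):
    # Open the url and return a file-like url handler
    try:
        linkHandler = urllib.urlopen(link)
    except IOError:
        print "Unable to open url!"
        return []

    # Extract the links from the HTML file; htmllib stores them in parser.anchorlist
    parser = HTMLParser(NullFormatter())
    try:
        parser.feed(linkHandler.read())
        parser.close()
    except:
        # Show the actual error instead of silently swallowing it
        print "Unable to extract!", sys.exc_info()[1]
    linkHandler.close()
    return parser.anchorlist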
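
For comparison, here is a minimal sketch of the same link-extraction idea in present-day Python 3, using the standard `html.parser` module rather than the `htmllib.HTMLParser` class from the post. The names `LinkCollector`, `extract_links` and `sample`, and the `www.example.com` base URL, are assumptions made up for illustration. It works on an HTML string that has already been downloaded, and it does not try to diagnose why the RMIT page fails with the older module.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper, not from the post)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's base URL
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url=""):
    """Return the list of link targets found in html_text."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Small inline sample with mixed-case tags and an unquoted attribute value
sample = '<p><A HREF="about/index.html">About</A> <a href=contact.html>Contact</a>'
print(extract_links(sample, "http://www.example.com/"))
```

Collecting links in `handle_starttag` plays the same role as reading `anchorlist` in the original code, and because the parser is fed a plain string, the sketch can be tried on saved copies of problem pages without any network access.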





