Python web client anyone?

Mon Oct 15 10:53:17 EDT 2001

On 14 Oct 2001, Paul Rubin wrote:

> fine.  What I *really* want is to be able to easily find link objects
> (anchor tags) based on the anchor text, which LWP for some reason
> doesn't do, but DOM extraction would be a start.  By "anchor text" I
> mean the text in <a href=blah.html>this is the anchor text</a>.  The
> client should be able to find some "underlined" text on the page it
> retrieves, and "click" on the linked document.

Try something like this:

import sgmllib

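class HTMLParser(sgmllib.SGMLParser):
  """Collect (href, anchor text) pairs from an HTML document."""

  def __init__(self):
    sgmllib.SGMLParser.__init__(self)
    self.insideTag = 0
    self.lastHref = None
    self.hrefText = ''
    self.links = []

  def parseLinks(self, data):
    self.feed(data)
    self.close()
    return self.links

  def start_a(self, args):
    # Only anchors that carry an href count as links.
    for key, value in args:
      if key.lower() == 'href':
        self.insideTag = 1
        self.lastHref = value
        self.hrefText = ''

  def handle_data(self, data):
    # Anchor text may arrive in several chunks, so accumulate it.
    if self.insideTag:
      self.hrefText = self.hrefText + data

  def end_a(self):
    # Skip anchors without an href, such as <a name=...>.
    if self.insideTag:
      self.links.append((self.lastHref, self.hrefText))
    self.insideTag = 0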

Use it like this:
data = open('somefile.html').read()
links = HTMLParser().parseLinks(data)
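
To get from the list of links to the "click" described in the quoted message, something along these lines might work. This is only a sketch, and it assumes Python 3, where sgmllib is no longer available and html.parser is the closest standard-library replacement. The names LinkCollector and find_link, the base_url parameter, and the example.com URL are all made up for illustration; the function only returns the URL to fetch and leaves the fetching to the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None    # href of the anchor being read, if any
        self._chunks = []    # pieces of that anchor's text

    def handle_starttag(self, tag, attrs):
        # Only anchors that carry an href count as links.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> arrives in separate calls.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._chunks).strip()))
            self._href = None


def find_link(html, anchor_text, base_url=''):
    """Return the URL of the first link whose text matches, else None."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    for href, text in parser.links:
        if text == anchor_text:
            # Resolve relative hrefs against the page's own URL.
            return urljoin(base_url, href)
    return None


page = '<a href=blah.html>this is the anchor text</a>'
url = find_link(page, 'this is the anchor text', 'http://www.example.com/docs/')
# url is 'http://www.example.com/docs/blah.html'
```

Matching is exact on the stripped anchor text here; a case-insensitive or substring match may suit real pages better.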

-Dave