Python web client anyone?
brueckd at tbye.com
brueckd at tbye.com
Mon Oct 15 10:53:17 EDT 2001
On 14 Oct 2001, Paul Rubin wrote:
> fine. What I *really* want is to be able to easily find link objects
> (anchor tags) based on the anchor text, which LWP for some reason
> doesn't do, but DOM extraction would be a start. By "anchor text" I
> mean the text in <a href=blah.html>this is the anchor text</a>. The
> client should be able to find some "underlined" text on the page it
> retrieves, and "click" on the linked document.
Try something like this:
import sgmllib
class HTMLParser(sgmllib.SGMLParser):
def __init__(self):
sgmllib.SGMLParser.__init__(self)
self.insideTag = 0
self.links = []
def parseLinks(self, data):
self.feed(data)
self.close()
return self.links
def start_a(self, args):
for key, value in args:
if key.lower() == 'href':
self.insideTag = 1
self.lastHref = value
def handle_data(self, data):
if self.insideTag:
self.hrefText = data
def end_a(self):
self.links.append((self.lastHref, self.hrefText))
self.insideTag = 0
Use it like this:
data = open('somefile.html').read()
links = HtmlParser().parseLinks(data)
-Dave
More information about the Python-list
mailing list