how to get text between HTML tags with URLLIB??
Roy Katz
katz at Glue.umd.edu
Fri Aug 18 18:43:25 EDT 2000
Hello,
I'm writing a link checker using urllib. I'm using the following
technique, based on htmllib, to extract bookmark info from Netscape
bookmark files:
import htmllib, formatter

# override htmllib's start_a to hook in our registration function
# (link_type is my own class, not shown here)
class wcheck_parse( htmllib.HTMLParser ):
    links = []
    def start_a( self, attrs ):
        self.links.append( link_type( attrs ) )
And then I use htmllib to open and parse the bookmark file:
parser = wcheck_parse( formatter.NullFormatter() )
parser.feed( open( bkfname ).read() )
The problem lies in overriding start_a(); 'attrs' will contain the URL,
time visited, time created, etc., but it will *not* give me the text
between the HTML tags. So for example, for the following link I will
not get the text between the opening <a> and closing </a> tags:
<a href="http://wacky.roey.com"> 'Roey's Wacky Server of Fun!' </a>
This is really frustrating. Why isn't this mentioned in the htmllib docs?
The same goes for the text between the <DT>...</DT> tags (or any other
tags). Bottom line: how do I get the text between HTML tags?
Thanks!
Roey Katz
katz at wam dot umd dot edu
deranged pythoneer
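
A minimal sketch of one way to capture the text between the opening and
closing <a> tags asked about above. It assumes the html.parser module
from Python 3's standard library rather than the htmllib module used in
the post; the class name LinkTextParser and the sample input string are
made up for illustration. The idea is to start collecting when the tag
opens, accumulate character data while it is open, and store the result
when it closes.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.links = []          # per-instance list of (href, text) tuples
        self._in_anchor = False  # True while an <a> element is open
        self._href = None        # href attribute of the open <a>, if any
        self._chunks = []        # pieces of text seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; keep it only inside an <a>.
        if self._in_anchor:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._chunks).strip()
            self.links.append((self._href, text))
            self._in_anchor = False


# Hypothetical one-line input standing in for a bookmark file.
sample = '<DT><A HREF="http://wacky.roey.com">Roey\'s Wacky Server of Fun!</A>'

parser = LinkTextParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

The same pattern should carry over to other tags by changing the tag
name being matched. Note that links is created per instance here, so
two parsers do not share one list. For code that stays on htmllib, its
HTMLParser class also provides save_bgn() and save_end() for capturing
character data, which may serve a similar purpose.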