How to get text between HTML tags with htmllib?

Roy Katz katz at Glue.umd.edu
Fri Aug 18 18:43:25 EDT 2000


Hello,

I'm writing a link checker using urllib.  To extract bookmark info from
Netscape bookmark files, I use htmllib with the following technique:

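  import formatter, htmllib

  # override htmllib's start_a to hook in our registration function
  # (link_type is defined elsewhere in my code)
  class wcheck_parse( htmllib.HTMLParser ):
      links = []   # note: class-level list, shared by all instances
      def start_a( self, attrs ):  self.links.append( link_type( attrs ) )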


Then I use htmllib to parse the bookmark file:

  parser = wcheck_parse( formatter.NullFormatter() )
  parser.feed( open( bkfname ).read()  )
 

The problem lies in overriding start_a():  'attrs' will contain the URL,
time visited, time created, etc., but it will *not* give me the text
between the HTML tags. So for example, for the following link I will
not get the text between the opening and closing <a> tags:

 <a href="http://wacky.roey.com">Roey's Wacky Server of Fun!</a>

This is really frustrating.  Why isn't this mentioned in the htmllib docs?
The same goes for the text between the <DT>...</DT> tags (or any other
tags!)  Bottom line:  how do I get the text between HTML tags?
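
For what it's worth, here is a minimal sketch of one way this could work.
The start_* hooks only ever receive the tag's attributes; the text between
tags is delivered separately, through handle_data().  So the idea is to
remember when an <a> opens, collect the data that arrives, and pair it with
the href when the </a> comes in.  htmllib was removed in Python 3, so this
sketch assumes the standard html.parser module instead (in htmllib itself,
the anchor_bgn()/anchor_end() and save_bgn()/save_end() methods play a
similar role).  The class name LinkTextParser and its attributes are made up
for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs, per instance
        self._href = None      # href of the <a> currently open, if any
        self._chunks = None    # text pieces seen since that <a> opened

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; it never holds the link text.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Text between tags arrives here, possibly in several pieces.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # On </a>, join the collected pieces and record the finished link.
        if tag == "a" and self._chunks is not None:
            text = "".join(self._chunks).strip()
            self.links.append((self._href, text))
            self._href = None
            self._chunks = None


parser = LinkTextParser()
parser.feed('<dt><a href="http://wacky.roey.com">Roey\'s Wacky Server of Fun!</a>')
parser.close()
print(parser.links)
```

The same pattern should carry over to <DT> or any other tag: switch the
collecting on in handle_starttag() and read the result in handle_endtag().
Note that real bookmark files often omit closing tags such as </DT>, so for
those it may be safer to end the collection when the next tag opens.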


Thanks!

Roey Katz
katz at wam dot umd dot edu


deranged pythoneer



