how to get text between HTML tags with URLLIB??

Roy Katz katz at Glue.umd.edu
Sun Aug 20 12:22:49 EDT 2000


Alright, I re-worked the classes as per your suggestion.  Interesting
note: the documentation for save_bgn and save_end indicates that
nested execution is prohibited.  What is that all about? The following
code works fine (albeit written sloppily)!


Thanks!
Roey




 
 
  # book-keeping
  class link_type:
    def __init__( self, k ):

        # k is the (name, value) attribute list of an <A> tag; the values
        # are read by position, as garnered from the NS bookmark fields
        self.url = k[0][1]
        self.vis = k[1][1]
        self.lv  = k[2][1]
        self.lm  = float(k[3][1])   # netscape uses long int

        self.name       = ''    # title of the site
        self.linkstack  = []    # where in the hierarchy this link is

        # placeholders
        self.inf = None
        self.checklink_status = 0
        self.u = 0

 
  import htmllib

  # override htmllib's tag handlers to hook in our registration functions
  class wcheck_parse( htmllib.HTMLParser ):
    # note: class-level lists are shared by every instance of the parser
    links     = []
    linkbuf   = None
    linkstack = []

    def start_h3( self, attrs ):
        # an <H3> holds a folder title; start capturing its text
        self.save_bgn()

    def end_h3( self ):
        # push the captured folder title onto the stack
        self.linkstack.append( self.save_end() )

    def end_dl( self ):
        # leaving a folder's <DL> list; drop its title from the stack
        if self.linkstack: self.linkstack.pop()

    def start_a( self, attrs ):
        self.save_bgn()
        self.linkbuf = link_type( attrs )

    def end_a( self ):
        self.linkbuf.name = self.save_end()
        self.linkbuf.linkstack  = self.linkstack[:]   # copy, not alias
        self.links.append( self.linkbuf )
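
On the nesting question: one plausible reading of the documentation's warning is that the parser keeps a single save buffer, which save_bgn() resets and save_end() clears. The toy class below models only that assumption. SaveBuffer, nested_demo and sequential_demo are hypothetical names, and this is not htmllib's actual source. Under this model, the parser code in this message would work because the <H3> and <A> regions in a bookmark file presumably follow one another and never overlap.

```python
class SaveBuffer:
    """Toy model of a parser with ONE shared save buffer (an assumption)."""

    def __init__(self):
        self.savedata = None            # None means "not currently saving"

    def save_bgn(self):
        self.savedata = ''              # always restarts from an empty buffer

    def handle_data(self, data):
        if self.savedata is not None:   # text is kept only while saving
            self.savedata += data

    def save_end(self):
        data, self.savedata = self.savedata, None
        return data


def sequential_demo():
    """Two save regions, one after the other: both texts survive."""
    p = SaveBuffer()
    p.save_bgn()
    p.handle_data('Folder')
    first = p.save_end()
    p.save_bgn()
    p.handle_data('link text')
    second = p.save_end()
    return first, second


def nested_demo():
    """A save region opened inside another one: the outer text is lost."""
    p = SaveBuffer()
    p.save_bgn()                  # outer region
    p.handle_data('Folder ')
    p.save_bgn()                  # inner region wipes what the outer had saved
    p.handle_data('link text')
    inner = p.save_end()          # buffer is switched off here
    p.handle_data(' tail')        # dropped, nothing is being saved any more
    outer = p.save_end()          # the outer caller gets nothing useful back
    return inner, outer
```

In this model, sequential_demo() returns both strings intact, while nested_demo() hands the inner caller its text and leaves the outer caller with None.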
                                                                                



On Sun, 20 Aug 2000, Paolo G. Cantore wrote:

> Your in_link processing is already provided by the two parser-methods 
> save_bgn() and save_end(). Your code would look like:
> 
> def start_a(self, attrs):
> 	self.save_bgn()
> 	self.linkbuf=link_type(attrs)
> 
> def end_a(self):
> 	self.linkbuf.name=self.save_end()
> 	self.links.append(self.linkbuf)
> 
> that's all
> --
> 
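
htmllib was removed in Python 3, so the snippets in this thread only run on Python 2. For a current interpreter, here is a rough sketch of the same save_bgn()/save_end() idea built on html.parser.HTMLParser from the standard library. BookmarkParser, parse_bookmarks, _save_bgn, _save_end and the SAMPLE markup are all invented for illustration; it is not a drop-in replacement for the classes in this thread, and it stores plain tuples instead of link_type objects.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collects (link text, href, folder path) from bookmark-style HTML."""

    def __init__(self):
        super().__init__()
        self.links = []      # per-instance results
        self.folders = []    # stack of enclosing <h3> folder titles
        self._buf = None     # list of text chunks, or None when not saving
        self._href = None    # href of the <a> currently being read

    def _save_bgn(self):
        self._buf = []

    def _save_end(self):
        # join the chunks and collapse runs of whitespace
        text = ' '.join(''.join(self._buf).split())
        self._buf = None
        return text

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names
        if tag == 'h3':
            self._save_bgn()
        elif tag == 'a':
            self._href = dict(attrs).get('href')
            self._save_bgn()

    def handle_endtag(self, tag):
        if tag == 'h3' and self._buf is not None:
            self.folders.append(self._save_end())
        elif tag == 'dl' and self.folders:
            self.folders.pop()
        elif tag == 'a' and self._buf is not None:
            self.links.append((self._save_end(), self._href, self.folders[:]))

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def parse_bookmarks(html):
    parser = BookmarkParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == '__main__':
    SAMPLE = '''<DL><p>
      <DT><H3>News</H3>
      <DL><p>
        <DT><A HREF="http://example.com/a">Site A</A>
      </DL><p>
      <DT><A HREF="http://example.com/b">Site B</A>
    </DL><p>'''
    for name, href, path in parse_bookmarks(SAMPLE):
        print(name, href, path)
```

Keeping the result lists on the instance rather than on the class means two parsers never share state.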



