how to get text between HTML tags with URLLIB??

Roy Katz katz at Glue.umd.edu
Sat Aug 19 11:33:58 EDT 2000


On Sat, 19 Aug 2000, Alex Martelli wrote:

> By receiving the data and processing it at the end-tag.  That's how
> sequential processing works; the architectural alternative to this
> sequential stream processing is building an in-memory object
> model for the document, the DOM approach, and it's heavier to
> do though more powerful.
> 

Alright.  I now have the following, as per your suggestion:

  import htmllib

  # book-keeping
  class link_type:
    def __init__( self, k ):
 
        # garnered from NS bookmark fields
        self.url=k[0][1]; self.vis=k[1][1];
        self.lv=k[2][1];  self.lm=float(k[3][1])  # netscape uses long int
 
        self.name = ''          # title of the site
        self.sxr  = []          # where in the hierarchy this link is
 
        # placeholders
        self.inf = None;  self.checklink_status=0;  self.u=0
 
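  # override htmllib's start_a/end_a to hook in our registration function
  class wcheck_parse( htmllib.HTMLParser ):
    def __init__( self, formatter ):
        htmllib.HTMLParser.__init__( self, formatter )
        self.links   = []       # per instance, so two parsers never share a list
        self.in_link = 0
        self.strbuf  = ''
        self.linkbuf = None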
 
    def start_a( self, attrs ):
        self.in_link = 1
        self.linkbuf = link_type( attrs )
 
    def end_a( self ):
        self.in_link = 0
        self.linkbuf.name = self.strbuf
        self.links.append( self.linkbuf )
        self.strbuf = ''
 
    def handle_data( self, data ):
        if self.in_link == 1:
            self.strbuf = self.strbuf + data                                    


This approach works fairly well; furthermore, the 'in_link' flag
ensures that strbuf will contain *only* the text between the <a href> and
</a> tags.  There is a problem with this approach, however.  I meant for
link_type.sxr to be a list of strings corresponding to the placement
of the link within the bookmark hierarchy; however, given Netscape's
bookmark format, I see that it will take me a lot more code than I
thought.  I *am* building an in-memory model.  So why re-invent the
wheel? You're right, I'll look at DOM; I just need a few examples of how
to use it effectively.
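
For comparison, the bookkeeping described above (capturing the text of
each link and recording where the link sits in the folder hierarchy) can
also be done with a stream parser and a small stack.  What follows is only
a minimal sketch under stated assumptions: it uses Python 3's html.parser
rather than htmllib, it assumes the bookmark file names each folder in an
<H3> element followed by a <DL> list holding its contents, and the names
Link, BookmarkParser and parse_bookmarks are hypothetical.  Attributes are
looked up by name instead of by position, so an anchor with fewer
attributes does not break it.

```python
from html.parser import HTMLParser


class Link:
    """One bookmark: its attributes, its title, and its folder path."""

    def __init__(self, attrs, path):
        self.attrs = dict(attrs)      # html.parser lowercases attribute names
        self.url = self.attrs.get("href")
        self.name = ""                # filled in when </a> is seen
        self.path = path              # folder names, outermost first


class BookmarkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self.folders = []    # one entry per open <dl>; None if it had no <h3>
        self.pending = None  # text of the last <h3>, claimed by the next <dl>
        self.current = None  # Link being built while inside <a>...</a>
        self.buf = None      # text chunks collected inside <a> or <h3>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            path = [f for f in self.folders if f is not None]
            self.current = Link(attrs, path)
            self.buf = []
        elif tag == "h3":
            self.buf = []
        elif tag == "dl":
            # Assumption: a <dl> belongs to the <h3> seen just before it.
            self.folders.append(self.pending)
            self.pending = None

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.current.name = "".join(self.buf or [])
            self.links.append(self.current)
            self.current = None
            self.buf = None
        elif tag == "h3" and self.buf is not None:
            self.pending = "".join(self.buf)
            self.buf = None
        elif tag == "dl" and self.folders:
            self.folders.pop()

    def handle_data(self, data):
        # Text is kept only while a link title or folder name is open.
        if self.buf is not None:
            self.buf.append(data)


def parse_bookmarks(text):
    """Return the list of Link objects found in a bookmark document."""
    parser = BookmarkParser()
    parser.feed(text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = """<DL><p>
      <DT><H3>Python</H3>
      <DL><p>
        <DT><A HREF="http://www.python.org/">Python home</A>
      </DL><p>
    </DL><p>"""
    for link in parse_bookmarks(sample):
        print(link.path, link.name, link.url)
```

The stack of open folders plays the role that sxr was meant to play: each
link gets a copy of the current folder path at the moment its <a> tag
opens, so no full document tree has to be kept in memory.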



Roey




