how to get text between HTML tags with URLLIB??
Roy Katz
katz at Glue.umd.edu
Sun Aug 20 12:22:49 EDT 2000
Alright, I re-worked the classes as per your suggestion. Interesting
note: the documentation for save_bgn and save_end indicates that
nested execution is prohibited. What is that all about? The following
code works fine (albeit sloppily written)!
Thanks!
Roey
import htmllib

# book-keeping
class link_type:
    def __init__( self, k ):
        # garnered from NS bookmark fields
        self.url = k[0][1]
        self.vis = k[1][1]
        self.lv = k[2][1]
        self.lm = float(k[3][1])  # netscape uses long int
        self.name = ''            # title of the site
        self.linkstack = []       # where in the hierarchy this link is
        # placeholders
        self.inf = None
        self.checklink_status = 0
        self.u = 0

# override htmllib's start_a to hook in our registration function
class wcheck_parse( htmllib.HTMLParser ):
    # class-level lists: shared by every wcheck_parse instance
    links = []
    linkbuf = None
    linkstack = []

    def start_h3( self, attrs ):
        self.save_bgn()

    def end_h3( self ):
        self.linkstack.append( self.save_end() )

    def end_dl( self ):
        if self.linkstack:
            self.linkstack.pop()

    def start_a( self, attrs ):
        self.save_bgn()
        self.linkbuf = link_type( attrs )

    def end_a( self ):
        self.linkbuf.name = self.save_end()
        self.linkbuf.linkstack = self.linkstack[:]
        self.links.append( self.linkbuf )
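On the nesting question: htmllib appears to keep a single save buffer on the parser, so a second save_bgn() issued before the matching save_end() would throw away whatever the outer call had collected. The class above gets away with it because in a Netscape bookmark file the <H3> folder titles and the <A> links never sit inside each other. The following is a minimal stand-in, not htmllib itself: the class name SaveBuffer, the nested_demo function, and the empty-string fallback in save_end are all invented for illustration, and the real parser may fail differently when save_end() is called twice.

```python
class SaveBuffer:
    """Hypothetical stand-in that mimics a single shared save buffer."""

    def __init__(self):
        self.savedata = None          # None means "not saving"

    def save_bgn(self):
        self.savedata = ''            # start over, dropping anything already saved

    def save_end(self):
        data = self.savedata
        self.savedata = None
        # Assumption: fall back to '' if nothing was being saved.
        return ' '.join((data or '').split())

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata = self.savedata + data


def nested_demo():
    p = SaveBuffer()
    p.save_bgn()                      # outer element opens
    p.handle_data('outer text ')
    p.save_bgn()                      # inner element opens: buffer is reset
    p.handle_data('inner text')
    inner = p.save_end()              # saving is now switched off
    p.handle_data(' more outer')      # dropped, because nothing is saving
    outer = p.save_end()
    return inner, outer
```

In this sketch the inner call gets its text back, while the outer call loses everything it had collected.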
On Sun, 20 Aug 2000, Paolo G. Cantore wrote:
> Your in_link processing is already provided by the two parser-methods
> save_bgn() and save_end(). Your code would look like:
>
> def start_a(self, attrs):
>     self.save_bgn()
>     self.linkbuf=link_type(attrs)
>
> def end_a(self):
>     self.linkbuf.name=self.save_end()
>     self.links.append(self.linkbuf)
>
> that's all
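For anyone trying this on a current Python, where htmllib is no longer available, here is a rough sketch of the same save_bgn()/save_end() pattern on top of html.parser.HTMLParser. The save_bgn and save_end methods are hand-written imitations, BookmarkParser and the SAMPLE markup are invented for illustration, and per-instance lists replace the class-level ones.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Hypothetical Python 3 sketch; not the htmllib API."""

    def __init__(self):
        super().__init__()
        self.links = []        # (folder path, url, title) tuples
        self.folders = []      # current folder nesting
        self.savedata = None   # None means "not saving"
        self.href = None

    def save_bgn(self):
        self.savedata = ''

    def save_end(self):
        data, self.savedata = self.savedata, None
        return ' '.join(data.split())   # collapse runs of whitespace

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names in lower case
        if tag == 'h3':
            self.save_bgn()
        elif tag == 'a':
            self.save_bgn()
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'h3' and self.savedata is not None:
            self.folders.append(self.save_end())
        elif tag == 'a' and self.savedata is not None:
            self.links.append((self.folders[:], self.href, self.save_end()))
        elif tag == 'dl' and self.folders:
            self.folders.pop()


# Made-up sample in the style of a Netscape bookmark file
SAMPLE = """
<DL><p>
<DT><H3>News</H3>
<DL><p>
<DT><A HREF="http://example.com/a">Site  A</A>
</DL><p>
<DT><A HREF="http://example.com/b">Site B</A>
</DL><p>
"""

parser = BookmarkParser()
parser.feed(SAMPLE)
parser.close()
```

After feeding the sample, parser.links holds one entry per <A> tag, each carrying a copy of the folder stack as it stood when the link closed.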