how to get text between HTML tags with URLLIB??
Roy Katz
katz at Glue.umd.edu
Sat Aug 19 11:33:58 EDT 2000
On Sat, 19 Aug 2000, Alex Martelli wrote:
> By receiving the data and processing it at the end-tag. That's how
> sequential processing works; the architectural alternative to this
> sequential stream processing is building an in-memory object
> model for the document, the DOM approach, and it's heavier to
> do though more powerful.
>
Alright. I now have the following, as per your suggestion:
# book-keeping
class link_type:
    def __init__( self, k ):
        # k is the list of (name, value) attribute pairs handed to start_a;
        # the fields are read by position
        # garnered from NS bookmark fields
        self.url=k[0][1]; self.vis=k[1][1];
        self.lv=k[2][1]; self.lm=float(k[3][1])  # netscape uses long int
        self.name = ''  # title of the site
        self.sxr = []   # where in the hierarchy this link is
        # placeholders
        self.inf = None; self.checklink_status=0; self.u=0
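import htmllib

# override htmllib's start_a/end_a to hook in our registration function
class wcheck_parse( htmllib.HTMLParser ):
    in_link = 0
    strbuf = ''
    linkbuf = None
    def __init__( self, formatter ):
        htmllib.HTMLParser.__init__( self, formatter )
        # per-instance list, so that two parsers never share results
        self.links = []
    def start_a( self, attrs ):
        self.in_link = 1
        self.linkbuf = link_type( attrs )
    def end_a( self ):
        # the link text is complete only once the end tag arrives
        self.in_link = 0
        self.linkbuf.name = self.strbuf
        self.links.append( self.linkbuf )
        self.strbuf = ''
    def handle_data( self, data ):
        # buffer text only while inside <a ...> ... </a>
        if self.in_link == 1:
            self.strbuf = self.strbuf + data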
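
htmllib is tied to the Python releases that ship it. As a hedged sketch only, the same flag-and-buffer technique can be written against the standard `html.parser` module of later Python versions. `LinkTextParser`, `link_texts` and the sample markup in the usage line are hypothetical names chosen for illustration, not part of the code above.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.links = []       # per-instance result list
        self.in_link = False  # True only between <a ...> and </a>
        self.href = None
        self.strbuf = ''

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case
        if tag == 'a':
            self.in_link = True
            self.href = dict(attrs).get('href')
            self.strbuf = ''

    def handle_endtag(self, tag):
        # the text is known to be complete only at the end tag
        if tag == 'a' and self.in_link:
            self.in_link = False
            self.links.append((self.href, self.strbuf))

    def handle_data(self, data):
        if self.in_link:
            self.strbuf += data


def link_texts(html):
    """Hypothetical helper: parse a string and return the collected pairs."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `link_texts('<A HREF="http://a/">Site <b>A</b></A>')` gathers the text on both sides of the nested `<b>` tag into one string, because the buffer is only flushed at `</a>`.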
This approach works fairly well; furthermore, the 'in_link' flag
ensures that strbuf will contain *only* the text between the <a href> and
</a> tags. There is a problem with this approach, however. I meant for
link_type.sxr to be a list of strings corresponding to the placement
of the link within the bookmark hierarchy; however, given Netscape's
bookmark format, I see that it will take me a lot more code than I
thought. I *am* building an in-memory model. So why re-invent the
wheel? You're right, I'll look at DOM, I just need a few examples of how
to use it effectively.
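
As a starting point for the DOM route, here is a minimal sketch using `xml.dom.minidom` from the Python standard library. It assumes the input is well-formed XML, which a raw Netscape bookmark file may not be. The `<folder>` element with a `name` attribute is a made-up stand-in for the real bookmark nesting, and `node_text` and `link_paths` are hypothetical helpers.

```python
from xml.dom import minidom

# Hypothetical, well-formed stand-in for a bookmark file.
SAMPLE = """<bookmarks>
  <folder name="Python">
    <a href="http://www.python.org/">Python <b>home</b></a>
    <folder name="Lists">
      <a href="http://mail.python.org/">Mailing lists</a>
    </folder>
  </folder>
</bookmarks>"""


def node_text(node):
    """Concatenate all text found below node, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(node_text(child))
    return ''.join(parts)


def link_paths(xml_source):
    """Return (folder path, href, link text) for each <a> element."""
    doc = minidom.parseString(xml_source)
    results = []
    for a in doc.getElementsByTagName('a'):
        path = []
        parent = a.parentNode
        # Walk up towards the root, recording the enclosing folder names.
        while parent is not None and parent.nodeType == parent.ELEMENT_NODE:
            if parent.tagName == 'folder':
                path.insert(0, parent.getAttribute('name'))
            parent = parent.parentNode
        results.append((path, a.getAttribute('href'), node_text(a)))
    return results
```

With the whole tree in memory, a link's place in the hierarchy comes from walking `parentNode` upwards, so no flags or buffers have to be maintained while parsing.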
Roey