[Tutor] SGML-Parser: was finding title tag
Charlie Clark
Charlie@begeistert.org
Wed, 24 Oct 2001 23:29:53 +0200
>Dear Samir,
>
>I'm slightly busy at the moment, but I'll be able to answer your question
>tonight. I'm forwarding this to the other tutors on the mailing list, so
>that someone has a chance to answer you. Best of wishes!
>
>
>---------- Forwarded message ----------
>Date: Tue, 16 Oct 2001 17:22:14 +1000 (EST)
>From: Samir Patel <sampatel@cs.rmit.edu.au>
>To: Danny Yoo <dyoo@hkn.eecs.berkeley.edu>
>Subject: Re: [Tutor] finding title tag
>
>hi,
>thanx a lot for the help...
>i am confused with the use of class ....as i want to find links at the
>depth of 3 or more how can i recursively call this method.....
>
>also in following soln of yours it will separate links and title for that
>links ..so i will not be able to keep track of which title belongs to
>which link....and also if there's no title then i have to get the text of
>the link
>
Hello Samir,
is this still urgent? I read your message last week, before I had to
leave for München and run BeGeistert at the weekend. Now that I'm back at
my machine I feel I owe it to the list, and to Danny in particular, to try
and help you out: it took me a while to get used to the HTML parser, but
since then I've been parsing HTML like a maniac...
I've never come across "title" as an attribute in an anchor tag. As
<title> is a tag, I think "title" is probably not allowed... but there's
lots of sub-standard HTML out there. I've had to collect lists of links
and, like you, I needed to keep the various attributes together. The
solution is quite simply to put the attributes of each anchor into a
dictionary and put that dictionary in a list. It only needs some slight
modifications to the code:
from sgmllib import SGMLParser

class AnchorParser(SGMLParser):
    """This class pays attention to anchor tags.  Once we feed() a
    document into an AnchorParser, the 'anchorlist' attribute holds one
    dictionary per anchor, with the href under 'url' and the title
    under 'title'."""
    def __init__(self):
        SGMLParser.__init__(self)
        self.anchor = {'url': '', 'title': ''}  # a dictionary for each anchor
        self.anchorlist = []

    def start_a(self, attributes):
        """For each anchor tag, pay attention to the href and title
        attributes."""
        href, title = '', ''
        for name, value in attributes:
            if name == 'href': href = value
            if name == 'title': title = value
        self.anchor['url'] = href
        self.anchor['title'] = title

    def end_a(self):
        self.anchorlist.append(self.anchor)     # store the anchor in a list
        self.anchor = {'url': '', 'title': ''}  # reset, ready for the next anchor
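
sgmllib only exists in Python 2. For anyone on Python 3, roughly the same
idea can be sketched with html.parser.HTMLParser from the standard library,
swapped in here for SGMLParser. This version also collects the text inside
the anchor, so that it can stand in when there is no title attribute. The
class name LinkCollector and the 'text' key are made up for illustration;
treat it as a sketch, not something tested against real-world pages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect one dictionary per anchor: url, title and link text."""

    def __init__(self):
        super().__init__()
        self.anchor = None      # dictionary for the anchor being parsed, if any
        self.anchorlist = []    # finished anchors, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            # Attribute values can be None (e.g. a bare "href"), hence "or ''".
            self.anchor = {'url': attrs.get('href') or '',
                           'title': attrs.get('title') or '',
                           'text': ''}

    def handle_data(self, data):
        # Text is only interesting while we are inside an anchor.
        if self.anchor is not None:
            self.anchor['text'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor is not None:
            # Collapse runs of whitespace in the link text.
            self.anchor['text'] = ' '.join(self.anchor['text'].split())
            if not self.anchor['title']:
                # No title attribute: fall back to the text of the link.
                self.anchor['title'] = self.anchor['text']
            self.anchorlist.append(self.anchor)
            self.anchor = None


parser = LinkCollector()
parser.feed('<p><a href="http://www.begeistert.org" title="some title">BeGeistert</a> '
            'and <a href="/about.html">About  us</a></p>')
parser.close()
for anchor in parser.anchorlist:
    print(anchor['url'], '->', anchor['title'])
```

Because each anchor gets its own dictionary, the url, title and text of one
link always stay together.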
Running this should work a bit like this:

biglist = []
c = AnchorParser()
c.feed(src)     # where src is the HTML source of the page to be parsed
c.close()
l = c.anchorlist
biglist.append(l)
print l
[{'url': 'http://www.begeistert.org', 'title': 'some title'}]

Looping over the list should be straightforward:

import urllib
for item in l:
    src_url = item['url']
    src = urllib.urlopen(src_url).read()    # fetch the HTML behind the link
    c = AnchorParser()
    c.feed(src)
    c.close()
    biglist.append(c.anchorlist)            # the links found on that page

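
For following links to a depth of 3 or more, the loop above can be repeated
level by level. Below is a rough Python 3 sketch of that. It uses a
breadth-first loop rather than recursion, so that each page is fetched only
once. The names HrefParser, fetch and crawl are invented for this example,
fetch assumes pages decode as UTF-8 and does no error handling, and the
pages in the usage part are fake, so that nothing is downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HrefParser(HTMLParser):
    """Collect the href of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Download a page and return its text (assumption: UTF-8 is good enough)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def crawl(start_url, max_depth, fetch=fetch):
    """Return the set of URLs found within max_depth link hops of start_url."""
    seen = {start_url}
    level = [start_url]                 # pages to parse at the current depth
    for _ in range(max_depth):
        next_level = []
        for url in level:
            parser = HrefParser()
            parser.feed(fetch(url))
            parser.close()
            for href in parser.hrefs:
                target = urljoin(url, href)     # resolve relative links
                if target not in seen:
                    seen.add(target)
                    next_level.append(target)
        level = next_level
    return seen


# Fake pages standing in for the web, so the example runs offline.
pages = {
    'http://example.org/': '<a href="a.html">A</a> <a href="b.html">B</a>',
    'http://example.org/a.html': '<a href="c.html">C</a>',
    'http://example.org/b.html': '<a href="/">home</a>',
    'http://example.org/c.html': '<a href="d.html">D</a>',
}
found = crawl('http://example.org/', 2, fetch=lambda url: pages.get(url, ''))
print(sorted(found))
```

With a depth of 2 the sketch reaches c.html but never parses it, so d.html
is not found. Passing fetch in as an argument is only there to make the
function easy to try out without a network connection.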
Put this into nice little functions and repeat as often as necessary. I
haven't tested this but it shouldn't take much to get it to work. Can
anyone tell me how to loop through the list with the new iterator function?
Charlie