[Tutor] SGML-Parser: was finding title tag

Charlie Clark Charlie@begeistert.org
Wed, 24 Oct 2001 23:29:53 +0200


>Dear Samir,  
>  
>I'm slightly busy at the moment, but I'll be able to answer your question
>tonight.  I'm forwarding this to the other tutors on the mailing list, so
>that someone has a chance to answer you.  Best of wishes!
>  
>  
>---------- Forwarded message ----------  
>Date: Tue, 16 Oct 2001 17:22:14 +1000 (EST)  
>From: Samir Patel <sampatel@cs.rmit.edu.au>  
>To: Danny Yoo <dyoo@hkn.eecs.berkeley.edu>  
>Subject: Re: [Tutor] finding title tag  
>  
>hi,  
>thanx a lot for the help...  
>i am confused with the use of class ....as i want to find links at the
>depth of 3 or more how can i recursively call this method.....
>
>also in following soln of yours it will separate links and title for that
>links ..so i will not be able to keep track of which title belongs to
>which link....and also if there's no title then i have to get the text of
>the link
>  
Hello Samir,

Is this still urgent? I read the message last week before I had to leave
for München and run BeGeistert at the weekend. Back at my machine, I
feel I owe it to the list, and Danny in particular, to try to help you
out. It took me a while to get used to the HTML parser, but since then
I've been parsing HTML like a maniac...
 
I've never come across "title" as an attribute in an anchor tag. It is
separate from the <title> tag, and HTML 4 does allow it on <a> as a
general attribute, but there's lots of sub-standard HTML out there, so
I wouldn't rely on it being present. I've had to collect lists of links,
and like you I need to be able to keep various attributes together. The
solution is quite simply to put the attributes into a dictionary and put
the dictionary in a list. It only needs some slight modifications to the
code:
 
from sgmllib import SGMLParser

class AnchorParser(SGMLParser):
    """This class pays attention to anchor tags.  Once we feed() a
    document into an AnchorParser, 'anchorlist' holds one dictionary
    per anchor, with its href under 'url' and its title under 'title'."""
    def __init__(self):
        SGMLParser.__init__(self)
        # a dictionary for the anchor currently being parsed
        self.anchor = {'url': '', 'title': ''}
        self.anchorlist = []

    def start_a(self, attributes):
        """For each anchor tag, pay attention to the href and title
        attributes."""
        href, title = '', ''
        for name, value in attributes:
            if name == 'href': href = value
            if name == 'title': title = value
        self.anchor['url'] = href
        self.anchor['title'] = title

    def end_a(self):
        # store the finished anchor in the list
        self.anchorlist.append(self.anchor)
        # start a fresh dictionary, ready for the next anchor
        self.anchor = {'url': '', 'title': ''}
 
Running this should work a bit like this:
biglist = []
c = AnchorParser()
c.feed(src) # where src is the HTML source of the page to be parsed
c.close()

l = c.anchorlist
biglist.append(l)

print l
[{'url':'http://www.begeistert.org', 'title':'some title'}]
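
Samir also asked about falling back to the text of the link when there is no title. A minimal sketch of that idea follows. It is an assumption-laden variant, not the code above: sgmllib no longer exists in Python 3, so this uses html.parser from the Python 3 standard library instead, and the names AnchorTextParser, parse_anchors and the extra 'text' key are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Hypothetical variant of AnchorParser: one dict per <a> tag."""

    def __init__(self):
        super().__init__()
        self.anchor = None      # dict for the anchor being parsed, if any
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            # attribute values can be None (e.g. a bare "title"), hence "or ''"
            self.anchor = {'url': attrs.get('href') or '',
                           'title': attrs.get('title') or '',
                           'text': ''}

    def handle_data(self, data):
        # collect text only while we are inside an anchor
        if self.anchor is not None:
            self.anchor['text'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor is not None:
            # no title attribute: fall back to the text of the link
            if not self.anchor['title']:
                self.anchor['title'] = self.anchor['text'].strip()
            self.anchorlist.append(self.anchor)
            self.anchor = None


def parse_anchors(html):
    """Return the list of anchor dictionaries found in an HTML string."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.anchorlist
```

Because each anchor gets its own dictionary, the url, title and text of one link always stay together, which is the point of the dictionary-in-a-list approach.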
 
Looping over the list should be straightforward:
import urllib

for item in l:
    src_url = item['url']
    src = urllib.urlopen(src_url).read()  # fetch the page behind the link
    c = AnchorParser()
    c.feed(src)
    c.close()
    biglist.append(c.anchorlist)  # one list of anchors per page
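
For following links down to a depth of 3, as Samir asked, one possible shape is sketched below. Again this is only a sketch under assumptions: it is written for Python 3 with html.parser and urllib.request rather than sgmllib, it uses a level-by-level loop instead of recursion so that each page is fetched once at its shallowest depth, and HrefParser, fetch and crawl are hypothetical names. It has no error handling, so links such as mailto: addresses or dead pages would need extra care.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HrefParser(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Default fetcher: download a page and decode it leniently."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def crawl(start_url, max_depth, fetch=fetch):
    """Return {page_url: [absolute link, ...]} for pages within max_depth levels.

    max_depth=1 parses only the start page; max_depth=3 also parses the
    pages it links to and the pages those link to.
    """
    pages = {}
    current = [start_url]
    for _ in range(max_depth):
        next_level = []
        for url in current:
            if url in pages:        # already parsed, avoid loops
                continue
            parser = HrefParser()
            parser.feed(fetch(url))
            parser.close()
            # resolve relative links against the page they came from
            links = [urljoin(url, href) for href in parser.hrefs]
            pages[url] = links
            next_level.extend(links)
        current = next_level
    return pages
```

Passing the fetcher in as an argument keeps the loop testable without a network connection, for example by handing it a dictionary lookup over a few canned pages.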

Put this into nice little functions and repeat as often as necessary. I 
haven't tested this but it shouldn't take much to get it to work. Can 
anyone tell me how to loop through the list with the new iterator function?

Charlie