[Tutor] finding title tag

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon, 15 Oct 2001 11:53:50 -0700 (PDT)


On Mon, 15 Oct 2001, Samir Patel wrote:

> i have written a program which takes in the url at command line
> argument and finds the links present in that particular web page and
> this continues to a depth of 3 or more.....

> i store this links in a universal list ..now my problem is that i want
> to find if there's any title tag present in this link e.g
> 
> <a href= "this is url" title = "i want this">or this</a>

If we're using one of the parsers from the standard library, like
HTMLParser or SGMLParser, this isn't too hard --- for each tag that the 
parser encounters, the parser will give us a list of attributes.



> def findlinks(url):
>    try:
>        fp = urllib.urlopen(url)
>    except IOError:
>        return []
>    results = []
> 
>    p = HTMLParser(NullFormatter())
>    p.feed(fp.read())
> 
>    return p.anchorlist    # return the list of lines which have a link


htmllib.HTMLParser is a little more specialized as a parser than
sgmllib.SGMLParser --- from the documentation, the HTMLParser only expects
'name' and 'type' attributes from the anchors tags.  I think it might be
too specialized for the task, as it doesn't pay attention to the titles of
anchors.

(At the same time, should the anchors have titles in the first place?  Is
this standard HTML?)


It might be best to write our own parser to handle both the list of
anchors and the list of titles.  Here's one parser that should do the job:

###
from sgmllib import SGMLParser

class AnchorParser(SGMLParser):
    """This class pays attention to anchor tags.  Once we feed() a
    document into an AnchorParser, we'd have the hrefs in the
    'anchorlist' attribute, and the titles in the 'titlelist'
    attribute."""
    def __init__(self):
        SGMLParser.__init__(self)
        self.anchorlist = []
        self.titlelist = []

    def start_a(self, attributes):
        """For each anchor tag, pay attention to the href and title
        attributes."""
        href, title = '', ''
        for name, value in attributes:
            if name == 'href': href = value
            if name == 'title': title = value
        self.anchorlist.append(href)
        self.titlelist.append(title)

    def end_a(self):
        pass
###



If we have something like AnchorParser, we can write a
findtitles() function that looks very similar to your findlinks():


###
def findlinks(url):
   try:
       fp = urllib.urlopen(url)
   except IOError:
       return []
   p = AnchorParser()
   p.feed(fp.read())
   return p.titlelist
###


If you're doing to do a lot of parsing, it might be a good idea to read
more about sgmllib.SGMLParser: 

    http://www.python.org/doc/lib/module-sgmllib.html

Feel free to ask more questions.  Good luck to you!