[Tutor] Help with Parsing HTML files

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Sat, 4 Aug 2001 12:39:21 -0700 (PDT)


On Fri, 3 Aug 2001, Sean 'Shaleh' Perry wrote:

> > What I don't see is how the handle_image function/method looks for
> > images and I need to learn how to use this in order to modify it for
> > my own dark purposes! Please help.

Maybe an example will help.  Here's a small program that, given the URL of
a web page, tries to pull out all the image names.  (Rob, here's another
useless python script.  *grin*)

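This example will use htmllib to help us "parse" and hunt down IMG tags.
I don't think we need to explicitly rewrite handle_image().  The parser
hands each tag it finds to a method named after that tag.  For HTML
elements that have start and end tags, we'd define
"start_nameofsometag()" and "end_nameofsometag()" methods.  However, since
an IMG tag stands alone, we'll write a do_img() method instead.  (htmllib's
own do_img() is what ends up calling handle_image(), so overriding do_img()
lets us look at the tag's attributes directly.)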
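
As a rough picture of that naming convention, here's a toy sketch of the
lookup.  It is not sgmllib's actual code: TinyDispatcher and handle_tag()
are made-up names, and the only thing it tries to show is that the method
is found by gluing a prefix onto the tag name.

```python
class TinyDispatcher:
    """Toy stand-in for the parser's method lookup (hypothetical class)."""

    def __init__(self):
        self.sources = []

    def handle_tag(self, tag, attributes):
        # Try start_<tag> first, then do_<tag>, looking each up by name.
        for prefix in ('start_', 'do_'):
            method = getattr(self, prefix + tag, None)
            if method is not None:
                method(attributes)
                return prefix + tag
        return None  # no handler defined for this tag, so it is skipped

    def do_img(self, attributes):
        # attributes is a list of (name, value) pairs, as in the real parser.
        for name, value in attributes:
            if name == 'src':
                self.sources.append(value)


dispatcher = TinyDispatcher()
handled_by = dispatcher.handle_tag('img', [('src', 'logo.gif')])
ignored = dispatcher.handle_tag('blink', [])
```

So a subclass only has to define a method with the right name; it never
has to register it anywhere.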



###
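import htmllib
import formatter
import sys
import urllib

class ImagePuller(htmllib.HTMLParser):
    """Collects the SRC attribute of every IMG tag it is fed."""

    def __init__(self):
        # We only care about the tags, not about rendering the page,
        # so a formatter that throws everything away is enough.
        htmllib.HTMLParser.__init__(self,
                                    formatter.NullFormatter())
        self.list_of_images = []

    def do_img(self, attributes):
        # Called once per IMG tag; attributes is a list of
        # (name, value) pairs.
        for name, value in attributes:
            if name == 'src':
                new_image = value
                self.list_of_images.append(new_image)

    def getImageList(self):
        return self.list_of_images

if __name__ == '__main__':
    # Usage: pass the URL of the page as the first command-line argument.
    url = sys.argv[1]
    url_contents = urllib.urlopen(url).read()
    puller = ImagePuller()
    puller.feed(url_contents)
    print puller.getImageList()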
###
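
The script above depends on htmllib, which was removed in Python 3.  For
anyone on a newer interpreter, here's a minimal sketch of the same idea
using html.parser from the standard library.  Treat it as one possible
translation, not a drop-in replacement: the class name ImagePuller3 and the
sample HTML string are made up for illustration, and html.parser has no
do_img() hook, so the tag name is checked inside handle_starttag() instead.

```python
from html.parser import HTMLParser


class ImagePuller3(HTMLParser):
    """Hypothetical Python 3 counterpart of ImagePuller."""

    def __init__(self):
        super().__init__()
        self.list_of_images = []

    def handle_starttag(self, tag, attributes):
        # html.parser calls this for every opening tag, with the tag
        # name and attribute names already lowercased.
        if tag == 'img':
            for name, value in attributes:
                if name == 'src':
                    self.list_of_images.append(value)


# Feed a small made-up snippet instead of fetching a real page.
puller = ImagePuller3()
puller.feed('<p><img src="a.gif"><IMG SRC="b.png" alt="x"></p>')
images = puller.list_of_images
```

To run it against a live page you would read the HTML with
urllib.request.urlopen(), decode the bytes to a string, and pass that
string to feed().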


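For more information about this, take a look at the sgmllib documentation
(htmllib's parser is built on top of sgmllib, which is where the
start_/end_/do_ method names come from):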

    http://python.org/doc/current/lib/module-sgmllib.html