[Tutor] Help with Parsing HTML files

Charlie Clark <charlie@begeistert.org>
Fri, 03 Aug 2001 14:24:02 +0200


>I had a look at the archives, I don't see how to search it,
>only to browse a month of postings at a time.

I couldn't find a search button either, so I downloaded the last three months 
and searched them in my editor. It was in the merry month of May that this 
was last up for discussion, and Remco posted the following snippet:

import htmllib

class ImgFinder(htmllib.HTMLParser):
   def __init__(self):
      # Normal HTMLParser takes a 'formatter' argument but we don't need it
      htmllib.HTMLParser.__init__(self, None)
   def handle_image(self, source, alt, *args):
      # *args holds "ismap", "align", "width" and "height", if available,
      # but we ignore those here
      print "Found an image!"
      print "Source =", source, "Alt =", alt

finder = ImgFinder()
finder.feed(aa)   # aa is a string of HTML; the parser should find the images in it

Now I need an idiot-proof user guide for this. We define a class ImgFinder 
which is based on htmllib.HTMLParser. When we construct an instance of this 
class, the formatter argument is set to None. We add a method "handle_image" 
which takes the image's attributes as arguments and prints "Found an image!" 
together with the image source and alt text. "finder" is an instance of our 
class and gets fed a string - I assume this would be the contents of an HTML 
file.
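
To check my reading of "gets fed a string", here is a minimal sketch of the 
same idea. It is not Remco's code: htmllib no longer exists in Python 3, so 
this assumes html.parser.HTMLParser as a stand-in, and the sample string and 
the "found" list are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 stand-in for htmllib (an assumption)


class ImgFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []  # (source, alt) pairs, collected instead of printed

    def handle_starttag(self, tag, attrs):
        # feed() calls this for every start tag it sees;
        # attrs is a list of (name, value) pairs
        if tag == "img":
            attrs = dict(attrs)
            self.found.append((attrs.get("src", ""), attrs.get("alt", "")))


# Hypothetical sample input; in practice this would be the text read from an HTML file
sample = '<html><body><p>Hi</p><img src="logo.gif" alt="Logo"></body></html>'

finder = ImgFinder()
finder.feed(sample)   # hand the whole document to the parser as one string
finder.close()        # tell the parser there is no more input

for source, alt in finder.found:
    print("Source =", source, "Alt =", alt)
```

So the string really is just the text of the page; the parser walks through 
it and calls back into the class whenever it meets a tag.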

What I don't see is how the handle_image method looks for images, and I need 
to understand that in order to modify it for my own dark purposes! Please 
help.
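
My best guess at the mechanism, as a sketch rather than the real thing: 
handle_image does not look for anything itself. The parser finds each tag, 
looks for a method named after it (do_img for <img>, as far as I remember 
from htmllib), and that method pulls out the attributes and calls 
handle_image. The DispatchingParser class below is hypothetical and is built 
on Python 3's html.parser, not on htmllib.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical sketch of htmllib-style dispatch: tag -> do_<tag> -> hook."""

    def handle_starttag(self, tag, attrs):
        # Look for a method named after the tag, e.g. do_img for <img>
        method = getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)

    def do_img(self, attrs):
        # Pull the interesting attributes out of the (name, value) pairs,
        # then pass them to the overridable hook
        src, alt = "", "(image)"
        for name, value in attrs:
            if name == "src":
                src = value
            elif name == "alt":
                alt = value
        self.handle_image(src, alt)

    def handle_image(self, source, alt, *args):
        # Default hook does nothing; subclasses override it
        pass


class ImgFinder(DispatchingParser):
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_image(self, source, alt, *args):
        # Only this method needs to change to do something else with images
        self.images.append((source, alt))


finder = ImgFinder()
finder.feed('<p>text</p><img src="a.png" alt="A"><img src="b.png">')
finder.close()
print(finder.images)
```

If that is right, then modifying the snippet means changing only the body of 
handle_image, and handling other tags means adding similar methods for them.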

Charlie