HTML Parsing

Tue Jan 2 16:49:22 EST 2001

"Denis Voitenko" <richmedium at mediaone.net> wrote in message
news:Vvq46.6762$ca.70091 at typhoon.jacksonville.mediaone.net...
> I am trying to do some HTML parsing with htmllib but not getting anywhere.
> Can someone give a couple of basic examples?

Classic basic example #1: list the links from an HTML page
that you have in a file on disk:

import htmllib
import formatter

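# NullFormatter discards rendered output; we only want the parser's bookkeeping
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(open('thepage.html').read())
parser.close()

# anchorlist collects the HREF values of the <A> tags the parser has seen
print parser.anchorlist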

Particularly easy, because links are specially processed by
the supplied HTML parser class, so you need no inheriting
and overriding for customization purposes.  But, when you
do, things aren't that much harder.  E.g., say that what you
need are the SRC URLs of all IMG tags on the page:

import htmllib
import formatter

class MyParser(htmllib.HTMLParser):
    def __init__(self, formatterObject):
        htmllib.HTMLParser.__init__(self, formatterObject)
        self.image_sources = []
    def do_img(self, attributes):
        # attributes is a list of (name, value) pairs; names arrive lowercased
        for name, value in attributes:
            if name == 'src':
                self.image_sources.append(value)

parser = MyParser(formatter.NullFormatter())
parser.feed(open('thepage.html').read())
parser.close()

print parser.image_sources

Since an <IMG> tag has no closing tag, it is handled by a
single method called do_img (rather than by a start_img and
end_img pair, for opening and closing tags, as other tags
would require).  The attributes argument is a list of
name/value pairs (2-element tuples), so we just loop on it
in our subclass's overriding do_img method to identify the
SRC attribute[s] and record their values.
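
If you are on Python 3, where htmllib is no longer in the
standard library, roughly the same idea can be sketched with
html.parser.  This is only a sketch under that assumption, not
part of the examples above: the class name LinkAndImageParser
and the sample string are made up for illustration.  There are
no per-tag do_xxx or start_xxx methods in html.parser, so one
handle_starttag method checks the tag name itself, but the
attributes still arrive as a list of name/value pairs.

```python
from html.parser import HTMLParser


class LinkAndImageParser(HTMLParser):
    """Hypothetical helper: collects A HREF and IMG SRC values."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []      # HREF values of <a> tags
        self.image_sources = []   # SRC values of <img> tags

    def handle_starttag(self, tag, attributes):
        # html.parser lowercases tag and attribute names;
        # attributes is a list of (name, value) 2-tuples.
        for name, value in attributes:
            if tag == 'a' and name == 'href':
                self.anchorlist.append(value)
            elif tag == 'img' and name == 'src':
                self.image_sources.append(value)


# Made-up sample input; in practice you would feed the file's contents.
sample = '<p><a href="one.html">one</a> <IMG SRC="pic.gif" ALT="x"></p>'

parser = LinkAndImageParser()
parser.feed(sample)
parser.close()

print(parser.anchorlist)      # ['one.html']
print(parser.image_sources)   # ['pic.gif']
```

To run it against a real page, replace the sample string with
the text read from your own file.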

Alex