HTMLParser bug?

Matt Gushee mgushee at havenrock.com
Thu Dec 30 14:22:49 EST 1999


Happy Almost Y2K, Everybody--

I'm working on a web-related application, and I wrote a Webhandler
class which retrieves and parses web pages. Among other things, it is
supposed to return to the main application a list of links from the
current page.

I'm using htmllib.HTMLParser instantiated like this:

	from formatter import AbstractFormatter, NullWriter
	from htmllib import HTMLParser
	f = AbstractFormatter(NullWriter())
	self.parser = HTMLParser(f)

At first I tried creating the parser instance once, in my __init__
method, but I ran into trouble because the parser seems to preserve
data between invocations, even if I call the reset() method. As a
result, when my parsing function constructs absolute URLs from
relative ones, it often combines old paths (i.e., leftover data in
self.parser.anchorlist) with new hostnames.

The problem goes away if I create a new parser instance for every
page, but I wanted to avoid that if I could. Is this a bug, or have I
misunderstood how to use htmllib?
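
To make the symptom concrete, here is a minimal sketch of a reusable
parser whose link list is cleared on every reset(). It is not the
htmllib code from my application: htmllib is unavailable in current
Python, so this uses html.parser.HTMLParser and urllib.parse.urljoin
from the modern standard library instead. The LinkCollector class, the
extract_links() helper, and the example.* URLs are all hypothetical
names made up for illustration; only the anchorlist attribute name is
borrowed from htmllib.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for the parser in the post (not htmllib)."""

    def reset(self):
        # html.parser.HTMLParser.__init__ calls reset(), so this both
        # creates the list and empties it on every later reset().
        super().reset()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # Record the raw href of each <a> tag, relative or absolute.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def extract_links(parser, page_url, html):
    """Parse one page and resolve its links against that page's URL."""
    parser.reset()   # discard links left over from the previous page
    parser.feed(html)
    parser.close()
    return [urljoin(page_url, href) for href in parser.anchorlist]


# One parser instance reused across two pages on different hosts.
parser = LinkCollector()
first = extract_links(parser, "http://one.example/docs/index.html",
                      '<a href="a.html">A</a>')
second = extract_links(parser, "http://two.example/",
                       '<a href="/b.html">B</a>')
```

If the reset() override is removed and anchorlist is created only once
in __init__, the second call also returns the first page's path
resolved against the second page's hostname, which matches the mix-up
described above.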

-- 
Matt Gushee
Portland, Maine, USA
mgushee at havenrock.com
http://www.havenrock.com/


