Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

Sat Jul 8 03:10:59 EDT 2006

Kenneth McDonald wrote:

> The problem I'm having with HTMLParser is simple; I don't seem to be 
> getting the actual text in the HTML document. I've implemented the 
> do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but 
> it never seems to receive any data. Is there another way to access the 
> text chunks as they come along?

the method is called "handle_data", not "do_data":

     http://docs.python.org/lib/module-HTMLParser.html
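
for example, here's a minimal sketch of a subclass that collects the text chunks as they come along. the TextCollector name and the sample markup are made up for illustration, and the import assumes the Python 3 module name (html.parser); with the HTMLParser module discussed here it would be "from HTMLParser import HTMLParser" instead:

```python
# assumption: Python 3 module name; the Python 2 equivalent is
# "from HTMLParser import HTMLParser"
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that collects the text between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # the parser calls this for each run of text it finds
        self.chunks.append(data)


parser = TextCollector()
parser.feed("<html><body><p>Hello, <b>world</b>!</p></body></html>")
parser.close()  # flush anything still buffered

text = "".join(parser.chunks)
print(text)
```

the text may arrive in several pieces (here it's split around the <b> element), so it's safer to join the chunks afterwards than to assume one call per element.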

> HTMLParser would probably be the way to go if I can figure this out. It 
> seems much simpler than htmllib, and satisfies my requirements.
> 
> htmllib will write out the text data (using the AbstractFormatter and 
> AbstractWriter), but my problem here is conceptual. I simply don't 
> understand why all of these different "levels" of abstractness are 
> necessary, nor how to use them.

if you're not interested in HTML *rendering*, use sgmllib instead.

     http://docs.python.org/lib/module-sgmllib.html

the only difference between sgmllib and HTMLParser is that HTMLParser is 
a bit stricter; on the other hand, if you want to parse really messy 
HTML, you should probably use BeautifulSoup instead:

     http://www.crummy.com/software/BeautifulSoup/

</F>