Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.
Fredrik Lundh
fredrik at pythonware.com
Sat Jul 8 03:10:59 EDT 2006
Kenneth McDonald wrote:
> The problem I'm having with HTMLParser is simple; I don't seem to be
> getting the actual text in the HTML document. I've implemented the
> do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
> it never seems to receive any data. Is there another way to access the
> text chunks as they come along?
the method is called "handle_data":
http://docs.python.org/lib/module-HTMLParser.html
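a minimal sketch of a subclass that collects the text chunks, in case it helps. the class name TextCollector and the sample markup are made up for illustration, and the import uses the Python 3 module name (html.parser); on Python 2.x the same class is imported from the HTMLParser module instead.

```python
# Python 3 spelling; on Python 2.x use: from HTMLParser import HTMLParser
from html.parser import HTMLParser


class TextCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # the parser calls this with each run of text found between tags
        self.chunks.append(data)


parser = TextCollector()
parser.feed("<html><body><p>Hello, <b>world</b>!</p></body></html>")
parser.close()

# join the collected chunks back into the document's text
text = "".join(parser.chunks)
print(text)
```

a method with any other name (such as do_data) is just an ordinary method that the parser never calls, which would explain why no data arrives.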
> HTMLParser would probably be the way to go if I can figure this out. It
> seems much simpler than htmllib, and satisfies my requirements.
>
> htmllib will write out the text data (using the AbstractFormatter and
> AbstractWriter), but my problem here is conceptual. I simply don't
> understand why all of these different "levels" of abstractness are
> necessary, nor how to use them.
if you're not interested in HTML *rendering*, use sgmllib instead.
http://docs.python.org/lib/module-sgmllib.html
the only difference between sgmllib and HTMLParser is that HTMLParser is a
bit stricter; on the other hand, if you want to parse really messy HTML,
you should probably use BeautifulSoup instead:
http://www.crummy.com/software/BeautifulSoup/
</F>