html parser?

Tue Oct 18 13:32:28 EDT 2005

Thorsten Kampe wrote:
> For simple things like that "BeautifulSoup" might be overkill.

[HTMLParser example]

I've used SGMLParser with some success before, although the SAX-style
processing is objectionable to many people. One alternative is to use
libxml2dom [1] and to parse documents as HTML:

import libxml2dom, urllib
url = 'http://www.python.org'
doc = libxml2dom.parse(urllib.urlopen(url), html=1)
anchors = doc.xpath("//a")

Currently, the parseURI function in libxml2dom doesn't do HTML parsing,
mostly because I haven't yet figured out what combination of parsing
options have to be set to make it happen, but a combination of urllib
and libxml2dom should perform adequately. In the above example, you'd
process the nodes in the anchors list to get the desired results.

Paul

[1] http://www.python.org/pypi/libxml2dom