beautifulsoup vs. tidy

Paul Boddie paul at boddie.org.uk
Sat Jul 1 12:43:53 EDT 2006


Ravi Teja wrote:
>
> 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> web pages in general.

import libxml2dom
import urllib
f = urllib.urlopen("http://wiki.python.org/moin/")
s = f.read()
f.close()
# s contains HTML not XML text
d = libxml2dom.parseString(s, html=1)
# get the community-related links
for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
    print label.nodeValue

Of course, lxml should be able to do this kind of thing as well. I'd be
interested to know why this "is not a good idea", though.
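
For comparison, here is a minimal sketch of the same query using lxml,
written for Python 3. It is an illustration rather than a tested
equivalent of the libxml2dom code above: the SAMPLE markup and the
community_links helper are made up for this example, and the snippet
parses a local string instead of fetching the wiki page so that it
runs offline. The XPath expression is the one used above.

```python
from lxml import html  # third-party package: pip install lxml

# Hypothetical, deliberately sloppy markup: no <html>/<body> wrapper,
# unclosed <li> elements, and an unquoted attribute value.
SAMPLE = """
<ul>
  <li><a href="/community">Community</a>
    <ul>
      <li><a href="/lists">Mailing lists</a>
      <li><a href=/irc>IRC</a>
    </ul>
  <li><a href="/docs">Documentation</a>
</ul>
"""

# Same XPath expression as in the libxml2dom example.
XPATH = "//li[.//a/text() = 'Community']//li//a/text()"


def community_links(page_source):
    """Return the link labels nested under the 'Community' list item."""
    # The HTML parser repairs the markup into a tree before XPath runs.
    doc = html.document_fromstring(page_source)
    return [str(text).strip() for text in doc.xpath(XPATH)]


if __name__ == "__main__":
    for label in community_links(SAMPLE):
        print(label)
```

To try it against a live page, the bytes returned by
urllib.request.urlopen(url).read() can be passed to community_links
in place of SAMPLE.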

Paul



