Parsing HTML?

Benjamin benash at gmail.com
Sat Apr 26 17:26:04 EDT 2008


On Apr 6, 11:03 pm, Stefan Behnel <stefan... at behnel.de> wrote:
> Benjamin wrote:
> > I'm trying to parse an HTML file.  I want to retrieve all of the text
> > inside a certain tag that I find with XPath.  The DOM seems to make
> > this available with the innerHTML element, but I haven't found a way
> > to do it in Python.
>
>     import lxml.html as h
>     tree = h.parse("somefile.html")
>     text = tree.xpath("string( some/element[@condition] )")
>
> http://codespeak.net/lxml
>
> Stefan

I actually had trouble getting this to work.  I guess only newer
versions of lxml have the html module, and I couldn't get it installed.
lxml does look pretty cool, though.
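
For anyone stuck in the same spot without lxml, here is a rough sketch of
a fallback that uses only the standard library.  It is not equivalent to
Stefan's XPath approach: it picks elements by tag name only, and it
assumes Python 3's html.parser module.  The names TagTextCollector and
text_inside are made up for this example.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while we are inside the target tag
        self.chunks = []    # text pieces seen while depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text of nested child tags is included, much like XPath string().
        if self.depth > 0:
            self.chunks.append(data)


def text_inside(html, tag):
    """Return the concatenated text inside all <tag> elements of html."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


# Hypothetical sample document, only for illustration.
sample = "<html><body><p>skip</p><div id='x'>Hello <b>big</b> world</div></body></html>"
print(text_inside(sample, "div"))
```

This only matches on the tag name, so anything that needs a real
condition (attributes, position, and so on) would still be better served
by XPath once lxml is installed.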


