Parsing HTML?

Paul Boddie paul at boddie.org.uk
Thu Apr 3 09:11:40 EDT 2008


On 3 Apr, 06:59, Benjamin <ben... at gmail.com> wrote:
> I'm trying to parse an HTML file.  I want to retrieve all of the text
> inside a certain tag that I find with XPath.  The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.

With libxml2dom you'd do the following:

 1. Parse the file using libxml2dom.parse with html set to a true
    value.
 2. Use the xpath method on the document to select the desired
    element.
 3. Use the toString method on the element to get the text of the
    element (including start and end tags), or the textContent
    property to get the text between the tags.
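
Put together, the three steps might look roughly like the sketch below. The function name, the file name and the XPath expression are made up for illustration, and I'm assuming libxml2dom is installed and that parse is given a filename.

```python
def select_markup_and_text(filename, expression):
    """Return (markup, text) pairs for each node matching an XPath expression.

    The function name is hypothetical; only parse, xpath, toString and
    textContent come from libxml2dom itself.
    """
    # Imported here so the sketch can be loaded without the package installed.
    import libxml2dom

    # Step 1: parse the file as HTML rather than XML.
    document = libxml2dom.parse(filename, html=1)

    results = []
    # Step 2: the xpath method returns a list of matching nodes.
    for element in document.xpath(expression):
        # Step 3: toString serialises the element with its start and end
        # tags; textContent holds only the text between the tags.
        results.append((element.toString(), element.textContent))
    return results

# Example call (placeholder file name and expression):
#   select_markup_and_text("page.html", "//div[@id='content']")
```

Since xpath returns a list, an expression that matches nothing should just give an empty result rather than an error.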

See the Package Index page for more details:

  http://www.python.org/pypi/libxml2dom

Paul


