HTML parsing confusion

Wed Jan 23 16:14:00 EST 2008

On Wed, 23 Jan 2008 10:40:14 -0200, Alnilam <alnilam at gmail.com> wrote:

> Skipping past html validation, and html to xhtml 'cleaning', and
> instead starting with the assumption that I have files that are valid
> XHTML, can anyone give me a good example of how I would use _ htmllib,
> HTMLParser, or ElementTree _ to parse out the text of one specific
> childNode, similar to the examples that I provided above using regex?

The diveintopython page is not valid XHTML (though it is valid HTML). Assuming
it has been properly converted:

py> from cStringIO import StringIO
py> import xml.etree.ElementTree as ET
py> tree = ET.parse(StringIO(page))
py> elem = tree.findall('.//p')[4]   # fifth <p> element anywhere in the tree
py>
py> # from the online ElementTree docs:
py> # http://www.effbot.org/zone/element-bits-and-pieces.htm
py> def gettext(elem):
...     text = elem.text or ""
...     for e in elem:
...         text += gettext(e)
...         if e.tail:
...             text += e.tail
...     return text
...
py> print gettext(elem)
The complete text is available online.  You can read the revision history to see
  what's new. Updated 20 May 2004
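
For anyone trying this on a current Python 3, here is a minimal self-contained
sketch of the same idea. The `page` string is a made-up stand-in for a properly
converted XHTML page, not the real diveintopython markup, and the `XHTML`
constant is just a helper name chosen for this example. Note that when the
document declares the XHTML namespace, ElementTree includes it in every tag
name, so the search path has to include it too.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a page already converted to XHTML.
page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>First paragraph.</p>
    <p>You can read the <a href="history.html">revision history</a> to see
    what's new. <i>Updated 20 May 2004</i></p>
  </body>
</html>"""

# ElementTree spells namespaced tags as "{namespace-uri}localname".
XHTML = "{http://www.w3.org/1999/xhtml}"

def gettext(elem):
    """Return all text inside elem, including the text of nested children."""
    text = elem.text or ""
    for child in elem:
        text += gettext(child)
        if child.tail:
            text += child.tail
    return text

root = ET.fromstring(page)
paragraphs = root.findall(".//" + XHTML + "p")
second = paragraphs[1]
print(gettext(second))

# Element.itertext() walks the same text pieces in document order.
assert gettext(second) == "".join(second.itertext())
```

If the file has no namespace declaration, the plain './/p' path is enough. In
newer versions "".join(elem.itertext()) can replace the recursive helper.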

-- 
Gabriel Genellina