HTML parsing confusion
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Wed Jan 23 16:14:00 EST 2008
On Wed, 23 Jan 2008 10:40:14 -0200, Alnilam <alnilam at gmail.com> wrote:
> Skipping past html validation, and html to xhtml 'cleaning', and
> instead starting with the assumption that I have files that are valid
> XHTML, can anyone give me a good example of how I would use _ htmllib,
> HTMLParser, or ElementTree _ to parse out the text of one specific
> childNode, similar to the examples that I provided above using regex?
The diveintopython page is not valid XHTML (but it's valid HTML). Assuming
it's properly converted:
py> from cStringIO import StringIO
py> import xml.etree.ElementTree as ET
py> tree = ET.parse(StringIO(page))
py> elem = tree.findall('.//p')[4]
py>
py> # from the online ElementTree docs
py> # http://www.effbot.org/zone/element-bits-and-pieces.htm
py> def gettext(elem):
...     text = elem.text or ""
...     for e in elem:
...         text += gettext(e)
...         if e.tail:
...             text += e.tail
...     return text
...
py> print gettext(elem)
The complete text is available online. You can read the revision history
to see
what's new. Updated 20 May 2004
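
For anyone who wants to try this without the original page, here is a minimal self-contained sketch of the same idea, written for Python 3. The `page` string is a made-up stand-in, not the real diveintopython markup, and the names `paragraphs`, `second` and `result` are only illustrative. It also assumes the document has no default namespace; a real XHTML file normally declares xmlns="http://www.w3.org/1999/xhtml", in which case the tag to search for would be '{http://www.w3.org/1999/xhtml}p' rather than plain 'p'.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document (assumption: well-formed, no XML namespace).
page = """<html>
  <body>
    <p>First paragraph.</p>
    <p>The <b>complete</b> text is <a href="#">available online</a>. Updated.</p>
  </body>
</html>"""


def gettext(elem):
    """Collect the text of elem and all of its descendants, in document order."""
    text = elem.text or ""
    for e in elem:
        text += gettext(e)      # text inside the child element
        if e.tail:
            text += e.tail      # text that follows the child's closing tag
    return text


tree = ET.parse(io.StringIO(page))
paragraphs = tree.findall(".//p")   # every <p> element anywhere in the tree
second = paragraphs[1]              # pick one specific node by position
result = gettext(second)
print(result)
```

On Python 2.7 and 3.2 or later, "".join(second.itertext()) should give the same string without the helper function, so gettext is mostly useful for seeing how text and tail fit together.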
--
Gabriel Genellina