Processing XML that's embedded in HTML

John Machin sjmachin at lexicon.net
Tue Jan 22 16:42:13 EST 2008


On Jan 23, 7:48 am, Mike Driscoll <kyoso... at gmail.com> wrote:
[snip]

> I'm not sure what is wrong here...but I got lxml to create a tree by
> doing the following:
>
> <code>
> from lxml import etree
> from StringIO import StringIO
>
> parser = etree.HTMLParser()
> tree = etree.parse(filename, parser)
> xml_string = etree.tostring(tree)
> context = etree.iterparse(StringIO(xml_string))
> </code>
>
> However, when I iterate over the contents of "context", I can't figure
> out how to nab the row's contents:
>
> for action, elem in context:
>     if action == 'end' and elem.tag == 'relationship':
>         # do something...but what!?
>         # this if statement probably isn't even right
>         pass  # placeholder so the if block is syntactically valid
>

lxml allegedly supports the ElementTree interface, so I would expect
elem.text to hold the element's text content. Sure enough:
http://codespeak.net/lxml/tutorial.html#elements-contain-text
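
A rough sketch of what that looks like inside the loop. It uses the
standard library's xml.etree.ElementTree rather than lxml, on the
assumption that lxml.etree.iterparse behaves the same way for this
simple case. The sample data and the relationship_texts name are made
up for illustration, since I haven't seen your actual file:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real document's structure is an assumption.
SAMPLE = (
    "<root>"
    "<relationship>parent</relationship>"
    "<relationship>sibling</relationship>"
    "</root>"
)


def relationship_texts(xml_string):
    """Collect the text of every <relationship> element seen by iterparse."""
    texts = []
    source = io.BytesIO(xml_string.encode("utf-8"))
    # Only 'end' events are requested, so elem.text is fully populated
    # by the time each element is handed to us.
    for action, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "relationship":
            texts.append(elem.text)
    return texts


print(relationship_texts(SAMPLE))
```

If the row's data lives in attributes or child elements rather than in
the text, elem.get(...) or a loop over elem's children would be the
place to look instead.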

Why do you want/need to use the iterparse technique on the 2nd pass
instead of creating another tree and then using getiterator?
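
For comparison, a minimal sketch of the tree-based alternative, again
with the standard library's ElementTree standing in for lxml. The
sample data, the "id" attribute, and the relationship_rows name are
assumptions for illustration only. Note that getiterator is the older
spelling; current ElementTree versions call it iter():

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the "id" attribute is invented for the example.
SAMPLE_XML = (
    "<root>"
    "<relationship id='1'>parent</relationship>"
    "<other>ignored</other>"
    "<relationship id='2'>sibling</relationship>"
    "</root>"
)


def relationship_rows(xml_string):
    """Parse the whole string into a tree, then walk the matching elements."""
    root = ET.fromstring(xml_string)
    # iter("relationship") visits every element with that tag, in document
    # order, at any depth below root.
    return [(elem.get("id"), elem.text) for elem in root.iter("relationship")]


print(relationship_rows(SAMPLE_XML))
```

With the whole tree already in memory there is no need to serialise it
to a string and parse it a second time.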


