Processing XML that's embedded in HTML
Stefan Behnel
stefan.behnel-n05pAM at web.de
Wed Jan 23 06:33:56 EST 2008
Hi,
Mike Driscoll wrote:
> I got lxml to create a tree by doing the following:
>
> from lxml import etree
> from StringIO import StringIO
>
> parser = etree.HTMLParser()
> tree = etree.parse(filename, parser)
> xml_string = etree.tostring(tree)
> context = etree.iterparse(StringIO(xml_string))
No idea why you need the two steps here. lxml 2.0 supports parsing HTML in
iterparse() directly when you pass the boolean "html" keyword.
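As a rough sketch of that one-step approach (written for Python 3 and a current lxml; the sample markup and the helper name relationships() are made up for illustration), iterparse() can read the HTML directly. Note that the HTML parser lowercases tag names, so the comparison uses lowercase.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for the HTML file with embedded XML rows.
SAMPLE = (b"<html><body><Row><relationship>Owner</relationship>"
          b"<Priority>3</Priority></Row></body></html>")

def relationships(source):
    """Collect the text of every <relationship> element, parsing HTML directly."""
    found = []
    # html=True selects the HTML parser; only "end" events are requested,
    # so each element is complete when we see it.
    for _event, elem in etree.iterparse(source, events=("end",), html=True):
        if elem.tag == "relationship":
            found.append(elem.text)
    return found

print(relationships(BytesIO(SAMPLE)))
```

Passing a file name instead of the BytesIO object should work the same way.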
> However, when I iterate over the contents of "context", I can't figure
> out how to nab the row's contents:
>
> for action, elem in context:
>     if action == 'end' and elem.tag == 'relationship':
>         # do something...but what!?
>         # this if statement probably isn't even right
I would really encourage you to use the normal parser here instead of iterparse().
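from lxml import etree
parser = etree.HTMLParser()
# parse the HTML/XML melange
tree = etree.parse(filename, parser)
# if you want, you can construct a pure XML document
# (the HTML parser lowercases tag names, so <Row> is found as "row")
row_root = etree.Element("newroot")
# list() collects the rows first, since append() moves them out of the tree
for row in list(tree.iterfind("//row")):
    row_root.append(row)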
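To cover the "do something" part without iterparse(), a minimal sketch along these lines reads each row's children with findtext(). It assumes Python 3, an invented sample document, a hypothetical helper name read_rows(), and child element names borrowed from the objectify example in this mail. The tags are looked up in lowercase because the HTML parser lowercases names.

```python
from lxml import etree

# Hypothetical sample; the real input would come from etree.parse(filename, parser).
SAMPLE = """<html><body><p>Some text</p>
<Row><relationship>Owner</relationship><StartDate>2008-01-23</StartDate><Priority>3</Priority></Row>
<Row><relationship>Guest</relationship><StartDate>2008-01-24</StartDate><Priority>1</Priority></Row>
</body></html>"""

def read_rows(html_text):
    """Return one dict per row holding the text of its child elements."""
    root = etree.fromstring(html_text, etree.HTMLParser())
    rows = []
    for row in root.iterfind(".//row"):
        rows.append({
            "relationship": row.findtext("relationship"),
            "startdate": row.findtext("startdate"),
            # findtext() returns a string, so convert numbers explicitly
            "priority": int(row.findtext("priority")),
        })
    return rows

for entry in read_rows(SAMPLE):
    print(entry)
```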
In your specific case, I'd encourage using lxml.objectify:
http://codespeak.net/lxml/dev/objectify.html
It will allow you to do this (untested):
from lxml import etree, objectify
parser = etree.HTMLParser()
lookup = objectify.ObjectifyElementClassLookup()
parser.set_element_class_lookup(lookup)
tree = etree.parse(filename, parser)
# the HTML parser lowercases tag names, hence "row", "startdate", "priority"
for row in tree.iterfind("//row"):
    print(row.relationship, row.startdate, row.priority * 2.7)
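For comparison, here is a small self-contained sketch of the same objectify access pattern. It uses objectify.fromstring() on a hypothetical, already well-formed XML snippet (so the original tag case is preserved) rather than on the HTML file. It assumes Python 3; the sample data and the helper name weighted_rows() are invented.

```python
from lxml import objectify

# Hypothetical pure-XML input, e.g. rows extracted from the HTML beforehand.
SAMPLE = b"""<newroot>
  <Row><relationship>Owner</relationship><StartDate>2008-01-23</StartDate><Priority>3</Priority></Row>
  <Row><relationship>Guest</relationship><StartDate>2008-01-24</StartDate><Priority>1</Priority></Row>
</newroot>"""

def weighted_rows(xml_bytes, factor=2.7):
    """Return (relationship, start date, priority * factor) for each row."""
    root = objectify.fromstring(xml_bytes)
    result = []
    # root.Row refers to the first <Row>; iterating it yields all <Row> siblings.
    for row in root.Row:
        # Priority holds an integer, so objectify lets it take part in arithmetic.
        result.append((row.relationship.text, row.StartDate.text, row.Priority * factor))
    return result

for item in weighted_rows(SAMPLE):
    print(item)
```

Child elements are reached as attributes, and numeric leaf values behave like Python numbers, which is what makes row.Priority * 2.7 work.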
Stefan