Processing XML that's embedded in HTML

Wed Jan 23 06:33:56 EST 2008

Hi,

Mike Driscoll wrote:
> I got lxml to create a tree by doing the following:
> 
> from lxml import etree
> from StringIO import StringIO
> 
> parser = etree.HTMLParser()
> tree = etree.parse(filename, parser)
> xml_string = etree.tostring(tree)
> context = etree.iterparse(StringIO(xml_string))

I don't see why you need the two steps here. lxml 2.0 supports parsing HTML in
iterparse() directly when you pass the keyword argument html=True.
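
For illustration, here is a minimal sketch of the single-step variant. The
sample markup and the relationships() helper are made up for this example,
since I don't know what your file looks like. Note that the HTML parser
lowercases tag names.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample standing in for the HTML/XML mix in the real file.
SAMPLE = b"""<html><body>
<Row><relationship>Owner</relationship><Priority>2</Priority></Row>
<Row><relationship>Renter</relationship><Priority>5</Priority></Row>
</body></html>"""

def relationships(source):
    """Collect the text of every <relationship> element in a single pass."""
    found = []
    # html=True makes iterparse() use the HTML parser, so no intermediate
    # tostring()/StringIO round trip is needed.
    for event, elem in etree.iterparse(source, events=("end",), html=True):
        if elem.tag == "relationship":
            found.append(elem.text)
    return found

result = relationships(BytesIO(SAMPLE))
print(result)
```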

> However, when I iterate over the contents of "context", I can't figure
> out how to nab the row's contents:
> 
> for action, elem in context:
>     if action == 'end' and elem.tag == 'relationship':
>         # do something...but what!?
>         # this if statement probably isn't even right

I would really encourage you to use the normal parser here instead of iterparse().

  from lxml import etree
  parser = etree.HTMLParser()

  # parse the HTML/XML melange
  tree = etree.parse(filename, parser)

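  # if you want, you can construct a pure XML document
  # (the HTML parser lowercases tag names, so <Row> becomes "row";
  # findall() returns a list, which is safe to loop over while moving rows)
  row_root = etree.Element("newroot")
  for row in tree.findall(".//row"):
      row_root.append(row)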
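
A self-contained sketch of the same idea, run on an inline string instead of
a file. The sample markup is an assumption made for this example. It shows
that append() moves the rows out of the HTML tree, leaving a document you can
serialise as plain XML.

```python
from lxml import etree

# Hypothetical input: XML rows sitting inside an HTML page.
SAMPLE = (
    "<html><body><p>Report</p>"
    "<Row><relationship>Owner</relationship></Row>"
    "<Row><relationship>Renter</relationship></Row>"
    "</body></html>"
)

html_root = etree.fromstring(SAMPLE, etree.HTMLParser())

new_root = etree.Element("newroot")
# findall() builds the full list first, so moving elements is safe.
for row in html_root.findall(".//row"):
    new_root.append(row)  # append() moves the element, it does not copy it

xml_text = etree.tostring(new_root, encoding="unicode")
remaining = html_root.findall(".//row")  # rows left behind in the HTML tree
print(xml_text)
```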

In your specific case, I'd encourage using lxml.objectify:

http://codespeak.net/lxml/dev/objectify.html

It will allow you to do this (untested):

  from lxml import etree, objectify
  parser = etree.HTMLParser()
  lookup = objectify.ObjectifyElementClassLookup()
  parser.set_element_class_lookup(lookup)

  tree = etree.parse(filename, parser)

  # the HTML parser lowercases tag names: "row", "startdate", "priority"
  for row in tree.iterfind(".//row"):
      print(row.relationship, row.startdate, row.priority * 2.7)
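
To show what objectify's type guessing does with the values, here is a small
sketch on inline data. The sample markup and the child names other than
"relationship" are assumptions based on the snippet above, not taken from
your actual file.

```python
from lxml import etree, objectify

# Hypothetical rows; the real document's fields may differ.
SAMPLE = (
    "<html><body>"
    "<Row><relationship>Owner</relationship><Priority>2</Priority></Row>"
    "<Row><relationship>Renter</relationship><Priority>5</Priority></Row>"
    "</body></html>"
)

parser = etree.HTMLParser()
parser.set_element_class_lookup(objectify.ObjectifyElementClassLookup())
root = etree.fromstring(SAMPLE, parser)

summary = []
for row in root.findall(".//row"):
    # Child elements are plain attributes. Numeric text is exposed as a
    # number, so arithmetic works without an explicit int() conversion.
    summary.append((str(row.relationship), row.priority * 2.7))

print(summary)
```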

Stefan