Processing XML that's embedded in HTML
Mike Driscoll
kyosohma at gmail.com
Wed Jan 23 10:49:14 EST 2008
John and Stefan,
On Jan 23, 5:33 am, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
> Hi,
>
> Mike Driscoll wrote:
> > I got lxml to create a tree by doing the following:
>
> > from lxml import etree
> > from StringIO import StringIO
>
> > parser = etree.HTMLParser()
> > tree = etree.parse(filename, parser)
> > xml_string = etree.tostring(tree)
> > context = etree.iterparse(StringIO(xml_string))
>
> No idea why you need the two steps here. lxml 2.0 supports parsing HTML in
> iterparse() directly when you pass the boolean "html" keyword.
I don't know why I have two steps either, now that I look at it.
I don't do enough XML parsing to be really familiar with the ins and
outs of parsing in Python, so it's mainly just my inexperience. I also
got lost in the lxml tutorials...
>
> > However, when I iterate over the contents of "context", I can't figure
> > out how to nab the row's contents:
>
> > for action, elem in context:
> >     if action == 'end' and elem.tag == 'relationship':
> >         # do something...but what!?
> >         # this if statement probably isn't even right
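For reference, here is a minimal sketch of the one-step iterparse() form mentioned above, which also shows one way to get at a row's contents. It assumes Python 3 and a current lxml; the SAMPLE data and the iter_rows() helper are made up for illustration and are not from the original file. Note that the HTML parser reports tag names in lowercase, so the check is against "row" rather than "Row".

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for the HTML page with embedded XML rows.
SAMPLE = b"""<html><body>
<Row><relationship>owner</relationship><StartDate>2008-01-23</StartDate><Priority>2</Priority></Row>
<Row><relationship>renter</relationship><StartDate>2008-02-01</StartDate><Priority>5</Priority></Row>
</body></html>"""


def iter_rows(source):
    """Yield one dict per row, mapping child tag names to their text."""
    # html=True makes iterparse() use the HTML parser directly, so the
    # parse / tostring / re-parse round trip is not needed.
    for action, elem in etree.iterparse(source, events=("end",), html=True):
        # By the time the "end" event fires, all children have been parsed.
        if elem.tag == "row":
            yield {child.tag: child.text for child in elem}


rows = list(iter_rows(BytesIO(SAMPLE)))
```

Waiting for the "end" event of the enclosing row, rather than of each child, keeps the fields of one row together.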
>
> I would really encourage you to use the normal parser here instead of iterparse().
>
> from lxml import etree
> parser = etree.HTMLParser()
>
> # parse the HTML/XML melange
> tree = etree.parse(filename, parser)
>
> # if you want, you can construct a pure XML document
> row_root = etree.Element("newroot")
> for row in tree.iterfind("//Row"):
>     row_root.append(row)
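A runnable sketch of the same idea, assuming Python 3 and a current lxml. The SAMPLE data is hypothetical. Two details differ from the snippet above and are my assumptions about what works in practice: the HTML parser lowercases tag names, so the search is for "row", and findall() is used so the list of rows is complete before any of them is moved.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for the HTML page with embedded XML rows.
SAMPLE = b"""<html><body>
<Row><relationship>owner</relationship><Priority>2</Priority></Row>
<Row><relationship>renter</relationship><Priority>5</Priority></Row>
</body></html>"""

parser = etree.HTMLParser()
tree = etree.parse(BytesIO(SAMPLE), parser)

# Collect the rows under a fresh root to get a pure XML document.
row_root = etree.Element("newroot")
for row in tree.findall(".//row"):
    # append() moves the element out of the HTML tree, it does not copy it.
    row_root.append(row)

xml_text = etree.tostring(row_root, encoding="unicode")
```

Because append() moves elements, the original tree no longer contains the rows afterwards; copy.deepcopy(row) could be appended instead if the HTML tree has to stay intact.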
>
> In your specific case, I'd encourage using lxml.objectify:
>
> http://codespeak.net/lxml/dev/objectify.html
>
> It will allow you to do this (untested):
>
> from lxml import etree, objectify
> parser = etree.HTMLParser()
> lookup = objectify.ObjectifyElementClassLookup()
> parser.setElementClassLookup(lookup)
>
> tree = etree.parse(filename, parser)
>
> for row in tree.iterfind("//Row"):
>     print row.relationship, row.StartDate, row.Priority * 2.7
>
> Stefan
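A sketch of the objectify variant, assuming Python 3 and a current lxml, where the parser method is spelled set_element_class_lookup(). The SAMPLE data is hypothetical, and the child names are lowercase because the HTML parser lowercases tags. Like the quoted snippet, treat this as a starting point rather than tested advice for the real file.

```python
from io import BytesIO
from lxml import etree, objectify

# Hypothetical stand-in for the HTML page with embedded XML rows.
SAMPLE = b"""<html><body>
<Row><relationship>owner</relationship><StartDate>2008-01-23</StartDate><Priority>2</Priority></Row>
<Row><relationship>renter</relationship><StartDate>2008-02-01</StartDate><Priority>5</Priority></Row>
</body></html>"""

parser = etree.HTMLParser()
# Make the parser build objectify elements instead of plain ones.
parser.set_element_class_lookup(objectify.ObjectifyElementClassLookup())
tree = etree.parse(BytesIO(SAMPLE), parser)

summary = []
for row in tree.findall(".//row"):
    # Children are reachable as attributes; numeric text behaves like a number.
    summary.append((row.relationship.text, row.startdate.text, row.priority * 2.7))
```

The attraction here is that row.priority can be used in arithmetic without an explicit int() or float() conversion.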
I'll give your ideas a go and also see if what the others posted will
be cleaner or faster.
Thank you all.
Mike