Processing XML that's embedded in HTML

Mike Driscoll kyosohma at gmail.com
Wed Jan 23 10:49:14 EST 2008


John and Stefan,

On Jan 23, 5:33 am, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
> Hi,
>
> Mike Driscoll wrote:
> > I got lxml to create a tree by doing the following:
>
> > from lxml import etree
> > from StringIO import StringIO
>
> > parser = etree.HTMLParser()
> > tree = etree.parse(filename, parser)
> > xml_string = etree.tostring(tree)
> > context = etree.iterparse(StringIO(xml_string))
>
> No idea why you need the two steps here. lxml 2.0 supports parsing HTML in
> iterparse() directly when you pass the boolean "html" keyword.


I don't know why I have two steps either, now that I look at it.
I don't do enough XML parsing to be really familiar with the
ins and outs of parsing in Python, so it's mainly just my
inexperience. I also got lost in the lxml tutorials...
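
For reference, a minimal sketch of the one-step version Stefan
describes might look like the following. It is written in Python 3
syntax and assumes an lxml release where iterparse() accepts
html=True. The sample markup and the rows_from_html() helper are made
up for illustration and are not from my real file. The HTML parser
normally lowercases tag names, so the comparison is done on the
lowercased tag.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample: XML-style rows sitting inside an HTML page.
SAMPLE = b"""<html><body><div>
<Row><relationship>friend</relationship><StartDate>2008-01-01</StartDate><Priority>2</Priority></Row>
<Row><relationship>colleague</relationship><StartDate>2008-01-15</StartDate><Priority>5</Priority></Row>
</div></body></html>"""


def rows_from_html(source):
    """Yield one dict per row, mapping lowercased child tag to its text."""
    # html=True makes iterparse() use the HTML parser directly, so the
    # tostring()/StringIO round trip is not needed.
    for _event, elem in etree.iterparse(source, events=("end",), html=True):
        if elem.tag.lower() != "row":
            continue
        # An "end" event means the row's children are fully parsed.
        yield {child.tag.lower(): (child.text or "").strip() for child in elem}


rows = list(rows_from_html(BytesIO(SAMPLE)))
for row in rows:
    print(row["relationship"], row["startdate"], row["priority"])
```

The "do something" part of my loop is then just reading the children
of each finished row element.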

>
> > However, when I iterate over the contents of "context", I can't figure
> > out how to nab the row's contents:
>
> > for action, elem in context:
> >     if action == 'end' and elem.tag == 'relationship':
> >         # do something...but what!?
> >         # this if statement probably isn't even right
>
> I would really encourage you to use the normal parser here instead of iterparse().
>
>   from lxml import etree
>   parser = etree.HTMLParser()
>
>   # parse the HTML/XML melange
>   tree = etree.parse(filename, parser)
>
>   # if you want, you can construct a pure XML document
>   # (the HTML parser lowercases tag names, hence "row")
>   row_root = etree.Element("newroot")
>   for row in tree.findall(".//row"):
>       row_root.append(row)
>
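
A small self-contained sketch of that "pure XML document" step, in
Python 3 syntax, could look like this. The sample markup is invented,
and the row.tail reset is my own assumption about wanting clean
output. append() moves each row out of the HTML tree, which is why the
rows are collected into a list first.

```python
from lxml import etree

# Hypothetical sample markup, for illustration only.
SAMPLE = b"""<html><body><div>
<Row><relationship>friend</relationship></Row>
<Row><relationship>colleague</relationship></Row>
</div></body></html>"""

root = etree.fromstring(SAMPLE, etree.HTMLParser())

row_root = etree.Element("newroot")
# findall() returns a list, so moving rows out of the HTML tree cannot
# disturb the search that collected them.
for row in root.findall(".//row"):
    row.tail = None  # drop the whitespace that followed the row in the HTML
    row_root.append(row)

xml_bytes = etree.tostring(row_root)
print(xml_bytes.decode("ascii"))
```
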
> In your specific case, I'd encourage using lxml.objectify:
>
> http://codespeak.net/lxml/dev/objectify.html
>
> It will allow you to do this (untested):
>
>   from lxml import etree, objectify
>   parser = etree.HTMLParser()
>   lookup = objectify.ObjectifyElementClassLookup()
>   parser.set_element_class_lookup(lookup)
>
>   tree = etree.parse(filename, parser)
>
>   # the HTML parser lowercases tag names
>   for row in tree.iterfind(".//row"):
>       print row.relationship, row.startdate, row.priority * 2.7
>
> Stefan

I'll give your ideas a go and also see if what the others posted will
be cleaner or faster.
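
A first attempt at the objectify route might look roughly like the
sketch below (Python 3 syntax). The sample markup is invented, and the
tag names are lowercase on the assumption that the HTML parser
normalises them. Treat it as a starting point, not a tested solution
for my real data.

```python
from lxml import etree, objectify

# Hypothetical sample markup, for illustration only.
SAMPLE = b"""<html><body><div>
<Row><relationship>friend</relationship><StartDate>2008-01-01</StartDate><Priority>2</Priority></Row>
</div></body></html>"""

# Make the HTML parser build objectify elements instead of plain ones.
parser = etree.HTMLParser()
parser.set_element_class_lookup(objectify.ObjectifyElementClassLookup())

root = etree.fromstring(SAMPLE, parser)

results = []
for row in root.iterfind(".//row"):
    # Children are reachable as attributes, and text that looks numeric
    # behaves like a number in arithmetic.
    results.append(
        (row.relationship.text, row.startdate.text, row.priority * 2.7)
    )

print(results)
```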

Thank you all.

Mike


