Extracting xml from html

Stefan Behnel stefan.behnel-n05pAM at web.de
Wed Sep 19 04:13:24 EDT 2007


kyosohma at gmail.com wrote:
> Does this make sense? It works pretty well, but I don't really
> understand everything that I'm doing.
>
> def Parser(filename):

It's uncommon to give a function a capitalised name, unless it's a factory
function (which this isn't).


>     parser = etree.HTMLParser()
>     tree = etree.parse(r'path/to/nextpage.htm', parser)
>     xml_string = etree.tostring(tree)

What you do here is parse the HTML page and serialise it back into an XML
string. There is no need for that: once you have a tree, you can work with it
directly. lxml is a highly integrated set of tools, whether you use it for XML
or HTML.
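
To illustrate working on the parsed tree directly, here is a minimal sketch.
The page content is not shown in this thread, so SAMPLE_HTML is a made-up
stand-in, and parsing from BytesIO instead of a file path is only there to
keep the example self-contained.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for the real page, which is not shown in the thread.
SAMPLE_HTML = (
    b"<html><body><Row>"
    b"<recordnum>42</recordnum>"
    b"<primaryowner>Jane Doe</primaryowner>"
    b"<customeraddress>1 Main St</customeraddress>"
    b"</Row></body></html>"
)

parser = etree.HTMLParser()
tree = etree.parse(BytesIO(SAMPLE_HTML), parser)

# Walk the parsed tree as-is; no tostring() and re-parse step is involved.
tags = [elem.tag for elem in tree.iter()]
print(tags)
```

Note that the HTML parser reports tag names in lower case, so the "Row"
element from the input shows up as "row" in the tree.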


>     events = ("recordnum", "primaryowner", "customeraddress")

You're not using this anywhere below, so I assume this is left-over code.


>     context = etree.iterparse(StringIO(xml_string), tag='')
>     for action, elem in context:
>         tag = elem.tag
>         if tag == 'primaryowner':
>             owner = elem.text
>         elif tag == 'customeraddress':
>             address = elem.text
>         else:
>             pass
> 
>     print 'Primary Owner: %s' % owner
>     print 'Address: %s' % address

Admittedly, iterparse() doesn't currently support HTML (although this might
become possible in lxml 2.0).
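
For input that is well-formed XML, an iterparse() loop along these lines
should work. This is only a sketch: SAMPLE_XML and the "wanted" tuple are
invented for illustration, and the tag filtering is done by hand in the loop
rather than through the tag keyword.

```python
from io import BytesIO
from lxml import etree

# Hypothetical well-formed XML input; not the data from the original post.
SAMPLE_XML = (
    b"<Row>"
    b"<recordnum>42</recordnum>"
    b"<primaryowner>Jane Doe</primaryowner>"
    b"<customeraddress>1 Main St</customeraddress>"
    b"</Row>"
)

wanted = ("primaryowner", "customeraddress")
found = {}

# "end" events fire once an element is complete, so elem.text is available.
for event, elem in etree.iterparse(BytesIO(SAMPLE_XML), events=("end",)):
    if elem.tag in wanted:
        found[elem.tag] = elem.text

print(found)
```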

You could do this more easily in a couple of ways. One is to use XPath:

   print [el.text for el in tree.xpath("//primaryowner|//customeraddress")]

Note that this works directly on the tree that you retrieved right in the
third line of your code.
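
As a self-contained sketch of the XPath approach (SAMPLE_HTML is a made-up
stand-in for the real page, which is not shown here):

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for the real page.
SAMPLE_HTML = (
    b"<html><body><Row>"
    b"<recordnum>42</recordnum>"
    b"<primaryowner>Jane Doe</primaryowner>"
    b"<customeraddress>1 Main St</customeraddress>"
    b"</Row></body></html>"
)

tree = etree.parse(BytesIO(SAMPLE_HTML), etree.HTMLParser())

# One XPath union expression selects both kinds of element anywhere in the tree.
values = [el.text for el in tree.xpath("//primaryowner|//customeraddress")]
print(values)
```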

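Another (and likely simpler) solution is to first find the "Row" element and
then start from that. The HTML parser lowercases tag names, so the element is
matched as "row", and the search path is written relative to the root:

   row = tree.find(".//row")
   print row.findtext("primaryowner")
   print row.findtext("customeraddress")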
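
Wrapped into a function, this might look like the sketch below. The name
extract_owner and the SAMPLE_HTML data are invented for illustration, and the
code assumes that both fields are direct children of the row element.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for the real page.
SAMPLE_HTML = (
    b"<html><body><Row>"
    b"<recordnum>42</recordnum>"
    b"<primaryowner>Jane Doe</primaryowner>"
    b"<customeraddress>1 Main St</customeraddress>"
    b"</Row></body></html>"
)


def extract_owner(tree):
    """Return (owner, address) from the first row element, or None."""
    # The HTML parser lowercases tag names, hence "row" rather than "Row".
    row = tree.find(".//row")
    if row is None:
        return None
    # findtext() looks at direct children of row only.
    return row.findtext("primaryowner"), row.findtext("customeraddress")


tree = etree.parse(BytesIO(SAMPLE_HTML), etree.HTMLParser())
result = extract_owner(tree)
print(result)
```

Returning None when no row is found keeps the caller from hitting an
AttributeError on pages that lack the element.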

See the lxml tutorial on this, as well as the documentation on XPath support
and tree iteration:

http://codespeak.net/lxml/xpathxslt.html#xpath
http://codespeak.net/lxml/api.html#iteration

Hope this helps,
Stefan


