Extracting XML from HTML

Stefan Behnel stefan.behnel-n05pAM at web.de
Tue Sep 18 02:56:30 EDT 2007


kyosohma at gmail.com wrote:
> I am attempting to extract some XML from an HTML document that I get
> returned from a form based web page. For some reason, I cannot figure
> out how to do this.
> Here's a sample of the html:
> 
> <html>
> <body>
> lots of screwy text including divs and spans
> <Row status="o">
>     <RecordNum>1126264</RecordNum>
>     <Make>Mitsubishi</Make>
>     <Model>Mirage DE</Model>
> </Row>
> </body>
> </html>
> 
> What's the best way to get at the XML? Do I need to somehow parse it
> using the HTMLParser and then parse that with minidom or what?

lxml makes this pretty easy:

   >>> from lxml import etree
   >>> parser = etree.HTMLParser()
   >>> tree = etree.parse(the_file_or_url, parser)

The result is an ordinary lxml tree, so you can treat it as XML: XPath, XSLT,
tree iteration and so on all work. Serialising the tree also gives you plain
XML:

   >>> xml_string = etree.tostring(tree)
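
For the sample from the question, pulling out the <Row> data could look
roughly like the sketch below. The inline string just stands in for
the_file_or_url, and name() is a helper made up for this example. The HTML
parser normally lowercases tag names, so the comparison is done
case-insensitively to stay on the safe side.

```python
from lxml import etree

# Sample page from the question, inlined so the sketch is self-contained.
html = """<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
    <RecordNum>1126264</RecordNum>
    <Make>Mitsubishi</Make>
    <Model>Mirage DE</Model>
</Row>
</body>
</html>"""

root = etree.fromstring(html, etree.HTMLParser())


def name(el):
    """Hypothetical helper: lowercased tag name, or None for comments and PIs."""
    return el.tag.lower() if isinstance(el.tag, str) else None


# Walk the whole tree and keep every element named "row", whatever its case.
rows = [el for el in root.iter() if name(el) == "row"]

# One dict per row: the status attribute plus the text of each child element.
records = [
    {
        "status": row.get("status"),
        **{name(child): child.text for child in row if name(child)},
    }
    for row in rows
]
print(records)
```

If you only need the raw XML of each row, etree.tostring(row) on the elements
in rows should do.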

Note that this doesn't add any namespaces, so you won't magically get valid
XHTML out of it. You could rewrite the tags by hand, though.
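
One way to do that rewriting, as a rough sketch: walk the tree and move every
element into the XHTML namespace. The XHTML_NS constant and the tiny inline
document are only there for illustration. This does not make the result valid
XHTML by itself (a non-HTML element like <Row> would be moved as well), and the
serialiser may pick a generated prefix instead of a default xmlns declaration.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Small stand-in document; in practice this would be the parsed page.
root = etree.fromstring(
    "<html><body><p>hello</p></body></html>", etree.HTMLParser()
)

# Collect the elements first, then rename each one into the XHTML namespace.
# Comments and processing instructions have a non-string tag and are skipped.
for el in list(root.iter()):
    if isinstance(el.tag, str) and not el.tag.startswith("{"):
        el.tag = "{%s}%s" % (XHTML_NS, el.tag)

xhtml = etree.tostring(root, encoding="unicode")
print(xhtml)
```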

Stefan


