Extracting xml from html

Tue Sep 18 02:56:30 EDT 2007

kyosohma at gmail.com wrote:
> I am attempting to extract some XML from an HTML document that I get
> returned from a form based web page. For some reason, I cannot figure
> out how to do this.
> Here's a sample of the html:
> 
> <html>
> <body>
> lots of screwy text including divs and spans
> <Row status="o">
>     <RecordNum>1126264</RecordNum>
>     <Make>Mitsubishi</Make>
>     <Model>Mirage DE</Model>
> </Row>
> </body>
> </html>
> 
> What's the best way to get at the XML? Do I need to somehow parse it
> using the HTMLParser and then parse that with minidom or what?

lxml makes this pretty easy:

   >>> from lxml import etree
   >>> parser = etree.HTMLParser()
   >>> tree = etree.parse(the_file_or_url, parser)

The result is an ordinary element tree, so you can treat it as XML: run XPath
or XSLT on it, iterate over it, and so on. Serialising it also gives you plain
XML:

   >>> xml_string = etree.tostring(tree)

Note that this doesn't add any namespaces, so the output is not automatically
valid XHTML. You could rewrite the tags by hand, though.
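
Rewriting the tags by hand might look roughly like the sketch below. It only
moves every element into the XHTML namespace; it does not make the document
valid XHTML (unknown tags such as the Row element from the question stay
unknown). The input string and the names XHTML_NS, root and xhtml are
illustrative assumptions.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Tiny stand-in document; any tree produced by etree.HTMLParser() would do.
root = etree.fromstring("<html><body><p>hello</p></body></html>",
                        etree.HTMLParser())

# Materialise the iterator first so the tree is not walked while being edited.
for el in list(root.iter()):
    # Skip comments/processing instructions and anything already namespaced.
    if isinstance(el.tag, str) and not el.tag.startswith("{"):
        el.tag = "{%s}%s" % (XHTML_NS, el.tag)

xhtml = etree.tostring(root, encoding="unicode")
```

Since the original elements carry no namespace declarations, the serialiser
will probably emit a generated prefix for the namespace rather than a default
xmlns declaration; building a fresh tree with an nsmap is one way around that.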

Stefan