Extracting xml from html
Stefan Behnel
stefan.behnel-n05pAM at web.de
Tue Sep 18 02:56:30 EDT 2007
kyosohma at gmail.com wrote:
> I am attempting to extract some XML from an HTML document that I get
> returned from a form based web page. For some reason, I cannot figure
> out how to do this.
> Here's a sample of the html:
>
> <html>
> <body>
> lots of screwy text including divs and spans
> <Row status="o">
> <RecordNum>1126264</RecordNum>
> <Make>Mitsubishi</Make>
> <Model>Mirage DE</Model>
> </Row>
> </body>
> </html>
>
> What's the best way to get at the XML? Do I need to somehow parse it
> using the HTMLParser and then parse that with minidom or what?
lxml makes this pretty easy:
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.parse(the_file_or_url, parser)
The result is a regular element tree that you can treat as XML, e.g. with
XPath, XSLT or tree iteration. You will also get plain XML when you serialise it:
>>> xml_string = etree.tostring(tree)
Note that this doesn't add any namespaces, so you will not magically get valid
XHTML or something. You could rewrite the tags by hand, though.
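A minimal sketch of what rewriting the tags by hand could look like, assuming all you want is to move every element into the XHTML namespace. The tiny input document is made up for illustration, and this alone will not make the output valid XHTML: any non-HTML elements (such as the Row data above) end up in the XHTML namespace too.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical input, just enough to have a few elements to rename.
root = etree.fromstring("<html><body><p>hi</p></body></html>",
                        etree.HTMLParser())

for el in root.iter():
    # Comments and processing instructions have a non-string .tag; skip them.
    # Also skip anything that already carries a namespace.
    if isinstance(el.tag, str) and not el.tag.startswith("{"):
        el.tag = "{%s}%s" % (XHTML_NS, el.tag)

xml_string = etree.tostring(root)
```

lxml picks a namespace prefix on its own when serialising, so the output will carry a generated prefix rather than a default `xmlns` declaration.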
Stefan