Extracting xml from html

Mon Sep 17 17:51:23 EDT 2007

On Mon, 17 Sep 2007 17:31:19 -0300, <kyosohma at gmail.com> wrote:

> I am attempting to extract some XML from an HTML document that I get
> returned from a form based web page. For some reason, I cannot figure
> out how to do this. I thought I could use the minidom module to do it,
> but all I get is a screwy traceback:
>
> Traceback (most recent call last):
>   File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in
> parseFile
>     parser.Parse(buffer, 0)
> ExpatError: mismatched tag: line 1, column 357

So your HTML is not a well-formed XML document (like many HTML pages), and  
you can't use an XML parser on it. Even a valid HTML document may not be  
well-formed XML. Let's try with some mismatched tags:

py> text = '''<html>
... <body>
... <p>lots of <div>screwy text including divs and <span>spans</p>
... <Row status="o">
...     <RecordNum>1126264</RecordNum>
...     <Make>Mitsubishi</Make>
...     <Model>Mirage DE</Model>
... </Row>
... </body>
... </html>'''
py>
py> import xml.dom.minidom
py> doc = xml.dom.minidom.parseString(text)
Traceback (most recent call last):
...
xml.parsers.expat.ExpatError: mismatched tag: line 3, column 60

You will need a more robust parser, like BeautifulSoup  
<http://www.crummy.com/software/BeautifulSoup/>. Note that it lowercases  
tag names, so <Row> is looked up as "row":

py> from BeautifulSoup import BeautifulSoup
py> soup = BeautifulSoup(text)
py> for row in soup.findAll("row"):
...   print row.recordnum, row.make.contents, row.model.string
...
<recordnum>1126264</recordnum> [u'Mitsubishi'] Mirage DE

Depending on your document, you may prefer to extract the XML blocks using  
BeautifulSoup, and then parse each one using BeautifulStoneSoup (the XML  
parser) or xml.etree.ElementTree (in the standard library since Python 2.5).
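
As a rough sketch of that two-step idea, here is a standard-library-only  
variant. It swaps in a regular expression for BeautifulSoup in the  
extraction step, so it only works under the assumption that each <Row>  
block is well-formed on its own and that <Row> blocks are never nested.  
The helper name extract_rows is made up for illustration, and the code is  
written for Python 3 rather than the Python 2.4 used above:

```python
import re
import xml.etree.ElementTree as ET

# Same sample as above: broken HTML wrapped around a well-formed <Row> block.
text = '''<html>
<body>
<p>lots of <div>screwy text including divs and <span>spans</p>
<Row status="o">
    <RecordNum>1126264</RecordNum>
    <Make>Mitsubishi</Make>
    <Model>Mirage DE</Model>
</Row>
</body>
</html>'''

# Assumption: <Row> blocks are not nested and are well-formed by themselves.
ROW_BLOCK = re.compile(r"<Row\b[^>]*>.*?</Row>", re.DOTALL)

def extract_rows(html):
    """Return one dict per <Row> block (hypothetical helper, not a library API)."""
    rows = []
    for match in ROW_BLOCK.finditer(html):
        # Each matched block is parsed on its own with a strict XML parser.
        row = ET.fromstring(match.group(0))
        record = {child.tag: child.text for child in row}
        record["status"] = row.get("status")
        rows.append(record)
    return rows

for record in extract_rows(text):
    print(record["RecordNum"], record["Make"], record["Model"])
```

Unlike BeautifulSoup, ElementTree keeps the original tag case, so the keys  
here are "RecordNum", "Make" and "Model". If the blocks can be nested or  
damaged themselves, the regular expression is not enough and a tolerant  
parser is still the safer choice.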

-- 
Gabriel Genellina