Extracting xml from html
Laurent Pointal
laurent.pointal at limsi.fr
Wed Sep 19 04:06:51 EDT 2007
kyosohma at gmail.com wrote:
> On Sep 18, 1:56 am, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
>> kyoso... at gmail.com wrote:
>>> I am attempting to extract some XML from an HTML document that I get
>>> returned from a form based web page. For some reason, I cannot figure
>>> out how to do this.
>>> Here's a sample of the html:
>>> <html>
>>> <body>
>>> lots of screwy text including divs and spans
>>> <Row status="o">
>>> <RecordNum>1126264</RecordNum>
>>> <Make>Mitsubishi</Make>
>>> <Model>Mirage DE</Model>
>>> </Row>
>>> </body>
>>> </html>
>>> What's the best way to get at the XML? Do I need to somehow parse it
>>> using the HTMLParser and then parse that with minidom or what?
>> lxml makes this pretty easy:
>>
>> >>> parser = etree.HTMLParser()
>> >>> tree = etree.parse(the_file_or_url, parser)
>>
>> This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
>> tree iteration, ... You will also get plain XML when you serialise it to XML:
>>
>> >>> xml_string = etree.tostring(tree)
>>
>> Note that this doesn't add any namespaces, so you will not magically get valid
>> XHTML or something. You could rewrite the tags by hand, though.
>>
>> Stefan
>
> I got it to work with lxml. See below:
>
> def Parser(filename):
>     parser = etree.HTMLParser()
>     tree = etree.parse(r'path/to/nextpage.htm', parser)
>     xml_string = etree.tostring(tree)
>     events = ("recordnum", "primaryowner", "customeraddress")
>     context = etree.iterparse(StringIO(xml_string), tag='')
>     for action, elem in context:
>         tag = elem.tag
>         if tag == 'primaryowner':
>             owner = elem.text
>         elif tag == 'customeraddress':
>             address = elem.text
>         else:
>             pass
>
>     print 'Primary Owner: %s' % owner
>     print 'Address: %s' % address
>
> Does this make sense? It works pretty well, but I don't really
> understand everything that I'm doing.
>
> Mike
>
Q: Once you have your document in an XML tree in memory, why do you
switch to event-based handling to extract your data?
Try manipulating the tree directly:
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(r'path/to/nextpage.htm', parser)
# The HTML parser lowercases tag names, so search for "row", not "Row".
myrows = tree.findall(".//row")
# Then work with the sub-elements.
for r in myrows:
    rnumelem = r.find("recordnum")
    makeelem = r.find("make")
    modelelem = r.find("model")
    # ... and so on for the other fields.
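
For what it's worth, here is a minimal self-contained sketch of the same idea, run against the sample HTML from the original question instead of a file. The names SAMPLE_HTML and extract_rows are made up for illustration, and it assumes Python 3 with lxml installed. It relies on the HTML parser lowercasing element names, which is why the lookups use "row" and "recordnum".

```python
from lxml import etree

# Sample document copied from the original question (illustration only).
SAMPLE_HTML = """<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>"""


def extract_rows(html_text):
    """Return one dict per <Row> element found in html_text.

    Hypothetical helper, not part of lxml.
    """
    root = etree.fromstring(html_text, etree.HTMLParser())
    rows = []
    # The HTML parser lowercases element names, hence "row" rather than "Row".
    for row in root.findall(".//row"):
        rows.append({
            "status": row.get("status"),
            "recordnum": row.findtext("recordnum"),
            "make": row.findtext("make"),
            "model": row.findtext("model"),
        })
    return rows


if __name__ == "__main__":
    for record in extract_rows(SAMPLE_HTML):
        print(record)
```

findtext() returns None when a sub-element is missing, so a row without, say, a <Model> does not raise an error; check for None before using the value.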