Extracting XML from HTML

Laurent Pointal laurent.pointal at limsi.fr
Wed Sep 19 04:06:51 EDT 2007


kyosohma at gmail.com wrote:
> On Sep 18, 1:56 am, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
>> kyoso... at gmail.com wrote:
>>> I am attempting to extract some XML from an HTML document that I get
>>> returned from a form based web page. For some reason, I cannot figure
>>> out how to do this.
>>> Here's a sample of the html:
>>> <html>
>>> <body>
>>> lots of screwy text including divs and spans
>>> <Row status="o">
>>>     <RecordNum>1126264</RecordNum>
>>>     <Make>Mitsubishi</Make>
>>>     <Model>Mirage DE</Model>
>>> </Row>
>>> </body>
>>> </html>
>>> What's the best way to get at the XML? Do I need to somehow parse it
>>> using the HTMLParser and then parse that with minidom or what?
>> lxml makes this pretty easy:
>>
>>    >>> parser = etree.HTMLParser()
>>    >>> tree = etree.parse(the_file_or_url, parser)
>>
>> This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
>> tree iteration, ... You will also get plain XML when you serialise it to XML:
>>
>>    >>> xml_string = etree.tostring(tree)
>>
>> Note that this doesn't add any namespaces, so you will not magically get valid
>> XHTML or something. You could rewrite the tags by hand, though.
>>
>> Stefan
> 
> I got it to work with lxml. See below:
> 
> def Parser(filename):
>     parser = etree.HTMLParser()
>     tree = etree.parse(r'path/to/nextpage.htm', parser)
>     xml_string = etree.tostring(tree)
>     events = ("recordnum", "primaryowner", "customeraddress")
>     context = etree.iterparse(StringIO(xml_string), tag='')
>     for action, elem in context:
>         tag = elem.tag
>         if tag == 'primaryowner':
>             owner = elem.text
>         elif tag == 'customeraddress':
>             address = elem.text
>         else:
>             pass
> 
>     print 'Primary Owner: %s' % owner
>     print 'Address: %s' % address
> 
> Does this make sense? It works pretty well, but I don't really
> understand everything that I'm doing.
> 
> Mike
> 

A question: once you have your document in an XML tree in memory, why do
you switch to event-based handling to extract your data?

Try manipulating the tree directly.

from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(r'path/to/nextpage.htm', parser)
# The HTML parser lowercases tag names, so search for "row", not "Row".
myrows = tree.findall(".//row")

# Then work with the sub-elements.
for r in myrows:
    rnumelem = r.find("recordnum")
    makeelem = r.find("make")
    modelelem = r.find("model")

and so on for the other fields.
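
To make the "work with the sub-elements" step concrete, here is a minimal
sketch, assuming Python 3 with lxml installed. The inline SAMPLE_HTML string
stands in for the real page, and the extract_rows helper and its dict layout
are illustrative names of my own, not something from the thread. Note that
lxml's HTML parser lowercases element names, which is why the lookups use
"row", "recordnum" and so on.

```python
from lxml import etree

# Inline stand-in for the page from the thread (sample data, an assumption).
SAMPLE_HTML = """<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
    <RecordNum>1126264</RecordNum>
    <Make>Mitsubishi</Make>
    <Model>Mirage DE</Model>
</Row>
</body>
</html>"""


def extract_rows(html_text):
    """Return one dict per <Row> element found in html_text.

    Hypothetical helper: it only illustrates walking the parsed tree.
    """
    parser = etree.HTMLParser()
    root = etree.fromstring(html_text, parser)
    records = []
    # The HTML parser lowercases tag names, hence "row" rather than "Row".
    for row in root.findall(".//row"):
        records.append({
            "status": row.get("status"),             # attribute on <Row>
            "recordnum": row.findtext("recordnum"),  # text of the child element
            "make": row.findtext("make"),
            "model": row.findtext("model"),
        })
    return records


if __name__ == "__main__":
    for record in extract_rows(SAMPLE_HTML):
        print(record)
```

For a file on disk, etree.parse(path, parser) as in the snippet above should
work the same way; findtext() returns None when a child element is missing,
so there is no need to check each find() result before reading .text.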



