Extracting xml from html

kyosohma at gmail.com
Tue Sep 18 15:33:40 EDT 2007


On Sep 18, 1:56 am, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
> kyoso... at gmail.com wrote:
> > I am attempting to extract some XML from an HTML document that I get
> > returned from a form based web page. For some reason, I cannot figure
> > out how to do this.
> > Here's a sample of the html:
>
> > <html>
> > <body>
> > lots of screwy text including divs and spans
> > <Row status="o">
> >     <RecordNum>1126264</RecordNum>
> >     <Make>Mitsubishi</Make>
> >     <Model>Mirage DE</Model>
> > </Row>
> > </body>
> > </html>
>
> > What's the best way to get at the XML? Do I need to somehow parse it
> > using the HTMLParser and then parse that with minidom or what?
>
> lxml makes this pretty easy:
>
>    >>> from lxml import etree
>    >>> parser = etree.HTMLParser()
>    >>> tree = etree.parse(the_file_or_url, parser)
>
> This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
> tree iteration, ... You will also get plain XML when you serialise it to XML:
>
>    >>> xml_string = etree.tostring(tree)
>
> Note that this doesn't add any namespaces, so you will not magically get valid
> XHTML or something. You could rewrite the tags by hand, though.
>
> Stefan

I got it to work with lxml. See below:

from io import BytesIO

from lxml import etree

def Parser(filename):
    # Parse the HTML leniently, then serialise the tree as well-formed XML.
    parser = etree.HTMLParser()
    tree = etree.parse(filename, parser)
    xml_string = etree.tostring(tree)
    # Defaults, in case a tag is missing from the page.
    owner = address = None
    # Re-read the XML; iterparse yields an "end" event for each element.
    context = etree.iterparse(BytesIO(xml_string))
    for action, elem in context:
        tag = elem.tag
        # The HTML parser lowercases tag names.
        if tag == 'primaryowner':
            owner = elem.text
        elif tag == 'customeraddress':
            address = elem.text

    print('Primary Owner: %s' % owner)
    print('Address: %s' % address)
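
For comparison, here is a rough sketch of the same kind of lookup done
directly on the parsed tree, without serialising it and parsing it a second
time. It runs against a copy of the sample HTML quoted above. The names
SAMPLE_HTML and extract_rows are made up for illustration, and I am assuming
the real page is shaped like that sample:

```python
from lxml import etree

# Sample markup copied from the HTML quoted earlier in this thread.
SAMPLE_HTML = """<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
    <RecordNum>1126264</RecordNum>
    <Make>Mitsubishi</Make>
    <Model>Mirage DE</Model>
</Row>
</body>
</html>"""


def extract_rows(html_text):
    """Return one dict per <Row> element, keyed by child tag name.

    Hypothetical helper, not part of lxml.
    """
    # etree.HTML() parses a string with lxml's forgiving HTML parser.
    root = etree.HTML(html_text)
    rows = []
    # The HTML parser lowercases tag names, so look for 'row', not 'Row'.
    for row in root.iter('row'):
        rows.append({child.tag: child.text for child in row})
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Each dict is keyed by the lowercased tag names ('recordnum', 'make',
'model'), which is also why the comparisons in my function use lowercase.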

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.

Mike



