Processing XML that's embedded in HTML

Tue Jan 22 12:32:22 EST 2008

On 22 Jan, 17:57, Mike Driscoll <kyoso... at gmail.com> wrote:
>
> I need to parse a fairly complex HTML page that has XML embedded in
> it. I've done parsing before with the xml.dom.minidom module on just
> plain XML, but I cannot get it to work with this HTML page.

It's HTML day on comp.lang.python today! ;-)

> The XML looks like this:
>
> <Row status="o">
>
>     <Relationship>Owner</Relationship>
>
>     <Priority>1</Priority>
>
>     <StartDate>07/16/2007</StartDate>
>
>     <StopsExist>No</StopsExist>
>
>     <Name>Doe, John</Name>
>
>     <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
>
>   </Row>
>
>   <Row status="o">
>
>     <Relationship>Owner</Relationship>
>
>     <Priority>2</Priority>
>
>     <StartDate>07/16/2007</StartDate>
>
>     <StopsExist>No</StopsExist>
>
>     <Name>Doe, Jane</Name>
>
>     <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
>
>   </Row>
>
> It appears to be enclosed with <XML
> id="grdRegistrationInquiryCustomers"><BoundData>

You could probably find the Row elements with the following XPath
expression:

  //XML/BoundData/Row

More specific would be this:

  //XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row

See below for the relevance of this. You could also try using
getElementById on the document, specifying the id attribute's value
given above, then descending to find the Row elements.
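
As a rough illustration of how the two expressions differ, here is a
minimal sketch using the standard library's xml.etree.ElementTree
rather than libxml2dom. ElementTree only supports a subset of XPath,
so the expressions start with ".//" instead of "//". The SAMPLE
document is a hypothetical, well-formed stand-in for the real page,
and the second island ("somethingElse") is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page (the real page is not
# well-formed, so this would not work on it directly).
SAMPLE = """<body>
<XML id="grdRegistrationInquiryCustomers"><BoundData>
  <Row status="o"><Relationship>Owner</Relationship><Priority>1</Priority><Name>Doe, John</Name></Row>
  <Row status="o"><Relationship>Owner</Relationship><Priority>2</Priority><Name>Doe, Jane</Name></Row>
</BoundData></XML>
<XML id="somethingElse"><BoundData><Row status="x"/></BoundData></XML>
</body>"""

root = ET.fromstring(SAMPLE)

# The general expression matches the Row elements of every XML island.
all_rows = root.findall(".//XML/BoundData/Row")

# The more specific expression only matches rows in the island with this id.
grid_rows = root.findall(
    ".//XML[@id='grdRegistrationInquiryCustomers']/BoundData/Row")

names = [row.findtext("Name") for row in grid_rows]
```

If the page only ever contains one such island, the shorter expression
is probably enough; the id test just guards against picking up rows
from some other island.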

> The rest of the document is html, javascript div tags, etc. I need the
> information only from the row where the Relationship tag = Owner and
> the Priority tag = 1. The rest I can ignore. When I tried parsing it
> with minidom, I get an ExpatError: mismatched tag: line 1, column 357
> so I think the HTML is probably malformed.

Or that it isn't well-formed XML, at least.

> I looked at BeautifulSoup, but it seems to separate its HTML
> processing from its XML processing. Can someone give me some pointers?

With libxml2dom [1] I'd do something like this:

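  import libxml2dom
  # Parse the file with libxml2's HTML parser, which tolerates markup
  # that isn't well-formed XML.
  d = libxml2dom.parse(filename, html=1)
  # or: d = libxml2dom.parseURI(uri, html=1)
  rows = d.xpath("//XML/BoundData/Row")
  # or: rows = d.xpath('//XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row')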

Even though the document is interpreted as HTML, you should get a DOM
containing the elements as libxml2 interprets them.
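
If installing libxml2dom isn't an option, a different approach might
be to cut the island out of the page text and parse only that part.
The following is a sketch under the assumption that the island itself
is well-formed XML even though the surrounding HTML is not. It uses a
regular expression and xml.etree.ElementTree, not libxml2dom, and the
PAGE text and the function name owner_priority_one are made up for the
example. It also picks out the row where Relationship is Owner and
Priority is 1.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page: sloppy HTML around a well-formed XML island.
PAGE = """<html><body><p>unclosed paragraph<br>
<XML id="grdRegistrationInquiryCustomers"><BoundData>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>1</Priority>
<Name>Doe, John</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>2</Priority>
<Name>Doe, Jane</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
</BoundData></XML>
<div>more HTML</body></html>"""

# Matches the island from its opening tag to the first closing </XML>.
ISLAND = re.compile(
    r'<XML\b[^>]*\bid="grdRegistrationInquiryCustomers"[^>]*>.*?</XML>',
    re.DOTALL)


def owner_priority_one(page):
    """Return the matching rows as dicts mapping child tag to text."""
    match = ISLAND.search(page)
    if match is None:
        return []
    # Assumption: the island is well-formed XML on its own.
    island = ET.fromstring(match.group(0))
    rows = island.findall(
        "./BoundData/Row[Relationship='Owner'][Priority='1']")
    return [{child.tag: (child.text or "").strip() for child in row}
            for row in rows]


records = owner_priority_one(PAGE)
```

The predicates compare the complete text of each child element, so
this only works if the values carry no stray whitespace. With
libxml2dom you could do the same filtering in the XPath expression or
by looping over the rows.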

> I am currently using Python 2.5 on Windows XP. I will be using
> Internet Explorer 6 since the document will not display correctly in
> Firefox.

That shouldn't be much of a surprise, it must be said. The page isn't
XHTML, where you might be able to extend the document via XML, so the
whole document has to be "proper" HTML.

Paul

[1] http://www.python.org/pypi/libxml2dom