Processing XML that's embedded in HTML

Mike Driscoll kyosohma at gmail.com
Tue Jan 22 11:57:18 EST 2008


Hi,

I need to parse a fairly complex HTML page that has XML embedded in
it. I've done parsing before with the xml.dom.minidom module on just
plain XML, but I cannot get it to work with this HTML page.

The XML looks like this:

<Row status="o">
  <Relationship>Owner</Relationship>
  <Priority>1</Priority>
  <StartDate>07/16/2007</StartDate>
  <StopsExist>No</StopsExist>
  <Name>Doe, John</Name>
  <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<Row status="o">
  <Relationship>Owner</Relationship>
  <Priority>2</Priority>
  <StartDate>07/16/2007</StartDate>
  <StopsExist>No</StopsExist>
  <Name>Doe, Jane</Name>
  <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>

The rows appear to be enclosed in these opening tags:
<XML id="grdRegistrationInquiryCustomers"><BoundData>

The rest of the document is HTML, JavaScript, div tags, etc. I need the
information only from the row where the Relationship tag is Owner and
the Priority tag is 1. The rest I can ignore. When I tried parsing the
page with minidom, I got "ExpatError: mismatched tag: line 1, column 357",
so I think the HTML is probably malformed.
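
One possible way around the malformed HTML, offered only as a rough sketch, is to cut the XML island out of the page with a regular expression and hand just that fragment to an XML parser. The sketch below assumes the island is closed by matching </BoundData></XML> tags (only the opening tags are shown above) and that the island itself is well-formed. SAMPLE_HTML and find_primary_owner are hypothetical names used for illustration, and ElementTree is swapped in for minidom for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real page: deliberately sloppy HTML
# around the XML island. The closing </BoundData></XML> tags are an
# assumption; the original page may differ.
SAMPLE_HTML = """<html><body><div><p>unclosed paragraph<br>
<XML id="grdRegistrationInquiryCustomers"><BoundData>
  <Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>1</Priority>
    <StartDate>07/16/2007</StartDate>
    <StopsExist>No</StopsExist>
    <Name>Doe, John</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
  </Row>
  <Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>2</Priority>
    <StartDate>07/16/2007</StartDate>
    <StopsExist>No</StopsExist>
    <Name>Doe, Jane</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
  </Row>
</BoundData></XML>
</div></body></html>"""

# Capture everything between the <XML id="..."> tag and its </XML> close.
ISLAND_RE = re.compile(
    r'<XML\s+id="grdRegistrationInquiryCustomers"\s*>(.*?)</XML>',
    re.DOTALL | re.IGNORECASE,
)


def find_primary_owner(html):
    """Return the Owner / Priority 1 row as a dict, or None if absent."""
    match = ISLAND_RE.search(html)
    if match is None:
        return None
    # Only the island is parsed, so broken HTML elsewhere is never seen
    # by the XML parser. The root element here is <BoundData>.
    root = ET.fromstring(match.group(1))
    for row in root.findall("Row"):
        relationship = row.findtext("Relationship", "").strip()
        priority = row.findtext("Priority", "").strip()
        if relationship == "Owner" and priority == "1":
            return dict((child.tag, (child.text or "").strip()) for child in row)
    return None


owner = find_primary_owner(SAMPLE_HTML)
if owner is not None:
    print(owner["Name"])
```

If the island itself turns out to be malformed, ET.fromstring would still raise a parse error, so this only helps when the breakage is in the surrounding HTML.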

I looked at BeautifulSoup, but it seems to separate its HTML
processing from its XML processing. Can someone give me some pointers?

I am currently using Python 2.5 on Windows XP. I will be using
Internet Explorer 6 since the document will not display correctly in
Firefox.

Thank you very much!

Mike


