Processing XML that's embedded in HTML

Tue Jan 22 18:31:19 EST 2008

On Jan 22, 10:57 am, Mike Driscoll <kyoso... at gmail.com> wrote:
> Hi,
>
> I need to parse a fairly complex HTML page that has XML embedded in
> it. I've done parsing before with the xml.dom.minidom module on just
> plain XML, but I cannot get it to work with this HTML page.
>
> The XML looks like this:
>
...

Once again (this IS HTML Day!), instead of parsing the HTML, pyparsing
can help lift the interesting bits and leave the rest alone.  Try this
program out:

from pyparsing import (makeXMLTags, Word, nums, Combine, oneOf, SkipTo,
                       withAttribute)

htmlWithEmbeddedXml = """
<HTML>
<Body>
<p>
<b>Hey! this is really bold!</b>

<Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>1</Priority>
    <StartDate>07/16/2007</StartDate>
    <StopsExist>No</StopsExist>
    <Name>Doe, John</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
  </Row>

  <Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>2</Priority>
    <StartDate>07/16/2007</StartDate>
    <StopsExist>No</StopsExist>
    <Name>Doe, Jane</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
  </Row>

<table>
<tr><Td>this is in a table, woo-hoo!</td>
more HTML
blah blah blah...
"""

# define pyparsing expressions for XML tags
rowStart,rowEnd                   = makeXMLTags("Row")
relationshipStart,relationshipEnd = makeXMLTags("Relationship")
priorityStart,priorityEnd         = makeXMLTags("Priority")
startDateStart,startDateEnd       = makeXMLTags("StartDate")
stopsExistStart,stopsExistEnd     = makeXMLTags("StopsExist")
nameStart,nameEnd                 = makeXMLTags("Name")
addressStart,addressEnd           = makeXMLTags("Address")

# define some useful expressions for data of specific types
integer = Word(nums)
date = Combine(Word(nums,exact=2)+"/"+
            Word(nums,exact=2)+"/"+Word(nums,exact=4))
yesOrNo = oneOf("Yes No")

# conversion parse actions
integer.setParseAction(lambda t: int(t[0]))
yesOrNo.setParseAction(lambda t: t[0]=='Yes')
# could also define a conversion for date if you really wanted to

# define format of a <Row>, plus assign results names for each data
# field
rowRec = rowStart + \
    relationshipStart + SkipTo(relationshipEnd)("relationship") + \
    relationshipEnd + \
    priorityStart + integer("priority") + priorityEnd + \
    startDateStart + date("startdate") + startDateEnd + \
    stopsExistStart + yesOrNo("stopsexist") + stopsExistEnd + \
    nameStart + SkipTo(nameEnd)("name") + nameEnd + \
    addressStart + SkipTo(addressEnd)("address") + addressEnd + \
    rowEnd

# set filtering parse action
rowRec.setParseAction(withAttribute(relationship="Owner",priority=1))

# find all matching rows, matching grammar and filtering parse action
rows = rowRec.searchString(htmlWithEmbeddedXml)

# print the results (uncomment r.dump() statement to see full
# result for each row)
for r in rows:
    # print(r.dump())
    print(r.relationship)
    print(r.priority)
    print(r.startdate)
    print(r.stopsexist)
    print(r.name)
    print(r.address)

This prints:
Owner
1
07/16/2007
False
Doe, John
1905 S 3rd Ave , Hicksville IA 99999

In addition to parsing this data, some conversions were done at parse
time, too: "1" was converted to the integer 1, and "No" was converted
to False.  These were done by the conversion parse actions.  The
filtering for only those Rows with Relationship="Owner" and
Priority=1 was done in a more global parse action, withAttribute,
attached to rowRec as a whole.  If you comment out the
rowRec.setParseAction line, you will see that both rows get retrieved.
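
As for the date conversion mentioned in the program's comments, here
is a minimal sketch of what such a parse action could look like.  The
name to_date is made up for illustration, and it assumes the dates are
always month/day/year, as in the sample data.  The function only uses
the standard library, so it can be tried on its own:

```python
from datetime import datetime

def to_date(tokens):
    """Convert the first matched token, e.g. "07/16/2007", to a datetime.date.

    Assumes month/day/year order; raises ValueError for anything else.
    """
    return datetime.strptime(tokens[0], "%m/%d/%Y").date()

# Assumed hookup, mirroring the integer and yesOrNo conversions above
# ("date" being the Combine(...) expression from the program):
#     date.setParseAction(to_date)

# Stand-alone check with a plain list standing in for the parsed tokens
print(to_date(["07/16/2007"]))  # 2007-07-16
```

With that attached, r.startdate would presumably come back as a
datetime.date rather than a string, which makes comparing or sorting
rows by date easier.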

-- Paul
(Find out more about pyparsing at http://pyparsing.wikispaces.com.)