Processing XML that's embedded in HTML
Paul McGuire
ptmcg at austin.rr.com
Tue Jan 22 18:31:19 EST 2008
On Jan 22, 10:57 am, Mike Driscoll <kyoso... at gmail.com> wrote:
> Hi,
>
> I need to parse a fairly complex HTML page that has XML embedded in
> it. I've done parsing before with the xml.dom.minidom module on just
> plain XML, but I cannot get it to work with this HTML page.
>
> The XML looks like this:
>
...
Once again (this IS HTML Day!), instead of parsing the HTML, pyparsing
can help lift the interesting bits and leave the rest alone. Try this
program out:
from pyparsing import makeXMLTags,Word,nums,Combine,oneOf,SkipTo,withAttribute
htmlWithEmbeddedXml = """
<HTML>
<Body>
<p>
<b>Hey! this is really bold!</b>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>1</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, John</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>2</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, Jane</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<table>
<tr><Td>this is in a table, woo-hoo!</td>
more HTML
blah blah blah...
"""
# define pyparsing expressions for XML tags
rowStart,rowEnd = makeXMLTags("Row")
relationshipStart,relationshipEnd = makeXMLTags("Relationship")
priorityStart,priorityEnd = makeXMLTags("Priority")
startDateStart,startDateEnd = makeXMLTags("StartDate")
stopsExistStart,stopsExistEnd = makeXMLTags("StopsExist")
nameStart,nameEnd = makeXMLTags("Name")
addressStart,addressEnd = makeXMLTags("Address")
# define some useful expressions for data of specific types
integer = Word(nums)
date = Combine(Word(nums,exact=2)+"/"+
               Word(nums,exact=2)+"/"+Word(nums,exact=4))
yesOrNo = oneOf("Yes No")
# conversion parse actions
integer.setParseAction(lambda t: int(t[0]))
yesOrNo.setParseAction(lambda t: t[0]=='Yes')
# could also define a conversion for date if you really wanted to
# define format of a <Row>, plus assign results names for each data field
rowRec = rowStart + \
    relationshipStart + SkipTo(relationshipEnd)("relationship") + relationshipEnd + \
    priorityStart + integer("priority") + priorityEnd + \
    startDateStart + date("startdate") + startDateEnd + \
    stopsExistStart + yesOrNo("stopsexist") + stopsExistEnd + \
    nameStart + SkipTo(nameEnd)("name") + nameEnd + \
    addressStart + SkipTo(addressEnd)("address") + addressEnd + \
    rowEnd
# set filtering parse action
rowRec.setParseAction(withAttribute(relationship="Owner",priority=1))
# find all matching rows, matching grammar and filtering parse action
rows = rowRec.searchString(htmlWithEmbeddedXml)
# print the results (uncomment r.dump() statement to see full
# result for each row)
for r in rows:
    # print(r.dump())
    print(r.relationship)
    print(r.priority)
    print(r.startdate)
    print(r.stopsexist)
    print(r.name)
    print(r.address)
This prints:
Owner
1
07/16/2007
False
Doe, John
1905 S 3rd Ave , Hicksville IA 99999
In addition to parsing this data, some conversions were done at parse
time, too: "1" was converted to the integer 1, and "No" was converted
to False. These were done by the conversion parse actions. The
filtering for just those Rows with a Relationship of "Owner" and a
Priority of 1 was done in a more global parse action, created by
withAttribute and attached to the whole rowRec expression. If you
comment out that setParseAction line, you will see that both rows get
retrieved.
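
The program above leaves the date as a string and uses withAttribute for
the filtering. As a rough sketch only, here is how a date conversion and a
hand-written filter could both be written as parse actions. The names
date_expr, rec and only_priority_one are made up for this illustration, the
record is cut down to two fields, and the MM/DD/YYYY format is an
assumption based on the sample data:

```python
from datetime import datetime
from pyparsing import Combine, ParseException, Word, makeXMLTags, nums

# same date expression as in the program above, plus a conversion parse action
# (assumption: dates are MM/DD/YYYY, as the sample value 07/16/2007 suggests)
date_expr = Combine(Word(nums, exact=2) + "/" +
                    Word(nums, exact=2) + "/" + Word(nums, exact=4))
date_expr.setParseAction(lambda t: datetime.strptime(t[0], "%m/%d/%Y").date())

integer = Word(nums).setParseAction(lambda t: int(t[0]))

priorityStart, priorityEnd = makeXMLTags("Priority")
startDateStart, startDateEnd = makeXMLTags("StartDate")

# cut-down record with just two of the fields, for illustration only
rec = (priorityStart + integer("priority") + priorityEnd +
       startDateStart + date_expr("startdate") + startDateEnd)

def only_priority_one(s, loc, tokens):
    # hypothetical hand-written filter: raising ParseException from a
    # parse action makes pyparsing treat the match as a failure
    if tokens.priority != 1:
        raise ParseException(s, loc, "priority is not 1")

rec.setParseAction(only_priority_one)

sample = """
<Priority>1</Priority><StartDate>07/16/2007</StartDate>
<Priority>2</Priority><StartDate>07/17/2007</StartDate>
"""

# searchString skips over the record that the filter rejected
matches = rec.searchString(sample)
for m in matches:
    print(m.priority)
    print(m.startdate)
```

A filter written this way can test anything about the parsed tokens, not
just equality, so it may be handy when withAttribute is not quite enough.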
-- Paul
(Find out more about pyparsing at http://pyparsing.wikispaces.com.)