Processing XML that's embedded in HTML

Mike Driscoll kyosohma at gmail.com
Wed Jan 23 11:30:24 EST 2008


On Jan 22, 5:31 pm, Paul McGuire <pt... at austin.rr.com> wrote:
> On Jan 22, 10:57 am, Mike Driscoll <kyoso... at gmail.com> wrote:
> > Hi,
>
> > I need to parse a fairly complex HTML page that has XML embedded in
> > it. I've done parsing before with the xml.dom.minidom module on just
> > plain XML, but I cannot get it to work with this HTML page.
>
> > The XML looks like this:
>
> ...
>
> Once again (this IS HTML Day!), instead of parsing the HTML, pyparsing
> can help lift the interesting bits and leave the rest alone.  Try this
> program out:
>

Happy post-HTML Day to you!


> from pyparsing import (makeXMLTags, Word, nums, Combine, oneOf,
>     SkipTo, withAttribute)
>
> htmlWithEmbeddedXml = """
> <HTML>
> <Body>
> <p>
> <b>Hey! this is really bold!</b>
>
> <Row status="o">
>     <Relationship>Owner</Relationship>
>     <Priority>1</Priority>
>     <StartDate>07/16/2007</StartDate>
>     <StopsExist>No</StopsExist>
>     <Name>Doe, John</Name>
>     <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
>   </Row>
>
>   <Row status="o">
>     <Relationship>Owner</Relationship>
>     <Priority>2</Priority>
>     <StartDate>07/16/2007</StartDate>
>     <StopsExist>No</StopsExist>
>     <Name>Doe, Jane</Name>
>     <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
>   </Row>
>
> <table>
> <tr><Td>this is in a table, woo-hoo!</td>
> more HTML
> blah blah blah...
> """
>
> # define pyparsing expressions for XML tags
> rowStart,rowEnd                   = makeXMLTags("Row")
> relationshipStart,relationshipEnd = makeXMLTags("Relationship")
> priorityStart,priorityEnd         = makeXMLTags("Priority")
> startDateStart,startDateEnd       = makeXMLTags("StartDate")
> stopsExistStart,stopsExistEnd     = makeXMLTags("StopsExist")
> nameStart,nameEnd                 = makeXMLTags("Name")
> addressStart,addressEnd           = makeXMLTags("Address")
>
> # define some useful expressions for data of specific types
> integer = Word(nums)
> date = Combine(Word(nums,exact=2)+"/"+
>             Word(nums,exact=2)+"/"+Word(nums,exact=4))
> yesOrNo = oneOf("Yes No")
>
> # conversion parse actions
> integer.setParseAction(lambda t: int(t[0]))
> yesOrNo.setParseAction(lambda t: t[0]=='Yes')
> # could also define a conversion for date if you really wanted to
>
> # define format of a <Row>, plus assign results names for each data
> # field
> rowRec = rowStart + \
>     relationshipStart + SkipTo(relationshipEnd)("relationship") + \
>     relationshipEnd + \
>     priorityStart + integer("priority") + priorityEnd + \
>     startDateStart + date("startdate") + startDateEnd + \
>     stopsExistStart + yesOrNo("stopsexist") + stopsExistEnd + \
>     nameStart + SkipTo(nameEnd)("name") + nameEnd + \
>     addressStart + SkipTo(addressEnd)("address") + addressEnd + \
>     rowEnd
>
> # set filtering parse action
> rowRec.setParseAction(withAttribute(relationship="Owner",priority=1))
>
> # find all matching rows, matching grammar and filtering parse action
> rows = rowRec.searchString(htmlWithEmbeddedXml)
>
> # print the results (uncomment r.dump() statement to see full
> # result for each row)
> for r in rows:
>     # print(r.dump())
>     print(r.relationship)
>     print(r.priority)
>     print(r.startdate)
>     print(r.stopsexist)
>     print(r.name)
>     print(r.address)
>
> This prints:
> Owner
> 1
> 07/16/2007
> False
> Doe, John
> 1905 S 3rd Ave , Hicksville IA 99999
>
> In addition to parsing this data, some conversions were done at parse
> time, too - "1" was converted to the value 1, and "No" was converted
> to False.  These were done by the conversion parse actions.  The
> filtering just for Rows containing Relationship="Owner" and
> Priority=1 was done in a more global parse action, called
> withAttribute.  If you comment this line out, you will see that both
> rows get retrieved.
>
> -- Paul
> (Find out more about pyparsing at http://pyparsing.wikispaces.com.)
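
Paul's code comment mentions that a conversion for the date field could be defined as well. A minimal sketch of what such a conversion function might look like follows. The name to_date is made up for illustration, and a plain list stands in for the parsed tokens so the snippet runs without pyparsing installed:

```python
from datetime import datetime

def to_date(tokens):
    """Convert a matched 'MM/DD/YYYY' string (tokens[0]) to a datetime.date."""
    return datetime.strptime(tokens[0], "%m/%d/%Y").date()

# Assumption: in the quoted program this would presumably be attached with
#     date.setParseAction(to_date)
# in the same way as the integer and yesOrNo conversions.
# Here a one-element list plays the role of the parse results.
start = to_date(["07/16/2007"])
print(start.isoformat())  # prints 2007-07-16
```

With a conversion like this in place, r.startdate would come back as a date object rather than a string, so it could be compared or sorted directly.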

I've heard of this module, but never used it. Your code runs almost
out of the box on my file and returns the correct result. That's
pretty cool!

It looks like the wiki you linked to has quite a few pieces of example
code. I'll have to look this over. While I like lxml's very
object-oriented way of doing things, I tend to get overwhelmed by its
tutorials for some reason. One more example of all those college OOP
classes being a waste of money...
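
For comparison, here is a rough standard-library-only sketch of the same "lift the interesting bits" idea. This is not Paul's pyparsing approach: it swaps in a regular expression to cut out each <Row> fragment and xml.etree.ElementTree to parse it. It assumes every <Row> block is well-formed XML and that rows are never nested. The names ROW_RE and extract_rows are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match from each <Row ...> to its closing </Row>.
ROW_RE = re.compile(r"<Row\b.*?</Row>", re.DOTALL)

def extract_rows(html):
    """Return one dict per <Row> fragment found in the surrounding HTML."""
    rows = []
    for fragment in ROW_RE.findall(html):
        elem = ET.fromstring(fragment)  # each fragment is parsed as XML
        record = {child.tag.lower(): (child.text or "").strip()
                  for child in elem}
        # Same conversions as the quoted parse actions.
        record["priority"] = int(record["priority"])
        record["stopsexist"] = record["stopsexist"] == "Yes"
        rows.append(record)
    return rows

sample = """
<HTML><Body><p><b>Hey! this is really bold!</b>
<Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>1</Priority>
    <StartDate>07/16/2007</StartDate>
    <StopsExist>No</StopsExist>
    <Name>Doe, John</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>2</Priority>
    <StartDate>07/16/2007</StartDate>
    <StopsExist>No</StopsExist>
    <Name>Doe, Jane</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<table><tr><Td>this is in a table, woo-hoo!</td>
"""

# Filter in plain Python, mirroring relationship="Owner", priority=1.
owners = [r for r in extract_rows(sample)
          if r["relationship"] == "Owner" and r["priority"] == 1]
for r in owners:
    print(r["name"], r["startdate"], r["stopsexist"])
```

The regular expression only locates the fragments; it would break on nested or malformed rows, which is where a grammar-based tool like pyparsing is likely the more robust choice.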

Thank you for the help.

Mike


