identifying and parsing string in text file

Paul McGuire ptmcg at austin.rr.com
Sat Mar 8 17:10:33 EST 2008


On Mar 8, 2:02 pm, Nemesis <neme... at nowhere.invalid> wrote:
> Bryan.Fodn... at gmail.com wrote:
> > I have a large file that has many lines like this,
>
> > <element tag="300a,0014" vr="CS" vm="1" len="4"
> > name="DoseReferenceStructureType">SITE</element>
>
> > I would like to identify the line by the tag (300a,0014) and then grab
> > the name (DoseReferenceStructureType) and value (SITE).
>
> You should try with Regular Expressions or if it is something like xml there
> is for sure a library you can use to parse it ...
<snip>

When it comes to parsing HTML or XML of uncontrolled origin, regular
expressions are an iffy proposition.  You'd be amazed what kind of
junk shows up inside an XML (or worse, HTML) tag.
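
To make that concrete, here is a small sketch of how a plausible-looking
regular expression breaks on harmless variations of the same tag.  The
pattern and the sample strings are my own illustration, not taken from
the original poster's file; the only assumption carried over is the
shape of the <element> line quoted above.

```python
import re

# Hypothetical naive pattern: assumes the tag fits on one line, uses
# double quotes, and lists tag= before name=.
NAIVE = re.compile(r'<element tag="300a,0014".*name="(\w+)">(\w+)</element>')

samples = {
    # Everything on one line: the only case the pattern handles.
    "one_line": '<element tag="300a,0014" vr="CS" '
                'name="DoseReferenceStructureType">SITE</element>',
    # Attributes wrapped onto a second line, as in the original question.
    "wrapped": '<element tag="300a,0014" vr="CS"\n'
               'name="DoseReferenceStructureType">SITE</element>',
    # Same attributes in a different order.
    "reordered": '<element name="DoseReferenceStructureType" '
                 'tag="300a,0014">SITE</element>',
    # Single quotes instead of double quotes.
    "single_quoted": "<element tag='300a,0014' "
                     "name='DoseReferenceStructureType'>SITE</element>",
}

# Record the captured groups, or None where the pattern fails to match.
results = {}
for label, text in samples.items():
    match = NAIVE.search(text)
    results[label] = match.groups() if match else None
    print(label, results[label])
```

Only the first sample matches.  Each failure can be patched (re.DOTALL,
optional groups, alternate quote characters), but the pattern grows with
every patch, which is the usual argument for a tag-aware parser.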

Pyparsing includes a built-in helper function, makeXMLTags, for
constructing tag-matching parsing patterns, which you can then use to
scan through the XML or HTML source:

from pyparsing import makeXMLTags, withAttribute, SkipTo

testdata = """
<blah>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
<blahblah>
"""

elementStart,elementEnd = makeXMLTags("element")
elementStart.setParseAction(withAttribute(tag="300a,0014"))
search = elementStart + SkipTo(elementEnd)("body")

for t in search.searchString(testdata):
    # parenthesized form runs under both Python 2 and Python 3
    print(t.name)
    print(t.body)

Prints:

DoseReferenceStructureType
SITE
DoseReferenceStructureType
SITE2

In this case, the parse action withAttribute filters <element> tag
matches, accepting *only* those with the attribute "tag" and the value
"300a,0014".  The attributes of the matched start tag are available by
name, which is why t.name returns the "name" attribute.  The search
pattern then adds on the body of the <element></element> tag and gives
it the results name "body", so it is easily accessed after parsing is
completed.
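
For comparison, the same filter-then-collect logic can be sketched with
only the standard library, which may help if pyparsing is not available.
This is not how withAttribute is implemented; the class ElementFilter
and its attributes are hypothetical names made up for this
illustration, and it assumes Python 3 (html.parser).

```python
from html.parser import HTMLParser


class ElementFilter(HTMLParser):
    """Collect (name, body) for <element> tags whose tag= attribute matches."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.inside = False      # True while inside a matching <element>
        self.current_name = None
        self.body_parts = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Accept only <element> tags carrying the wanted tag= value.
        if tag == "element" and attrs.get("tag") == self.wanted_tag:
            self.inside = True
            self.current_name = attrs.get("name")
            self.body_parts = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.inside:
            self.body_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "element" and self.inside:
            self.matches.append((self.current_name, "".join(self.body_parts)))
            self.inside = False


testdata = """
<blah>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
<blahblah>
"""

element_filter = ElementFilter("300a,0014")
element_filter.feed(testdata)
element_filter.close()

for name, body in element_filter.matches:
    print(name)
    print(body)
```

The pyparsing version stays shorter because makeXMLTags and
withAttribute already handle the attribute bookkeeping that this class
does by hand.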

-- Paul
(More about pyparsing at http://pyparsing.wikispaces.com.)


