identifying and parsing string in text file
Paul McGuire
ptmcg at austin.rr.com
Sat Mar 8 17:10:33 EST 2008
On Mar 8, 2:02 pm, Nemesis <neme... at nowhere.invalid> wrote:
> Bryan.Fodn... at gmail.com wrote:
> > I have a large file that has many lines like this,
>
> > <element tag="300a,0014" vr="CS" vm="1" len="4"
> > name="DoseReferenceStructureType">SITE</element>
>
> > I would like to identify the line by the tag (300a,0014) and then grab
> > the name (DoseReferenceStructureType) and value (SITE).
>
> You should try with Regular Expressions or if it is something like xml there
> is for sure a library you can use to parse it ...
<snip>
When it comes to parsing HTML or XML of uncontrolled origin, regular
expressions are an iffy proposition. You'd be amazed what kind of
junk shows up inside an XML (or worse, HTML) tag.
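To make that concrete, here is a small sketch (standard library only)
of how a hand-written regex can break on harmless variations of the
same tag. The pattern and the three sample strings are made up for
illustration; they are not taken from anyone's real data.

```python
import re

# Naive pattern: assumes every attribute appears in one fixed order,
# separated by single spaces, all on one line.
naive = re.compile(
    r'<element tag="300a,0014" vr="CS" vm="1" len="4" '
    r'name="(\w+)">(\w+)</element>'
)

# Same element written three ways (hypothetical samples).
same_line = ('<element tag="300a,0014" vr="CS" vm="1" len="4" '
             'name="DoseReferenceStructureType">SITE</element>')
wrapped = ('<element tag="300a,0014" vr="CS" vm="1" len="4"\n'
           'name="DoseReferenceStructureType">SITE</element>')
reordered = ('<element name="DoseReferenceStructureType" '
             'tag="300a,0014" vr="CS" vm="1" len="4">SITE</element>')

for label, text in [("same_line", same_line),
                    ("wrapped", wrapped),
                    ("reordered", reordered)]:
    # Only the first form matches; a line break or a different
    # attribute order is enough to defeat the pattern.
    print(label, naive.search(text) is not None)
```

The regex can of course be patched for each case, but every new
variation in the input means another patch.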
Pyparsing includes a built-in method for constructing tag-matching
parsing patterns, which you can then use to scan through the XML or
HTML source:
from pyparsing import makeXMLTags, withAttribute, SkipTo
testdata = """
<blah>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
<blahblah>
"""
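# Build matching expressions for <element ...> and </element>.
elementStart, elementEnd = makeXMLTags("element")
# Accept only start tags whose "tag" attribute equals "300a,0014".
elementStart.setParseAction(withAttribute(tag="300a,0014"))
# Capture everything up to the closing tag under the name "body".
search = elementStart + SkipTo(elementEnd)("body")
for t in search.searchString(testdata):
    print(t.name)
    print(t.body)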
Prints:
DoseReferenceStructureType
SITE
DoseReferenceStructureType
SITE2
In this case, the parse action withAttribute filters <element> tag
matches, accepting *only* those with the attribute "tag" and the value
"300a,0014". The pattern search then adds on the body of the
<element></element> tag, and gives it the results name "body" so it is
easily accessed after parsing is completed.
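If you would rather collect the matches than print them, the same idea
can be sketched as below. This assumes pyparsing is installed and that
your version still provides the camelCase names used in this post; the
names sample and pairs are just placeholders for illustration.

```python
from pyparsing import makeXMLTags, withAttribute, SkipTo

# Illustrative input, reusing the <element> lines from this post.
sample = """
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
"""

element_start, element_end = makeXMLTags("element")
# Reject any start tag whose "tag" attribute is not "300a,0014".
element_start.setParseAction(withAttribute(tag="300a,0014"))
# Name the text between the start and end tags "body".
search = element_start + SkipTo(element_end)("body")

# Attributes of the start tag and the named body are both available
# by key on each match.
pairs = [(t["name"], t["body"]) for t in search.searchString(sample)]
print(pairs)
```

The element with tag="300Z,0019" is skipped because its start tag is
rejected by the parse action, so only the two matching elements end up
in pairs.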
-- Paul
(More about pyparsing at http://pyparsing.wikispaces.com.)