Improving my text processing script

Paul McGuire ptmcg at austin.rr.com
Thu Sep 1 02:45:40 EDT 2005


Even though you are using re's to look for specific substrings (which
you sort of fake by splitting on "Identifier" and then prepending
"Identifier" to every list element so that the re will match...), this
program has quite a few holes.

What if the word "Identifier" is inside one of the quoted strings?
What if the actual value is "tablename10"?  This will match your
"tablename1" string search, but it is certainly not what you want.
Did you know there are trailing blanks on your table names, which could
prevent any program name from matching?
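
To make those holes concrete, here is a small sketch using only the
standard library.  The sample values (table_names, value, line) are made
up for illustration and are not taken from your script or data; it is
written with the print() function, so adjust for your Python version.

```python
# Hypothetical sample data; these names are illustrative only.
table_names = ["tablename1 ", "tablename2 "]  # note the trailing blanks

# Hole: a substring search treats "tablename10" as a hit for "tablename1".
value = "tablename10"
substring_hit = "tablename1" in value                # matches, but wrongly
exact_hit = value in ["tablename1", "tablename2"]    # whole-value comparison

# Hole: trailing blanks defeat an exact comparison until they are stripped.
padded_hit = "tablename1" in table_names
cleaned = [name.rstrip() for name in table_names]
cleaned_hit = "tablename1" in cleaned

# Hole: splitting on the keyword also splits inside a quoted string.
line = 'Identifier "My Identifier Program"'
pieces = line.split("Identifier")

print(substring_hit, exact_hit, padded_hit, cleaned_hit)
print(pieces)
```

The split produces three pieces instead of two, because the keyword
inside the quotes is treated exactly like the one at the start of the
line.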

So here is an alternative approach using, as many have probably
predicted by now if they've spent any time on this list, the pyparsing
module.  You may ask, "isn't a parser overkill for this problem?" and
the answer will likely be "probably", but in the case of pyparsing, I'd
answer "probably, but it is so easy, and takes care of so much junk
like dealing with quoted strings and intermixed data, so, who cares if
it's overkill?"

So here is the 20-line pyparsing solution.  Insert it into your program
after you have read in tlst, and read in the input data using something
like data = file('plst').read().  (The first line strips the trailing
whitespace from your table names.)

tlist = map(str.rstrip, tlist)

from pyparsing import quotedString,LineStart,LineEnd,removeQuotes
quotedString.setParseAction( removeQuotes )

identLine = (LineStart() + "Identifier" + quotedString +
             LineEnd()).setResultsName("identifier")
tableLine = (LineStart() + "Value" + quotedString +
             LineEnd()).setResultsName("tableref")

interestingLines = ( identLine | tableLine )
thisprog = ""
for toks,start,end in interestingLines.scanString( data ):
    toktype = toks.getName()
    if toktype == 'identifier':
        thisprog = toks[1]
    elif toktype == 'tableref':
        thistable = toks[1]
        if thistable in tlist:
            print '"%s","%s"' % (thisprog, thistable)
        else:
            print "Not", thisprog, "contains wrong table ("+thistable+")"

This program will print out:
"Program1","tablename2"
"Program 2","tablename2"


Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul



