need help extracting data from a text file

Paul McGuire ptmcg at austin.rr._bogus_.com
Mon Nov 7 12:43:17 EST 2005


<nephish at xit.net> wrote in message
news:1131375863.977379.120620 at f14g2000cwb.googlegroups.com...
> Hey there,
> i have a text file with a bunch of values scattered throughout it.
> i am needing to pull out a value that is in parenthesis right after a
> certain word,
> like the first time the word 'foo' is found, retrieve the values in the
> next set of parenthesis (bar) and it would return 'bar'
>
> i think i can use re to do this, but is there some easier way?
> thanks
>
Using string methods to locate the 'foo' instances is by far the fastest way
to go.
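
As a minimal sketch of the string-method approach (the helper name
value_after and the sample text are made up for illustration, and it assumes
the parentheses are not nested), str.find can locate the word and then the
next pair of parentheses:

```python
def value_after(text, word, start=0):
    """Return the text inside the first (...) that follows `word`, or None."""
    word_at = text.find(word, start)
    if word_at == -1:
        return None          # the word never occurs
    open_at = text.find("(", word_at + len(word))
    if open_at == -1:
        return None          # no opening parenthesis after the word
    close_at = text.find(")", open_at + 1)
    if close_at == -1:
        return None          # opening parenthesis is never closed
    return text[open_at + 1:close_at]

# hypothetical sample input
sample = "some text foo (bar) and later foo(baz)"
print(value_after(sample, "foo"))
```

Passing a start offset lets the caller continue from a previous hit to pick
up later occurrences.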

If your requirements get more complicated, look into using pyparsing
(http://pyparsing.sourceforge.net).  Here is a pyparsing rendition of this
problem.  It makes three scans through some sample data - the first lists
all matches, the second ignores matches that fall inside a quoted string,
and the third reports only the third match.  This kind of context-sensitive
matching is harder to get right with basic string methods and re.

-- Paul

data = """
i have a text file with a bunch of foo(bar1) values scattered throughout it.
i am needing to pull out a value that foo(bar2) is in parenthesis right
after a
certain word,
like the foo(bar3) first time the word 'foo' is found, retrieve the values
in the
next set of parenthesis foo(bar4) and it would return 'bar'
do we want to skip things in quotes, such as 'foo(barInQuotes)'?
"""

from pyparsing import Literal,SkipTo,quotedString,ParseException

pattern = Literal("foo") + "(" + SkipTo(")").setResultsName("payload") + ")"

# report all occurrences of xxx found in "foo(xxx)"
for tokens,start,end in pattern.scanString(data):
    print tokens.payload, "at location", start
print

# ignore quoted strings
pattern.ignore(quotedString)
for tokens,start,end in pattern.scanString(data):
    print tokens.payload, "at location", start
print

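# only report 3rd occurrence
# (quoted strings are still ignored, from the pattern.ignore call above)
tokenMatch = {'foo':0}   # running count of matches, keyed by leading word
def thirdTimeOnly(strg,loc,tokens):
    # parse action: raising ParseException rejects this match, so
    # scanString skips it and keeps scanning
    word = tokens[0]
    if word in tokenMatch:
        tokenMatch[word] += 1
        if tokenMatch[word] != 3:
            raise ParseException(strg,loc,"wrong occurrence of token")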

pattern.setParseAction(thirdTimeOnly)
for tokens,start,end in pattern.scanString(data):
    print tokens.payload, "at location", start
print

Prints:
bar1 at location 36
bar2 at location 116
bar3 at location 181
bar4 at location 278
barInQuotes at location 360

bar1 at location 36
bar2 at location 116
bar3 at location 181
bar4 at location 278

bar3 at location 181
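
For comparison, here is a rough sketch of the first and third scans using
only the re module.  The names FOO_CALL, payloads and nth_payload, and the
sample string, are invented for this illustration.  Note that the regular
expression knows nothing about quoted strings, so a quoted foo(...) is
counted like any other; handling that is where the re approach starts to
need extra work.

```python
import re

# "foo(" followed by anything up to the next ")"; assumes no nested parens
FOO_CALL = re.compile(r"foo\(([^)]*)\)")

def payloads(text):
    """Return (payload, start offset) for every foo(...) in text."""
    return [(m.group(1), m.start()) for m in FOO_CALL.finditer(text)]

def nth_payload(text, n):
    """Return the payload of the n-th foo(...) (1-based), or None."""
    for count, m in enumerate(FOO_CALL.finditer(text), 1):
        if count == n:
            return m.group(1)
    return None

# hypothetical sample input; the quoted occurrence is NOT skipped
sample = "a foo(bar1) b foo(bar2) c 'foo(quoted)' d foo(bar3)"
for payload, start in payloads(sample):
    print(payload, "at location", start)
print(nth_payload(sample, 3))
```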




