Regex help needed

Paul McGuire ptmcg at austin.rr._bogus_.com
Tue Jan 10 18:53:36 EST 2006


"rh0dium" <sklass at pointcircle.com> wrote in message
news:1136932787.428988.62560 at g44g2000cwa.googlegroups.com...
>
> Paul McGuire wrote:
>
> > ident = Combine( Word(alphas,alphanums+"_") + LPAR + RPAR )
>
> This will only work for a word followed by parentheses (i.e.
> somefunction() )
>
> > If you *really* want everything on the first line to be the ident, try
> > this:
> >
> > ident = Word(alphas,alphanums+"_") + restOfLine
> > or
> > ident = Combine( Word(alphas,alphanums+"_") + restOfLine )
>
> This nicely grabs the "\r"..  How can I get around it?
>
> > Now the next step is to assign field names to the results:
> >
> > dataFormat = ident.setResultsName("ident") + ( dblQuotedString |
> > quoteList ).setResultsName("contents")
>
> This is super cool!!
>
> So let's take this for example
>
> test= 'fprintf( outFile "leSetInstSelectable( t )\n" )\r\n ("test"
> "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'
>
> Now I want the ident to pull out 'fprintf( outFile
> "leSetInstSelectable( t )\n" )' so I tried to do this?
>
>     ident = Forward()
>     ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
> dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)
>
> Borrowing from the example listed previously.  But it bombs out cause
> it wants a ")"  but it has one..  Forward() ROCKS!!
>
> Also how does it know to do this for just the first line?  It would
> seem that this will work for every line - No?
>
This works for me:

test4 = r"""fprintf( outFile "leSetInstSelectable( t )\n" )
("test"
"test1" "foo aasdfasdf"
"newline" "test2")
"""

# LPAR, RPAR and quoteList are the definitions used earlier in this thread
ident = Forward()
ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
            dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)
dataFormat = ident + ( dblQuotedString | quoteList )

print(dataFormat.parseString(test4))

Prints:
[['fprintf', '(', 'outFile', '"leSetInstSelectable( t )\\n"', ')'],
['"test"', '"test1"', '"foo aasdfasdf"', '"newline"', '"test2"']]
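
For anyone who wants to run the snippet above on its own, here is a self-contained sketch. The definitions of LPAR, RPAR and quoteList are not shown in this message, so the ones below are assumptions, chosen only because they reproduce the printed output (the "(" and ")" show up inside the ident group but not around the quoted list):

```python
from pyparsing import (Forward, Group, Literal, OneOrMore, Suppress, Word,
                       ZeroOrMore, alphas, alphanums, dblQuotedString)

# Assumed stand-ins for definitions made earlier in the thread.
LPAR = Literal("(")   # kept in the results
RPAR = Literal(")")
# Assumed: a parenthesized list of quoted strings, parens suppressed.
quoteList = Group(Suppress("(") + OneOrMore(dblQuotedString) + Suppress(")"))

test4 = r"""fprintf( outFile "leSetInstSelectable( t )\n" )
("test"
"test1" "foo aasdfasdf"
"newline" "test2")
"""

# Forward lets ident refer to itself, so nested calls such as f( g( x ) ) parse.
ident = Forward()
ident << Group(Word(alphas, alphanums) + LPAR
               + ZeroOrMore(dblQuotedString | ident | Word(alphas, alphanums))
               + RPAR)
dataFormat = ident + (dblQuotedString | quoteList)

result = dataFormat.parseString(test4).asList()
print(result)
```

Note that test4 is a raw string, so the \n inside the quotes is a backslash followed by "n", not a real line break.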


1. Is there supposed to be a real line break in the string
"leSetInstSelectable( t )\n", or just a slash-n at the end?  pyparsing
quoted strings do not accept multiline quotes, but they do accept escaped
characters such as "\t" "\n", etc.  That is, to pyparsing:

"\n this is a valid \t \n string"

"this is not
a valid string"

Part of the confusion is that your examples include explicit \r\n
characters.  I'm assuming this is to reflect what you see when listing out
the Python variable containing the string.  (Are you opening a text file
with "rb" to read in binary?  Try opening with just "r", and this may
resolve your \r\n problems.)
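
A quick way to check the quoted-string behavior from point 1 is to feed both example strings to dblQuotedString. This is only a sketch; the helper name accepts is made up for the illustration:

```python
from pyparsing import ParseException, dblQuotedString

# Backslash escapes only, all on one physical line (raw string literal).
escaped = r'"\n this is a valid \t \n string"'
# A real line break between the opening and closing quotes.
multiline = '"this is not\na valid string"'

def accepts(text):
    """Hypothetical helper: True if dblQuotedString matches at the start of text."""
    try:
        dblQuotedString.parseString(text)
        return True
    except ParseException:
        return False

print(accepts(escaped))
print(accepts(multiline))
```

The first string should be accepted and the second rejected, since the quoted-string expression does not match across a real line break.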

2. If restOfLine is still giving you \r's at the end, you can redefine
restOfLine to not include them, or to include and suppress them.  Or (this
is easier) define a parse action for restOfLine that strips trailing \r's:

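def stripTrailingCRs(st,loc,toks):
    # return the matched text without trailing \r's; if there are no
    # tokens, return None so that pyparsing keeps the results unchanged
    if toks:
        return toks[0].rstrip('\r')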

restOfLine.setParseAction( stripTrailingCRs )
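
Here is a small usage sketch of that parse action. The input string and the names cleanRest and identLine are invented for the example, and I attach the action to a copy of restOfLine so that the shared object in the pyparsing module is left alone:

```python
from pyparsing import Word, alphas, alphanums, restOfLine

def stripTrailingCRs(st, loc, toks):
    # Return the matched text without any trailing "\r" characters.
    return toks[0].rstrip("\r")

# copy() gives a private expression; the module-level restOfLine is unchanged.
cleanRest = restOfLine.copy().setParseAction(stripTrailingCRs)
identLine = Word(alphas, alphanums + "_") + cleanRest

# Made-up input with Windows-style line endings.
tokens = identLine.parseString("somefunction( a b c )\r\nnext line\r\n").asList()
print(tokens)
```

restOfLine stops before the "\n", so without the parse action the second token would end in "\r".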


3.  How does it know to only do it for the first line?  Presumably you told
it to do so.  pyparsing's parseString method starts at the beginning of the
input string, and matches expressions until it finds a mismatch, or runs out
of expressions to match - even if there is more input string to process,
pyparsing does not continue.  To search through the whole file looking for
idents, try using scanString which returns a generator; for each match, the
generator gives a tuple containing:
- tokens - the matched tokens
- start - the start location of the match
- end - the end location of the match

If your input file consists *only* of these constructs, you can also just
expand dataFormat.parseString to OneOrMore(dataFormat).parseString.
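
To illustrate both options from point 3, here is a sketch using scanString and OneOrMore. The input texts are made up, and LPAR, RPAR and quoteList are again assumed stand-ins for the definitions from earlier in the thread:

```python
from pyparsing import (Forward, Group, Literal, OneOrMore, Suppress, Word,
                       ZeroOrMore, alphas, alphanums, dblQuotedString)

# Assumed stand-ins for the definitions from earlier in the thread.
LPAR = Literal("(")
RPAR = Literal(")")
quoteList = Group(Suppress("(") + OneOrMore(dblQuotedString) + Suppress(")"))

ident = Forward()
ident << Group(Word(alphas, alphanums) + LPAR
               + ZeroOrMore(dblQuotedString | ident | Word(alphas, alphanums))
               + RPAR)
dataFormat = ident + (dblQuotedString | quoteList)

# Made-up input: two records with an unrelated line between them.
mixed = r'''fprintf( outFile "first\n" )
("a" "b")
some other text that is not a record
printf( "second" )
"just one string"
'''

# scanString skips text that does not match and yields (tokens, start, end).
found = []
for tokens, start, end in dataFormat.scanString(mixed):
    found.append((tokens[0][0], mixed[start:end]))
    print(start, end, tokens.asList())

# Made-up input consisting only of records, so OneOrMore can consume all of it.
recordsOnly = r'''fprintf( outFile "first\n" )
("a" "b")
printf( "second" )
"just one string"
'''
allTokens = OneOrMore(dataFormat).parseString(recordsOnly).asList()
print(allTokens)
```

scanString is the more forgiving choice when the file contains other material; OneOrMore with parseString simply stops at the first thing that is not a dataFormat.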


-- Paul




