[Tutor] pyparsing complex search

Thu Jul 16 23:57:24 CEST 2009

Pedro -

If you are trying to extract a simple pattern like a numeric word followed
by an alpha word, I would suggest using one of the scanString or
searchString methods.  scanString is probably the better choice, since you
seem to need not only the matching tokens, but also the location within the
input string where the match occurs.
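
As a rough illustration of the difference, here is a minimal sketch (written for Python 3, using a shortened sample string made up for this example). It assumes nothing beyond the pattern described above: searchString hands back only the matched tokens, while scanString also reports where each match starts and ends.

```python
from pyparsing import Word, nums, alphas

# Shortened sample text, made up for this illustration
sample = "23 different size pry bars\nhammer the 75 pitchfork"
pattern = Word(nums) + Word(alphas)

# searchString collects every match, but gives only the tokens
found = pattern.searchString(sample).asList()
print(found)

# scanString yields (tokens, start, end) for each match
located = [(toks.asList(), start, end)
           for toks, start, end in pattern.scanString(sample)]
print(located)
```

The start and end values are offsets into the whole string, which is why helpers such as lineno and col are handy for turning them into line and column numbers.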

scanString(sourcestring) is a generator function.  For every match,
scanString yields a tuple of three values:
- tokens - the matched tokens themselves (as a ParseResults object)
- startloc - starting location of the matched tokens
- endloc - ending location of the matched tokens (the position just past
  the last matched character)

Here is your test program, rewritten to use scanString (I inserted the
number '75' in the source string so you can see the results of having
multiple matches):

from pyparsing import Word, nums, alphas, lineno, col, line

data = """23 different size pry bars
hammer the 75 pitchfork
pound felt paper staple the felt paper every to inches staple hammer"""

numbrword = Word(nums)
alphaword = Word(alphas)

# scanString yields (tokens, start location, end location) for each match
for match, locn, endloc in (numbrword + alphaword).scanString(data):
    num, wrd = match
    print("Found '%s/%s' on line %d at column %d" %
          (num, wrd, lineno(locn, data), col(locn, data)))
    print("The full line of text was:")
    print("'%s'" % line(locn, data))
    # col() is 1-based; the extra space lines up with the opening quote above
    print(" " * col(locn, data) + "^")
    print()

With this output:

Found '23/different' on line 1 at column 1
The full line of text was:
'23 different size pry bars'
 ^

Found '75/pitchfork' on line 2 at column 12
The full line of text was:
'hammer the 75 pitchfork'
            ^

Look at the difference between this program and your latest version.  The
pattern you parsed for is simply "OneOrMore(Word(alphanums))", which matches
every word in the input, not just a number followed by a word.  To debug a
parse action, try using the decorator that comes with pyparsing,
traceParseAction, like this:

@traceParseAction
def reportACertainWord( st , locn , toks ):

and it will print out the tokens passed into and out of the parse action.
It would have shown you that you weren't just matching ['23', 'different'],
but the entire input string.
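
To see this for yourself, here is a minimal sketch (Python 3) that attaches a traced parse action to the OneOrMore(Word(alphanums)) pattern.  The name report_tokens and its do-nothing body are made up for the example and stand in for your own parse action; the trace should be written to stderr.

```python
from pyparsing import OneOrMore, Word, alphanums, traceParseAction

@traceParseAction
def report_tokens(st, locn, toks):
    # Hypothetical stand-in for a real parse action: it changes nothing,
    # so the only visible effect is the trace printed by the decorator.
    return None

pattern = OneOrMore(Word(alphanums)).setParseAction(report_tokens)
result = pattern.parseString("23 different size pry bars")

# The parse action was handed every word on the line, not just two
print(result.asList())
```

The traced token list contains all five words, which is the clue that the pattern is broader than "number followed by word".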

Using scanString is a little simpler than writing a parse action.  Also,
since scanString is a generator, it will return matches as it finds them.
So if you have this file:

123 paper clips
... 10Gb of intervening non-matching text...
456 paper planes

scanString will return the match for '123 paper' right away, before
processing the intervening 10Gb of non-matching text.
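
A small sketch of that behavior (Python 3, with a made-up filler string standing in for the 10Gb of text): calling next() on the generator gives the first match without scanning the rest.  Note that scanString still takes a string, so in this sketch the text is already in memory; what is deferred is the matching work, not the reading.

```python
from pyparsing import Word, nums, alphas

# Made-up stand-in for a large input with matches at both ends
filler = "no digits on this line\n" * 1000
text = "123 paper clips\n" + filler + "456 paper planes\n"

matches = (Word(nums) + Word(alphas)).scanString(text)

# Pull just the first match; the filler has not been scanned yet
tokens, start, end = next(matches)
print(tokens.asList(), start, end)

# Continuing the iteration scans the remainder and finds the second match
rest = [toks.asList() for toks, s, e in matches]
print(rest)
```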

Best of luck in your pyparsing efforts!

-- Paul