Is pyparsing really a recursive descent parser?

Paul McGuire ptmcg at austin.rr.com
Sat Nov 3 11:49:31 EDT 2007


On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
<ihates... at hotmail.com> wrote:
>     It has recursion in it but that's not sufficient to call it a recursive
> descent parser any more than having a recursive implementation of the
> factorial function.  The important part is what it recurses through...

<snip>

> In my opinion, the rule set I mentioned in my original post:
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
>
>     ...should be translated (internally) to something like this:
>
> word = Word(alphas)
> grammar = Forward()
> grammar << ((word + grammar) | (word + Literal(end)))
>
>     This allows a recursive search to find the string correct without any
> need for "backtracking," if I understand what you mean by that.  Couldn't
> pyparsing have done something like this?
>

Dear JAVotAM -

This was a great post!  (Although I'm not sure the comment in the
first paragraph was entirely fair - I know the difference between
recursive string parsing and recursive multiplication.)

You really got me thinking more about how this recursion actually
behaves, especially with respect to elements such as OneOrMore.  Your
original question is really quite common, and up until now, my stock
answer has been to use negative lookahead.  The expression you posted
is the first time I've seen this solution, and it *does* work.

I was all set to write a cocky response on why your expression
wouldn't work.  I've seen it many times before, where people (usually
coming from EBNF experience) implement listOfXs = OneOrMore(x) as:

listOfXs = Forward()
listOfXs << ( x + listOfXs | x )

Actually, what they usually write is:

listOfXs << ( listOfXs + x )

but this sends pyparsing into a recursive tailspin.

So I fired up SciTE and copy/pasted your code into the editor and ran
it, and it worked just fine - this was a shock!  I studied this for a
few minutes, and then realized what was happening.  First of all, I
misread what you posted.  You posted this:

grammar << ((word + grammar) | (word + Literal(end)))

which works.  I *thought* you posted this:

grammar << ((word + grammar) | word) + Literal(end)

which doesn't work.  In fact this behaves the same as your original
post, except it iterates over the input string's words recursively,
vs. repetition ins a for-loop, as is done by OneOrMore.

So going back to your working version, I had to see why it works.
Initially, the first term in the MatchFirst (the '|' operator creates
MatchFirst instances) is just the same, and by grammar referencing
itself, it just goes word by word through the input trying to find a
match.  I'll try to visualize the steps:

level    "First Second Third end"
1         word  grammar
2               word   grammar
3                      word  grammar
4                            word grammar <- fails!
4                            word end <- fails!
                             (therefore level 3 grammar fails!)
3                      word  end    <-- success!!!

grammar has 2 options: to match a word followed by a grammar, or to
match a word followed by 'end'.  At 4 levels deep into the Forward's
recursion, the first option fails, because after parsing "end" as the
initial word, there is no more text to try to match against grammar.
Level 4's Forward then also tries to match a word followed by 'end',
but this fails for the same reason.  So at this point, the 4th level
Forward fails to match either of its options, so it throws its
exception back up to level 3, indicating that the first alternative,
word followed by grammar, failed.  Level 3 then moves on to see if
word followed by the literal 'end' matches, and it does - success!

Here's where I am stuck now.  In the original grammar that you posted,
which you want to render into this behavior, grammar is defined as:

grammar = OneOrMore(Word(alphas)) + Literal('end')

Unfortunately, when the OneOrMore gets constructed, it does not have
any visibility beyond knowing what is to be repeated.  Again, here is
the data structure that is being built:

- And
  - OneOrMore
    - Word(alphas)
  - Literal('end')

Only at the level of the And is there any awareness that the OneOrMore
is followed by anything, let alone by something which could be
misinterpreted as something matching the OneOrMore's repetition
expression.

Can you suggest a way I could generalize this, so that OneOrMore stops
matching before it gets to 'end'?

-- Paul




More information about the Python-list mailing list