Is pyparsing really a recursive descent parser?

Fri Nov 2 08:47:38 EDT 2007

On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
>
> Pyparsing is no recursive descent parser.  It doesn't go back in the input
> stream.  The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when it
> can't get more, the parser moves to the ``Literal('end')`` part which
> fails because the 'end' is already gone.
>
> >  Is there a way to get pyparsing to parse a grammar like this?
>
> Negative lookahead maybe:
>
> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
>            + Literal('end'))
>
> Ciao,
>         Marc 'BlackJack' Rintsch- Hide quoted text -
>
> - Show quoted text -

Well I'll be darned!  All this time, I thought "recursive descent"
described the recursive behavior of the parser, which pyparsing
definitely has.  I never knew that backing up in the face of parse
mismatches was a required part of the picture.

In pyparsing, each construction gets composed from more fine-grained
constructions, and they are called recursively to match or not match
against the input string.

For example, taking your working parser example:

grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
           + Literal('end'))

This creates the following data structure:

- And
  - OneOrMore
    - And
      - NotAny
        - Literal('end')
      - Word(alphas)
  - Literal('end')

Every instance in this structure derives from the base class
ParserElement, which defines a method parse().  parse() in turn calls
preParse(), parseImpl(), and postParse().  If parseImpl succeeds, it
returns a ParseResults object and the next location within the input
string to continue parsing.

The parseImpl methods are most often overridden in the subclasses (a
few override postParse as well), such as:
- And.parseImpl invokes parse() (note the recursion) on each of the
expressions in its list.  All must succeed or And.parseImpl throws an
exception.
- OneOrMore.parseImpl invokes parse() on its contained expression,
which must succeed at least once; if not, the exception is rethrown.
If the contained expression succeeds once, then its parse() method is
called again and again until it fails, but no exceptions are rethrown,
since only one match was actually required.
- NotAny inverts the success/failure of its contained expression.  If
the expression's parse() method succeeds, NotAny.parseImpl throws an
exception.  If the contained expression's parse() method throws an
exception, NotAny returns successfully.  (NotAny does not advance the
parse location, nor does it return any tokens.)
- Literal and Word are terminals, in that they do not invoke any other
expression's parse() method.  Literal.parseImpl tests whether its
string exists at the current parse location, and throws an exception
if it doesn't.  Word.parseImpl tests whether the current parse
location contains a letter in the Word instance's set of valid initial
characters - if so success; if not, throws an exception.  It then
advances through the input string, matching characters in the Word
instance's set of valid body characters.  The entire matched string is
then returned, along with an updated string index at which to continue
parsing.

In my concept of "recursive descent" parsing, I was under the
impression that pyparsing's use of this data structure, and
*recursive* calls of parse() as it *descends* through the data
structure, constituted a recursive descent parser.  What the OP
requested was a more regular expression-ish matcher, with backup (or
backtracking).

Your example did not include either of the alternation classes,
MatchFirst or Or.  These classes implement a form of backtracking on
the stack, that if we descend into an expression on the list of
alternatives, and that expression fails, then MatchFirst or Or will
back up to the same starting location and try the next alternative.
(MatchFirst is an "eager" matcher, in that it will pick the first
matching expression in its list - Or is an "exhaustive" matcher, in
that it will evaluate all of the alternatives, and select the longest
match.)

The Wikipedia entry for "Recursive Descent Parser" describes this
parser model as a "predictive parser", and later goes on to say that
some (uncited) authors equate "predictive parser" with "recursive
descent parsers".  The article makes a special distinction for a
"recursive parser with backup", which is what I believe the OP was
asking for.  Yes, pyparsing is definitely *not* a "recursive descent
parser with backup."  The solution, as you correctly posted, is to add
a negative lookahead.  NotAny is also represented using the '~'
operator.

By the way, there are a couple of other places where pyparsing
diverges from the standard model:
- implicit whitespace skipping
- user-definable comment skipping
- attachment of parse actions to expressions
These features, while atypical, provide some added capability and ease-
of-use in creating quick and simple parsers.

So I will take my stance with the uncited authors who lump predictive
parsers in with recursive descent parsers.

-- Paul