Is pyparsing really a recursive descent parser?

Sat Nov 3 04:32:23 EDT 2007

On Nov 3, 6:33 am, "Just Another Victim of the Ambient Morality"
<ihates... at hotmail.com> wrote:
> "Paul McGuire" <pt... at austin.rr.com> wrote in message
>
> news:1194007658.200713.178210 at 57g2000hsv.googlegroups.com...
>
>
>
> > On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
>
> >> Pyparsing is no recursive descent parser.  It doesn't go back in the
> >> input
> >> stream.  The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when
> >> it
> >> can't get more, the parser moves to the ``Literal('end')`` part which
> >> fails because the 'end' is already gone.
>
> >> >  Is there a way to get pyparsing to parse a grammar like this?
>
> >> Negative lookahead maybe:
>
> >> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
> >>            + Literal('end'))
>
> >> Ciao,
> >>         Marc 'BlackJack' Rintsch- Hide quoted text -
>
> >> - Show quoted text -
>
> > Well I'll be darned!  All this time, I thought "recursive descent"
> > described the recursive behavior of the parser, which pyparsing
> > definitely has.  I never knew that backing up in the face of parse
> > mismatches was a required part of the picture.
>
>     It has recursion in it but that's not sufficient to call it a recursive
> descent parser any more than having a recursive implementation of the
> factorial function.  The important part is what it recurses through...
>
>
>
> > In pyparsing, each construction gets composed from more fine-grained
> > constructions, and they are called recursively to match or not match
> > against the input string.
>
> > For example, taking your working parser example:
>
> > grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
> >           + Literal('end'))
>
> > This creates the following data structure:
>
> > - And
> >  - OneOrMore
> >    - And
> >      - NotAny
> >        - Literal('end')
> >      - Word(alphas)
> >  - Literal('end')
>
> > Every instance in this structure derives from the base class
> > ParserElement, which defines a method parse().  parse() in turn calls
> > preParse(), parseImpl(), and postParse().  If parseImpl succeeds, it
> > returns a ParseResults object and the next location within the input
> > string to continue parsing.
>
> > The parseImpl methods are most often overridden in the subclasses (a
> > few override postParse as well), such as:
> > - And.parseImpl invokes parse() (note the recursion) on each of the
> > expressions in its list.  All must succeed or And.parseImpl throws an
> > exception.
> > - OneOrMore.parseImpl invokes parse() on its contained expression,
> > which must succeed at least once; if not, the exception is rethrown.
> > If the contained expression succeeds once, then its parse() method is
> > called again and again until it fails, but no exceptions are rethrown,
> > since only one match was actually required.
> > - NotAny inverts the success/failure of its contained expression.  If
> > the expression's parse() method succeeds, NotAny.parseImpl throws an
> > exception.  If the contained expression's parse() method throws an
> > exception, NotAny returns successfully.  (NotAny does not advance the
> > parse location, nor does it return any tokens.)
> > - Literal and Word are terminals, in that they do not invoke any other
> > expression's parse() method.  Literal.parseImpl tests whether its
> > string exists at the current parse location, and throws an exception
> > if it doesn't.  Word.parseImpl tests whether the current parse
> > location contains a letter in the Word instance's set of valid initial
> > characters - if so success; if not, throws an exception.  It then
> > advances through the input string, matching characters in the Word
> > instance's set of valid body characters.  The entire matched string is
> > then returned, along with an updated string index at which to continue
> > parsing.
>
>     Thank you for the detailed description of pyparsing's implementation.
>
> > In my concept of "recursive descent" parsing, I was under the
> > impression that pyparsing's use of this data structure, and
> > *recursive* calls of parse() as it *descends* through the data
> > structure, constituted a recursive descent parser.  What the OP
> > requested was a more regular expression-ish matcher, with backup (or
> > backtracking).
>
>     In my concept of "recursive descent" parsing, I was under the impression
> that one should recurse through all rule combinations to ensure that the
> grammar is fully applied.  As I have mentioned before, merely having
> recursion in your algorithm is insufficient.  What you recurse through is
> key.  pyparsing recurses through rules but what's important is to recurse
> through rule combinations.

I think the confusing aspect of pyparsing for someone like me coming
from an (E)BNF and formal language background is that scanning and
parsing are merged into one. pyparsing is scannerless and where a
tokenizer such as Pythons performs a longest match tokenization but
interprets CFGs correctly ( in the bounds of the parsers power ) one
has to disambiguate grammars in pyparsing where you expect a CFG to be
established.

Note that I don't like pyparsings approach and it is clearly not for
everyone. I rather tend to make the experience that token definitions
can be reused across different languages. I count 65 token for Python
and I added 10 token for an experimental C++ parser and deactivated
some others. The distance between both languages on a lexical level is
not that huge.

What pyparsing has been established in the Python community is
something different IMO: readable and verbose pattern matching syntax
which is not regexp line noise. Around 8 years ago when I started
using Python I found a little module called reverb.py. It was actually
nothing but a front end for regular expressions. It didn't make any
career but I think it is the only one survived the anagrammatical
release hopping from 1.5.2 to 2.5.1 in my own case:

from reverb import*
>>> p = required(required(alphanum)+whitespace)+text("end")
>>> p
<reverb.raw instance at 0x00C0D260>
>>> RE(p).pattern
'(?:\\w+\\s)+end'