Is pyparsing really a recursive descent parser?

Sun Nov 4 04:49:20 EST 2007

On 4 Nov., 03:07, Neil Cerutti <horp... at yahoo.com> wrote:
> On 2007-11-04, Just Another Victim of the Ambient Morality
>
>
>
> <ihates... at hotmail.com> wrote:
>
> > "Neil Cerutti" <horp... at yahoo.com> wrote in message
> >news:oz3Xi.39812$G23.23308 at newsreading01.news.tds.net...
> >> On 2007-11-03, Paul McGuire <pt... at austin.rr.com> wrote:
> >>> On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
> >>><ihates... at hotmail.com> wrote:
> >>>>     It has recursion in it but that's not sufficient to call it a
> >>>> recursive
> >>>> descent parser any more than having a recursive implementation of the
> >>>> factorial function.  The important part is what it recurses through...
>
> >>><snip>
>
> >>>> In my opinion, the rule set I mentioned in my original post:
>
> >>>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>
> >>>>     ...should be translated (internally) to something like this:
>
> >>>> word = Word(alphas)
> >>>> grammar = Forward()
> >>>> grammar << ((word + grammar) | (word + Literal(end)))
>
> >>>>     This allows a recursive search to find the string correct without
> >>>> any
> >>>> need for "backtracking," if I understand what you mean by that.
> >>>> Couldn't
> >>>> pyparsing have done something like this?
>
> >>> Dear JAVotAM -
>
> >>> This was a great post!  (Although I'm not sure the comment in the
> >>> first paragraph was entirely fair - I know the difference between
> >>> recursive string parsing and recursive multiplication.)
>
> >>> You really got me thinking more about how this recursion actually
> >>> behaves, especially with respect to elements such as OneOrMore.  Your
> >>> original question is really quite common, and up until now, my stock
> >>> answer has been to use negative lookahead.  The expression you posted
> >>> is the first time I've seen this solution, and it *does* work.
>
> >> Is there not an ambiguity in the grammar?
>
> >> In EBNF:
>
> >>  goal --> WORD { WORD } END
>
> >>  WORD is '[a-zA-Z]+'
> >>  END is 'end'
>
> >> I think it is fine that PyParsing can't guess what the composer
> >> of that grammar meant.
>
> > First, I don't know if that constitutes an ambiguity in the
> > grammar. 'end' is a word but it's unambiguous that this grammar
> > must end in a literal 'end'.  You could interpret the input as
> > just a sequence of words or you could interpret it as a
> > sequence of words terminated by the word 'end'.
>
> Yeah. If it were a regex, it would be: '[ab]+b'. That is a fine
> regex, because a regex is generally just a recognizer; the
> ambiguity doesn't have to do with recognizing the language.  But
> PyParsing is parser. The ambiguity is in deciding what's a
> Word(alphas) and what's an 'end'.
>
> > One interpretation conforms to the grammar while the other
> > doesn't. You would assume that the interpretation that agrees
> > with the grammar would be the preferable choice and so should
> > the program. Secondly, even if it is an ambiguity... so what?
> > pyparsing's current behaviour is to return a parse error,
> > pretending that the string can't be parsed.  Ideally, perhaps
> > it should alert you to the ambiguity but, surely, it's better
> > to return _a_ valid parsing than to pretend that the string
> > can't be parsed at all...
>
> I wouldn't characterize it as pretending. How would you parse:
>
>   hello end hello end
>
> "WORD END WORD END" and "WORD WORD WORD END" are both valid
> interpretations, according to the grammar.
>
> As soon as you remove the ambiguity from the grammar, PyParsing
> starts to work correctly.

I think you are correct about this. But I'm not sure how much it shall
matter. Just take a look at Pythons Grammar

http://svn.python.org/view/python/trunk/Grammar/Grammar?rev=55446&view=markup

Without special keyword treatment this grammar would be ambigous and
couldn't be parsed using an LL(1) parser. The "grammar compiler" which
builds the parser tables creates a special label for each keyword.
This label is filtered when a NAME token is feeded into the parser.
With the label that belongs to e.g. 'if' or 'while' the correct
statement can be selected in constant time. Same happens when I use
the parser generator with your EBNF grammar. With a little more
adaption also NUMBER token could be filtered. But this would be
overdesign.

Theoretical beauty is compromised here using reasonable default
assumptions for keeping the grammar simple ( "convention over
configuration" to borrow a Rails slogan ).

Tokenization is another issue in Python. It is indeed somewhat special
due to significant whitespace and line continuation but tokenization
is conceptually much simpler and one would likely throw all kinds of
options and special case handling in the lexical analysis phase.