Is pyparsing really a recursive descent parser?

Kay Schluehr kay.schluehr at gmx.net
Mon Nov 5 02:47:44 EST 2007


On Nov 5, 3:05 am, "Just Another Victim of the Ambient Morality"
<ihates... at hotmail.com> wrote:
> "Kay Schluehr" <kay.schlu... at gmx.net> wrote in message
>
> news:1194223837.104424.242360 at o3g2000hsb.googlegroups.com...
>
> > On Nov 4, 10:44 pm, "Just Another Victim of the Ambient Morality"
> > <ihates... at hotmail.com>
>
> >> I believe there is a cure and it's called recursive descent parsing.
> >> It's slow, obviously, but it's correct and, sometimes (arguably, often),
> >> that's more important than execution speed.
>
> > Recursive descent parsing is not necessarily slow, but from your
> > remarks above I infer you want a general RD parser with backtracking:
> > when one rule doesn't match, try another one to derive the current
> > symbol in the input stream.
>
>     I think I've just discovered a major hurdle in my understanding of the
> problem.
>     You keep saying "with backtracking."  Why?  Isn't "backtracking"
> inherent in recursion?  So, why can't these alleged "recursive descent
> parsers" find valid parsings?  How are they not already backtracking?  What
> was the point of being recursive if not to take advantage of the inherent
> backtracking in it?
>     Obviously, these parsers aren't recursing through what I think they
> should be recursing.  The question is "why not?"

Backtracking and RD parsing are two different issues. An RD parser
works from a grammar that is structured like a tree. One feeds a start
symbol into the parser, and the parser tries to "derive" a symbol of
the input stream by descending along the tree of non-terminals.
For each non-terminal the parser calls itself, because the grammar is
essentially a recursive data structure it iterates through.
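As a concrete illustration, here is a minimal RD parser of my own (a
toy, not pyparsing's or anyone else's implementation): one function
per non-terminal, each calling the functions for the non-terminals it
derives. The grammar is `expr: term ('+' term)*` and `term: DIGIT+`.

```python
def parse(tokens):
    """Toy recursive descent parser for:  expr: term ('+' term)*"""
    pos = 0

    def term():
        # term: DIGIT+  -- consume one number token
        nonlocal pos
        if pos < len(tokens) and tokens[pos].isdigit():
            value = int(tokens[pos])
            pos += 1
            return value
        raise SyntaxError("expected a number at position %d" % pos)

    def expr():
        # expr: term ('+' term)*  -- the function mirrors the rule
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] == '+':
            pos += 1
            value += term()
        return value

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input at position %d" % pos)
    return result

print(parse(['1', '+', '2', '+', '3']))   # -> 6
```

Each grammar rule maps directly onto one function, which is why RD
parsers are so pleasant to write by hand.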

Backtracking comes into play when the grammar is not properly "left
factored". This doesn't necessarily mean it is ambiguous. Take for
example the following grammar:

S: A A 'foo' | A A 'bar'

The parser cannot immediately decide whether to take the left or the
right branch of the RHS. So it will check out the left branch, and when
it fails to derive 'foo' in the input stream it tries to select the
right branch. But the grammar is *not* ambiguous: there is always a
unique parse tree that can be derived ( or none ).
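A sketch of that backtracking for the grammar above (assuming A simply
matches the token 'a'; the post leaves A unspecified): on failure the
parser discards its tentative position and retries from the same point
with the next alternative.

```python
def parse_S(tokens, pos=0):
    """Backtracking match of  S: A A 'foo' | A A 'bar'  (A: 'a')."""
    for tail in ('foo', 'bar'):            # try the alternatives in order
        p = pos                            # tentative position
        ok = True
        for expected in ('a', 'a', tail):  # A A tail
            if p < len(tokens) and tokens[p] == expected:
                p += 1
            else:
                ok = False                 # backtrack: throw p away
                break
        if ok:
            return p                       # position after the match
    raise SyntaxError("no alternative of S matched at position %d" % pos)

print(parse_S(['a', 'a', 'bar']))   # -> 3, second branch, after backtracking
```

Note that the `A A` prefix is parsed twice on the failing input: that
re-parsing is exactly the cost left factoring (or memoization) avoids.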

Parser theory is mostly concerned with finding strategies to avoid
backtracking ( and resolving ambiguities ) because it slows parsers
down. One always wants parser effort that depends linearly on the
size of the input stream.
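One such strategy (my sketch, not something from the post) is packrat
parsing: memoize each (rule, position) result, so even a backtracking
RD parser does linear work in the input size, at the cost of memory.
Reusing the toy grammar and the assumed `A: 'a'` rule from above:

```python
from functools import lru_cache

def make_parser(tokens):
    """Packrat-style sketch: cache each (rule, position) result."""
    tokens = tuple(tokens)          # immutable, so positions are stable

    @lru_cache(maxsize=None)
    def match_A(pos):
        # A: 'a'  -- return the next position, or None on failure
        return pos + 1 if pos < len(tokens) and tokens[pos] == 'a' else None

    @lru_cache(maxsize=None)
    def match_S(pos):
        # S: A A 'foo' | A A 'bar'
        # The A A prefix is computed once; the second alternative
        # gets the cached results instead of re-parsing.
        for tail in ('foo', 'bar'):
            p = match_A(pos)
            p = match_A(p) if p is not None else None
            if p is not None and p < len(tokens) and tokens[p] == tail:
                return p + 1
        return None

    return match_S

match_S = make_parser(['a', 'a', 'bar'])
print(match_S(0))   # -> 3
```

With the cache in place the parser never derives the same non-terminal
at the same position twice, which is where the linear bound comes from.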


> > I'm not sure one needs to start again with a naive approach just to
> > avoid any parser theory. For a user of a parser it is quite important
> > whether she has to wait 50 seconds for a parse to run or 50
> > milliseconds. I don't like to compromise speed for implementation
> > simplicity here.
>
>     This attitude is all too prevalent among computer professionals...  Of
> course it's a useful thing to shield users from the intricacies of parser
> theory!

Yes, but people who develop parser generators should know at least a
bit about it  ;)

> Just as much as it is useful to shield drivers from needing
> automotive engineering or software users from programming.  How many people
> have come to this newsgroup asking about anomalous pyparsing behaviour,
> despite their grammars being mathematically correct?
>     Think of it this way.  You can force all the clients of pyparsing to
> duplicate work on figuring out how to massage pyparsing to their grammars,
> or you can do the work of getting pyparsing to solve people's problems,
> once.  That's what a library is supposed to do...
>     Finally, I can't believe you complain about potential speed problems.
> First, depending on the size of the string, it's likely to be the difference
> between 2ms and 200ms.  Secondly, if speed were an issue, you wouldn't go
> with a recursive descent parser.

ANTLR is recursive descent, and so are packrat parsers. Python uses an
extremely optimized, table-based recursive descent parser for parsing
the language. The parser flies, and it is not LALR. Grammars are more
accessible in EBNF style and simpler to write, so many people ( like
me ) prefer RD parsers and seek ways to optimize them. As always,
there is a separation of concerns: the software engineering aspects of
modularization, interface abstraction, and library / framework
implementation, and the algorithmic aspects. I'd like to see both
addressed by a parser generator architecture.




