Is pyparsing really a recursive descent parser?

Neil Cerutti horpner at yahoo.com
Sun Nov 4 07:54:34 EST 2007


On 2007-11-04, Kay Schluehr <kay.schluehr at gmx.net> wrote:
> On 4 Nov., 03:07, Neil Cerutti <horp... at yahoo.com> wrote:
>> I wouldn't characterize it as pretending. How would you parse:
>>
>>   hello end hello end
>>
>> "WORD END WORD END" and "WORD WORD WORD END" are both valid
>> interpretations, according to the grammar.
>>
>> As soon as you remove the ambiguity from the grammar, PyParsing
>> starts to work correctly.
>
> I think you are correct about this. But I'm not sure how much
> it should matter. Just take a look at Python's Grammar:
>
> http://svn.python.org/view/python/trunk/Grammar/Grammar?rev=55446&view=markup
>
> Without special keyword treatment this grammar would be
> ambiguous and couldn't be parsed using an LL(1) parser.

I agree. I don't know how easy it is to create keywords using
PyParsing, or if the grammar in question would still be
considered correct by the author.

> The "grammar compiler" which builds the parser tables creates a
> special label for each keyword. This label is filtered when a
> NAME token is feeded into the parser. With the label that
> belongs to e.g. 'if' or 'while' the correct statement can be
> selected in constant time. Same happens when I use the parser
> generator with your EBNF grammar. With a little more adaption
> also NUMBER token could be filtered. But this would be
> overdesign.
>
> Theoretical beauty is compromised here using reasonable default
> assumptions for keeping the grammar simple ( "convention over
> configuration" to borrow a Rails slogan ).

Keywords are practically ubiquitous. I never thought of them as
unbeautiful before.
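
To make that label trick concrete, here is a rough sketch (mine,
not Kay's actual table-driven code; the names are invented):

    # Hypothetical keyword filtering: the tokenizer emits generic
    # NAME tokens, and a lookup table rewrites keyword spellings
    # into dedicated labels before the parser sees them, so the
    # parser can dispatch on the label in constant time.
    KEYWORDS = {'if', 'while', 'for', 'def'}

    def label(token_type, value):
        if token_type == 'NAME' and value in KEYWORDS:
            return ('KEYWORD_' + value.upper(), value)
        return (token_type, value)

    print(label('NAME', 'if'))    # ('KEYWORD_IF', 'if')
    print(label('NAME', 'spam'))  # ('NAME', 'spam')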

> Tokenization is another issue in Python. It is indeed somewhat
> special due to significant whitespace and line continuations,
> but tokenization is conceptually much simpler, and one would
> likely throw all kinds of options and special-case handling
> into the lexical analysis phase.
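
(As an aside, Python's own stdlib tokenize module makes that
phase visible; a quick illustration of mine, not Kay's code:)

    import io
    import tokenize

    # Significant whitespace becomes explicit INDENT/DEDENT
    # tokens before parsing, keeping the grammar itself simple.
    src = "if x:\n    y = 1\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok[0]], repr(tok[1]))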

It might be a quick fix in PyParsing, which already includes a
Keyword type, though not with the semantics needed in this case.
Instead you have to use (as suggested earlier) negative
lookahead, either in a Regex or with the NotAny type.

>>> from pyparsing import OneOrMore, NotAny, Literal, Word, alphas
>>> goal = OneOrMore(NotAny(Literal('end')) + Word(alphas)) + Literal('end')
>>> goal.parseString('hello hello hello end')
(['hello', 'hello', 'hello', 'end'], {})
>>> goal.parseString('hello end hello end')
(['hello', 'end'], {})
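
The Regex spelling of the same lookahead would be something like
this (my sketch of the alternative, not from the original post):

>>> from pyparsing import Regex
>>> word = Regex(r'(?!end\b)[a-zA-Z]+')
>>> goal2 = OneOrMore(word) + Literal('end')
>>> goal2.parseString('hello hello hello end')
(['hello', 'hello', 'hello', 'end'], {})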

No scanner/tokenizer needed! ;)

-- 
Neil Cerutti


