PyParsing and Headaches

Wed Nov 22 16:42:26 EST 2006

"Bytter" <bytter at gmail.com> wrote in message 
news:1164223072.219676.157790 at j44g2000cwa.googlegroups.com...
> Hi,
>
> I'm trying to construct a parser, but I'm stuck with some basic
> stuff... For example, I want to match the following:
>
> letter = "A"..."Z" | "a"..."z"
> literal = letter+
> include_bool := "+" | "-"
> term = [include_bool] literal
>
> So I defined this as:
>
> literal = Word(alphas)
> include_bool = Optional(oneOf("+ -"))
> term = include_bool + literal
>
> The problem is that:
>
> term.parseString("+a") -> (['+', 'a'], {}) # OK
> term.parseString("+ a") -> (['+', 'a'], {}) # KO. It shouldn't
> recognize any token since I didn't said the SPACE was allowed between
> include_bool and literal.
>

As Chris pointed out in his post, the most direct way to fix this is to use 
Combine.  Note that Combine does two things: it requires the expressions to 
be adjacent, and it combines the results into a single token.  For instance, 
when defining the expression for a real number, something like:

realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

Pyparsing would parse "3.14159" into the separate tokens ['3', '.', 
'14159'] (the unmatched optional sign contributes no token).  For this 
grammar, pyparsing would also accept "2. 23" as ['2', '.', '23'], even 
though there is a space between the decimal point and "23".  But by 
wrapping it inside Combine, as in:

realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))

we accomplish two things: pyparsing only matches if all the elements are 
adjacent, with no whitespace or comments; and the matched token is returned 
as ['3.14159'].  (Yes, I left off scientific notation, but it is an 
extension of the same issue.)
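
Here is a minimal sketch of the difference, using the realnum expression from above.  The names loose_realnum, tight_realnum and tight_accepts_gap are just illustrative, and the results noted in the comments are what I would expect from a current pyparsing release.

```python
from pyparsing import Combine, Optional, ParseException, Word, nums, oneOf

# Without Combine: whitespace is skipped between the pieces,
# and each piece comes back as its own token.
loose_realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

# With Combine: the pieces must be adjacent, and the match is
# returned as a single token.
tight_realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))

loose_tokens = loose_realnum.parseString("2. 23").asList()     # ['2', '.', '23']
tight_tokens = tight_realnum.parseString("-3.14159").asList()  # ['-3.14159']

# The Combine'd version should reject the embedded space.
try:
    tight_realnum.parseString("2. 23")
    tight_accepts_gap = True
except ParseException:
    tight_accepts_gap = False
```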

Pyparsing in general does implicit whitespace skipping; it is part of the 
zen of pyparsing, and distinguishes it from conventional regexps (although I 
think there is a new '?' switch for re's that puts '\s*'s between re terms 
for you).  This is to simplify the grammar definition, so that it doesn't 
need to be littered with "optional whitespace or comments could go here" 
expressions; instead, whitespace and comments (or "ignorables" in pyparsing 
terminology) are parsed over before every grammar expression.  I instituted 
this out of recoil from a previous project, in which a co-developer 
implemented a boolean parser by first tokenizing by whitespace, then parsing 
out the tokens.  Unfortunately, this meant that "color=='blue' && 
size=='medium'" would not parse successfully, instead requiring "color == 
'blue' && size == 'medium'".  It doesn't seem like much, but our support 
guys got many calls asking why the boolean clauses weren't matching.  I 
decided that when I wrote a parser, "y=m*x+b" would be just as parseable as 
"y = m * x + b".  For that matter, you'd be surprised where whitespace and 
comments sneak into people's source code: spaces after left parentheses and 
comments after semicolons, for example, are easily forgotten when spec'ing 
out the syntax for a C "for" statement; whitespace inside HTML tags is 
another unanticipated surprise.
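
To illustrate the implicit whitespace skipping, here is a small sketch built around the "y=m*x+b" example.  The equation expression is made up for this demonstration; it is not a general expression parser.

```python
from pyparsing import Word, alphas

# A deliberately simple grammar: identifier = identifier * identifier + identifier
ident = Word(alphas)
equation = ident + "=" + ident + "*" + ident + "+" + ident

# Whitespace is skipped before each expression, so both spellings
# should produce the same list of tokens.
compact = equation.parseString("y=m*x+b").asList()
spaced = equation.parseString("y = m * x + b").asList()
```

Nowhere in the grammar is whitespace mentioned, yet both input strings parse to the same tokens.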

So looking at your grammar, you say you don't want to have this be a 
successful parse:
    term.parseString("+ a") -> (['+', 'a'], {})

because, "It shouldn't recognize any token since I didn't said the SPACE was 
allowed between include_bool and literal."  In fact, pyparsing allows spaces 
by default; that is why the given parse succeeds.  I would turn this question 
around, and ask you in terms of your grammar - what SHOULD be allowed 
between include_bool and literal?  If spaces are not a problem, then your 
grammar as-is is sufficient.  If spaces are absolutely verboten, then there 
are 2 or 3 different techniques in pyparsing to disable the 
whitespace-skipping behavior, depending on whether you want all whitespace 
skipping disabled, just for literals of a certain type, or just for literals 
when following a leading include_bool sign.
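
As a rough sketch of two of these techniques applied to your grammar (the names accepts, combined_term, strict_literal and strict_term are only for illustration): Combine requires adjacency and merges the tokens, while calling leaveWhitespace() on just the literal requires adjacency but keeps the sign and the literal as separate tokens.  There is also a class-level setting, ParserElement.setDefaultWhitespaceChars, for changing what counts as skippable whitespace globally, but I have left it out here since it affects every expression defined after it.

```python
from pyparsing import Combine, Optional, ParseException, Word, alphas, oneOf

def accepts(expr, text):
    """Return True if expr parses text, False if it raises ParseException."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False

# Technique 1: Combine. No whitespace allowed inside; result is one token.
combined_term = Combine(Optional(oneOf("+ -")) + Word(alphas))

# Technique 2: leaveWhitespace() on the literal only. The literal must
# start exactly where the previous expression ended; tokens stay separate.
strict_literal = Word(alphas).leaveWhitespace()
strict_term = Optional(oneOf("+ -")) + strict_literal

combined_tokens = combined_term.parseString("+a").asList()  # ['+a']
strict_tokens = strict_term.parseString("+a").asList()      # ['+', 'a']

space_ok_combined = accepts(combined_term, "+ a")
space_ok_strict = accepts(strict_term, "+ a")
```

With either version, "+ a" should no longer parse; which one to choose depends on whether you want '+a' back as a single token or as two.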

Thanks for giving pyparsing a try; if you want further help, you can post 
here, or on the pyparsing wiki - the discussion threads on the Home page are 
a pretty good support and message log.

-- Paul