PyParsing and Headaches

Hugo Ferreira bytter at gmail.com
Wed Nov 22 17:06:31 EST 2006


Chris,

Thanks for your quick answer. That changes a lot of stuff, and now I'm able
to do my parsing as I intended to.

Paul,

Thanks for your detailed explanation. One of the things I think is missing
from the documentation (or that I couldn't find easily) is the kind of
explanation you give about 'The Way of PyParsing'. For example, it took me a
while to understand that I could easily implement simple recursions using
OneOrMore(Group()). Or maybe it was all out there and I didn't search hard
enough...
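
In case it helps anyone else who lands on this thread, here is a minimal
sketch of what I mean, reusing the term grammar from my original question.
The name "query" and the sample input are just illustrations I made up, and
I'm assuming the default whitespace skipping between terms is acceptable:

```python
from pyparsing import Group, OneOrMore, Optional, Word, alphas, oneOf

# Grammar from the original question.
literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))
term = include_bool + literal

# Hypothetical top-level expression: one or more terms, each kept in
# its own sublist by Group().
query = OneOrMore(Group(term))

parsed = query.parseString("+spam -eggs ham").asList()
print(parsed)
```

Each Group() keeps a term's sign and literal together, so the result comes
back as [['+', 'spam'], ['-', 'eggs'], ['ham']] rather than one flat list.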

Still, fwiw, congratulations on the library. PyParsing allowed me to do in
just a couple of hours, including learning its API (minus this little
inconvenience), what would have taken me a couple of days with, for example,
ANTLR (in fact, I've already put ANTLR aside more than once in the past in
favor of a built-from-scratch parser).

Cheers,

Hugo Ferreira

On 11/22/06, Paul McGuire <ptmcg at austin.rr._bogus_.com> wrote:
>
> "Bytter" <bytter at gmail.com> wrote in message
> news:1164223072.219676.157790 at j44g2000cwa.googlegroups.com...
> > Hi,
> >
> > I'm trying to construct a parser, but I'm stuck with some basic
> > stuff... For example, I want to match the following:
> >
> > letter = "A"..."Z" | "a"..."z"
> > literal = letter+
> > include_bool = "+" | "-"
> > term = [include_bool] literal
> >
> > So I defined this as:
> >
> > literal = Word(alphas)
> > include_bool = Optional(oneOf("+ -"))
> > term = include_bool + literal
> >
> > The problem is that:
> >
> > term.parseString("+a") -> (['+', 'a'], {}) # OK
> > term.parseString("+ a") -> (['+', 'a'], {}) # KO. It shouldn't
> > recognize any token since I didn't say the SPACE was allowed between
> > include_bool and literal.
> >
>
> As Chris pointed out in his post, the most direct way to fix this is to
> use Combine.  Note that Combine does two things: it requires the
> expressions to be adjacent, and it combines the results into a single
> token.  For instance, when defining the expression for a real number,
> something like:
>
> realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)
>
> Pyparsing would parse "3.14159" into the separate tokens ['3', '.',
> '14159'].  For this grammar, pyparsing would also accept "2. 23" as
> ['2', '.', '23'], even though there is a space between the decimal point
> and "23".  But by wrapping it inside Combine, as in:
>
> realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))
>
> we accomplish two things: pyparsing only matches if all the elements are
> adjacent, with no whitespace or comments; and the matched token is
> returned as ['3.14159'].  (Yes, I left off scientific notation, but it is
> an extension of the same issue.)
>
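
To check my understanding of the above, a small sketch comparing the two
realnum definitions. The names loose_realnum, strict_realnum and try_parse
are my own, not part of pyparsing, and the results in the comments are what
I'd expect from the pyparsing versions I know (where an unmatched Optional
contributes no token at all):

```python
from pyparsing import Combine, Optional, ParseException, Word, nums, oneOf

# Without Combine: whitespace is skipped between the pieces.
loose_realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

# With Combine: pieces must be adjacent and come back as one token.
strict_realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))


def try_parse(expr, text):
    """Hypothetical helper: return the token list, or None on failure."""
    try:
        return expr.parseString(text).asList()
    except ParseException:
        return None


loose_ok = try_parse(loose_realnum, "3.14159")    # ['3', '.', '14159']
loose_gap = try_parse(loose_realnum, "2. 23")     # ['2', '.', '23']
strict_ok = try_parse(strict_realnum, "3.14159")  # ['3.14159']
strict_gap = try_parse(strict_realnum, "2. 23")   # None (no match)
```
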
> Pyparsing in general does implicit whitespace skipping; it is part of the
> zen of pyparsing, and distinguishes it from conventional regexps (although
> I think there is a new '?' switch for re's that puts '\s*'s between re
> terms for you).  This is to simplify the grammar definition, so that it
> doesn't need to be littered with "optional whitespace or comments could go
> here" expressions; instead, whitespace and comments (or "ignorables" in
> pyparsing terminology) are parsed over before every grammar expression.  I
> instituted this out of recoil from a previous project, in which a
> co-developer implemented a boolean parser by first tokenizing by
> whitespace, then parsing out the tokens.  Unfortunately, this meant that
> "color=='blue' && size=='medium'" would not parse successfully, instead
> requiring "color == 'blue' && size == 'medium'".  It doesn't seem like
> much, but our support guys got many calls asking why the boolean clauses
> weren't matching.  I decided that when I wrote a parser, "y=m*x+b" would be
> just as parseable as "y = m * x + b".  For that matter, you'd be surprised
> where whitespace and comments sneak in to people's source code: spaces
> after left parentheses and comments after semicolons, for example, are
> easily forgotten when spec'ing out the syntax for a C "for" statement;
> whitespace inside HTML tags is another unanticipated surprise.
>
> So looking at your grammar, you say you don't want to have this be a
> successful parse:
>     term.parseString("+ a") -> (['+', 'a'], {})
>
> because, "It shouldn't recognize any token since I didn't say the SPACE
> was allowed between include_bool and literal."  In fact, pyparsing allows
> spaces by default; that's why the given parse succeeds.  I would turn this
> question around, and ask you in terms of your grammar - what SHOULD be
> allowed between include_bool and literal?  If spaces are not a problem,
> then your grammar as-is is sufficient.  If spaces are absolutely verboten,
> then there are 2 or 3 different techniques in pyparsing to disable the
> whitespace-skipping behavior, depending on whether you want all whitespace
> skipping disabled, just for literals of a certain type, or just for
> literals when following a leading include_bool sign.
>
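
For the archives, a sketch of two ways I know of to forbid the space, which
may or may not be the techniques Paul has in mind: calling leaveWhitespace()
on the literal, or wrapping the whole term in Combine. The helper "accepts"
and the names default_term, strict_term and combined_term are made up for
the example:

```python
from pyparsing import Combine, Optional, ParseException, Word, alphas, oneOf


def accepts(expr, text):
    """Hypothetical helper: True if expr matches at the start of text."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False


# Original grammar: whitespace between sign and literal is skipped.
default_term = Optional(oneOf("+ -")) + Word(alphas)

# Option 1: the literal refuses to skip leading whitespace.
strict_term = Optional(oneOf("+ -")) + Word(alphas).leaveWhitespace()

# Option 2: Combine requires adjacency and merges the tokens.
combined_term = Combine(Optional(oneOf("+ -")) + Word(alphas))

for name, expr in [("default", default_term),
                   ("strict", strict_term),
                   ("combined", combined_term)]:
    print(name, accepts(expr, "+a"), accepts(expr, "+ a"))
```

The leaveWhitespace() variant keeps '+' and 'a' as separate tokens, while
Combine merges them into '+a'. To turn skipping off everywhere there is also
ParserElement.setDefaultWhitespaceChars(""), but it changes a global
default, so I left it out of the sketch.
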
> Thanks for giving pyparsing a try; if you want further help, you can post
> here, or on the pyparsing wiki - the discussion threads on the Home page
> are a pretty good support and message log.
>
> -- Paul
>



-- 
GPG Fingerprint: B0D7 1249 447D F5BB 22C5  5B9B 078C 2615 504B 7B85