Distributions, RE-verb and the like

Paul McGuire ptmcg at austin.rr.com
Thu Dec 29 18:55:39 EST 2005


Oh, the pyparsing rendition of your initial pat expression would be
something like:

import pyparsing as pp
pat = pp.Combine( pp.oneOf("$ 0x 0X") + pp.Word(pp.hexnums,max=8) )

Combine is needed to ensure that the leading $, 0x, or 0X is
immediately followed by 1-8 (and no more than 8) hex digits.
Otherwise, pyparsing is pretty tolerant of whitespace cropping up
wherever.
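
To make the whitespace point concrete, here is a small sketch comparing
the same expression with and without Combine.  The names prefix, digits,
combined and loose are only for illustration, and the inputs are made up:

```python
import pyparsing as pp

prefix = pp.oneOf("$ 0x 0X")
digits = pp.Word(pp.hexnums, max=8)

# Combine requires the pieces to be adjacent and joins them into one token.
combined = pp.Combine(prefix + digits)
# Without Combine, whitespace between the pieces is skipped silently.
loose = prefix + digits

print(combined.parseString("0x1F").asList())   # ['0x1F']
print(loose.parseString("0x 1F").asList())     # ['0x', '1F']

# The Combine'd form should refuse the embedded space.
rejected = False
try:
    combined.parseString("0x 1F")
except pp.ParseException:
    rejected = True
print("rejected:", rejected)
```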

As for some of your other syntaxes:

I'm not sure what "Vre" means.

I found that "Alternative" needs to support both longest-match
("greedy") and first-match ("non-greedy") behavior, so I provided Or
and MatchFirst, respectively.  They are also definable using the '^'
and '|' operators, again respectively.  Finally, I
ran into Literal("this") | Literal("that") | Literal("other") so often
that I just added a helper method oneOf that would take the string
"this that other" and build the right expression out of it.  This too
is non-trivial, as you have to take care that some short literals may
mask longer ones in the list, as in oneOf("< = > <= >= !=").  Just
replacing this directly with Literal("<") | Literal("=") | ... would
prevent any matching of the ">=" or "<=" literals.  You could replace
with the Or (^) form, but this exhaustively checks all alternatives all
the time, a regrettable run-time performance penalty.  Pyparsing's
implementation of oneOf leaves the literals in the given order, unless
a duplicate is given, or an earlier literal masks a later one - in that
case, the longer literal is moved ahead of the shorter.
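
A quick sketch of the masking problem, using a deliberately bad ordering
of three literals (the variable names are just for this example):

```python
import pyparsing as pp

# MatchFirst (|): the first alternative that matches wins, so "<" masks "<=".
first = pp.Literal("<") | pp.Literal("=") | pp.Literal("<=")
# Or (^): every alternative is tried and the longest match wins.
longest = pp.Literal("<") ^ pp.Literal("=") ^ pp.Literal("<=")
# oneOf: takes care of the ordering so the longer literal is tried first.
ops = pp.oneOf("< = > <= >= !=")

print(first.parseString("<=").asList())    # ['<']
print(longest.parseString("<=").asList())  # ['<=']
print(ops.parseString("<=").asList())      # ['<=']
```

Note that the MatchFirst version does not fail here; it quietly matches
just the "<" and leaves the "=" unparsed, which is arguably worse.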

I implemented Optional as a wrapper-type class, as opposed to the
.optional() method that you have given - I'd say there are tradeoffs
either way, just making the comparison.

Your "repeated" or "times" seem to map roughly to pyparsing's OneOrMore
and ZeroOrMore.

Any thoughts on how a recursive grammar might look?

I don't find 'Interval' to be very easy on the eyes.  In this case, I
stole^H^H^H^H^H borrowed the re form of "[A-Za-z0-9]", providing a
method named srange ("s" is for "string") such that srange("[a-fA-F]")
would return the string "abcdefABCDEF".
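
For instance (a minimal sketch; note that the argument is written with
its enclosing brackets, just as it would be in an re character class,
and the name alnum_word is only for illustration):

```python
import pyparsing as pp

# srange expands an re-style character class into a plain string.
hex_letters = pp.srange("[a-fA-F]")
print(hex_letters)   # abcdefABCDEF

# The expanded string can be handed straight to Word.
alnum_word = pp.Word(pp.srange("[A-Za-z0-9]"))
print(alnum_word.parseString("abc123 rest").asList())   # ['abc123']
```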

The other end of this process has to do with how the calling program
will process the parsed results.  Once a grammar gets too deeply
nested, or has too many Optional elements, just returning a simple list
or nested list of tokens isn't enough.  Pyparsing returns ParseResults
objects, which can be accessed as a list, dictionary, or object with
attributes (provided individual fields have been given names at grammar
definition time).  I *have* had some complaints about ParseResults
("ParseResults are evil"), but the named access is a life-saver for
complex grammars.  (Simple case, the first token for your hex number is
an optional sign - without names, you can't just access field 2, say,
of the expression, you have to first test to see if the sign was
provided or not, and then access field 2 or 3 accordingly.  On the
other hand, if you had given field 2 a name, your parser would be more
robust, even if you later changed your grammar to include other
elements, such as a leading, um, currency symbol or something.)
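
Here is a sketch of that simple case.  The grammar and the field names
"sign", "prefix" and "digits" are assumptions made up for this example,
not anything taken from your code:

```python
import pyparsing as pp

# Calling an expression with a string attaches a results name to it.
sign = pp.Optional(pp.oneOf("+ -"))("sign")
prefix = pp.oneOf("$ 0x 0X")("prefix")
digits = pp.Word(pp.hexnums, max=8)("digits")
hexnum = sign + prefix + digits

with_sign = hexnum.parseString("-0x1F")
no_sign = hexnum.parseString("0x1F")

# By position, the digits move depending on whether the sign was present.
print(with_sign.asList())   # ['-', '0x', '1F']
print(no_sign.asList())     # ['0x', '1F']

# By name, the access is the same either way.
print(with_sign["digits"], no_sign["digits"])   # 1F 1F
print(with_sign.digits)                         # attribute-style access
print(no_sign.get("sign"))                      # None when the sign is absent
```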

Just some fodder for your reverb considerations...

-- Paul



