[Tutor] pattern expressions

Fri Nov 7 22:22:33 CET 2008

Paul McGuire wrote:
 > Question 1:
 > format_code	:= '+' | '-' | '*' | '#'
 > I need to specify that a single, identical, format_code code may be
 > repeated.
 > Not that there may be several different ones in a sequence.
 > format		:= (format_code)+
 > would catch '+-', which is wrong. I want only patterns such as '--',
 > '+++',...
 >
 >
 > This interpretation of '+' in your BNF is a bit out of the norm.  Usually
 > this notation 'format_code+' would accept 1 or more of any of your
 > format_code symbols, so '+-+--++' would match.
That's what I intended to write above: "(format_code)+ would catch '+-', which
is wrong." I need a pattern that matches a repetition of the same token, this
token being an item of a set. Of course, I could write a pattern for each
token... but it is supposed to be programming, not cooking ;-)
What I'm looking for is a format that may not exist:
format		:= (format_code)++
where '++' means 'repetition of an identical token'
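
For what it is worth, the '++' idea can be approximated with a regular
expression backreference: capture one code, then require more copies of that
same capture. This is only a minimal sketch with the standard re module; the
names FORMAT_CODES, format_re and is_format are made up for illustration.

```python
import re

# Assumed for illustration: the set of format codes from the grammar above.
FORMAT_CODES = "+-*#"

# Capture exactly one code, then allow zero or more copies of that same capture.
format_re = re.compile("([" + re.escape(FORMAT_CODES) + r"])\1*")

def is_format(text):
    """Return True if text is a single format code repeated one or more times."""
    return format_re.fullmatch(text) is not None

for sample in ("--", "+++", "#", "+-", "+-+--++"):
    print(repr(sample), is_format(sample))
```

The mixed sequences '+-' and '+-+--++' are rejected because the backreference
\1 only accepts the character that was captured first.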

 > In pyparsing, you could match things like '----' using the Word class and
 > specifying a string containing the single character '-':  Word('-').  That
 > is, parse a word made up of '-' characters.  There is no pyparsing construct
 > that exactly matches your (format_code)+ repetition, but you could use Word
 > and MatchFirst as in:
 >
 > format = MatchFirst(Word(c) for c in "+-*#")

That's it! I had not realized that, as pyparsing is real Python, one can also
use Python idioms /inside/ the grammar... good! Thank you. So it is also
possible to have variables, no? Then my question #2 should be solved, too.

 > A corresponding regular expression might be:
 > formatRE = '|'.join(re.escape(c)+'+' for c in "+-*#")
 >
 > which you could then parse using the re module, or wrap in a pyparsing Regex
 > object:
 >
 > format = Regex(formatRE)
 >
 >
 > Question 2:
 > style_code	:= '/' | '!' | '_'
 > Similar case, but different. I want patterns like:
 > styled_text	:= style plain_text style
 > where both style instances are identical. As the number of styles may grow
 > (and even be unpredictable: the style_code line will actually be written at
 > runtime according to a config file) I don't want, and anyway can't, specify
 > all possible kinds of styled_text. Even if possible, it would be ugly!
 >
 > pyparsing includes two methods to help you match the same text that was
 > matched before - matchPreviousLiteral and matchPreviousExpr.  Here is how
 > your example would look:
 >
 > plain_text = Word(alphanums + " ")
 > styled_text = style + plain_text + matchPreviousLiteral(style)
 >
 > (There is similar capability in regular expressions, too.)

Good, thank you again. Do you know if there is any way to express such things
in ordinary E/BNF, or in any dialect coming from BNF? It's like a variable
inside a pattern, and I personally have never seen that.
Pattern variables would also be very helpful as (said before) I need to write
or at least reconfigure the grammar at runtime.
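
The regular-expression counterpart of such a pattern variable is a named
group with a backreference. Below is a rough sketch, assuming the style codes
arrive as a plain string read from the config file; make_styled_text_re and
parse_styled are hypothetical helper names, and the plain_text character set
simply mirrors alphanumerics plus space.

```python
import re

def make_styled_text_re(style_codes):
    """Build a pattern for: style plain_text style, with both styles identical."""
    style = "[" + re.escape(style_codes) + "]"
    return re.compile(
        r"(?P<style>" + style + r")"   # opening style code, remembered by name
        r"(?P<text>[A-Za-z0-9 ]+)"     # plain text: alphanumerics and spaces
        r"(?P=style)"                  # closing code must equal the opening one
    )

# Assumed to be read from a config file at runtime.
styled_text_re = make_styled_text_re("/!_")

def parse_styled(text):
    """Return (style, plain_text) if text is styled text, else None."""
    m = styled_text_re.fullmatch(text)
    return (m.group("style"), m.group("text")) if m else None

print(parse_styled("/hello world/"))
print(parse_styled("/hello world!"))
```

Since the pattern is built from a string, adding a style only means changing
the config value, not the grammar code.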

 > Question 3:
 > I would like to specify a "side-condition" for a pattern, meaning that it
 > should only match when a specific token lies aside. For instance:
 > A	:= A_pattern {X}
 > X is not part of the pattern, thus should not be extracted. If X is just
 > "garbage", I can write an enlarged pattern, then drop it later:
 > A	:= A_pattern
 > A_X	:= A X
 >
 > I think you might be looking for some kind of lookahead.  In pyparsing, this
 > is supported using the FollowedBy class.
 >
 > A_pattern = Word(alphas)
 > X = Literal(".")
 > A = A_pattern + FollowedBy(X).leaveWhitespace()
 >
 > print A.searchString("alskd sldjf sldfj. slfdj . slfjd slfkj.")
 >
 > prints
 >
 > [['sldfj'], ['slfkj']]

I guess there is the same for left-side conditions. I'm going to search myself.
This guy who develops pyparsing thinks of everything. There are so many helper
functions and processing methods -- how can you know all of that by heart, Paul?
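
On the regular-expression side, both kinds of side-condition exist as
lookahead and lookbehind assertions. A small sketch with the standard re
module follows; the "@" marker for the left-side case is an invented example,
not something from the grammar discussed here.

```python
import re

# Right-side condition: a word immediately followed by ".", which is not consumed.
followed_by_dot = re.compile(r"[A-Za-z]+(?=\.)")

# Left-side condition: a word immediately preceded by "@" (hypothetical marker).
preceded_by_at = re.compile(r"(?<=@)[A-Za-z]+")

def words_before_dot(text):
    """Return the words that sit directly before a '.' character."""
    return followed_by_dot.findall(text)

def words_after_at(text):
    """Return the words that sit directly after an '@' character."""
    return preceded_by_at.findall(text)

print(words_before_dot("alskd sldjf sldfj. slfdj . slfjd slfkj."))
print(words_after_at("abc @def ghi@jkl"))
```

In both cases the neighbouring token constrains the match without being part
of the extracted text.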

Denis