parenthesis

Mon Nov 4 17:05:11 EST 2002

On 4 Nov 2002 12:24:31 -0800, mis6 at pitt.edu (Michele Simionato) wrote:

>Suppose I want to parse the following expression:
>
>>>> exp='(a*(b+c*(2-x))+d)+f(s1)'
>
>I want to extract the first part, i.e. '(a*(b+c*(2-x))+d)'.
>
>Now if I use a greedy regular expression
>
>>>> import re; greedy=re.compile('\(.*\)')
>
>I obtain too much, the full expression:
>
>>>> match=greedy.search(exp); match.group()
>
>'(a*(b+c*(2-x))+d)+f(s1)'
>
>On the other hand, if I use a nongreedy regular expression
>
>>>> nongreedy=re.compile('\(.*?\)')
>
>I obtain too little:
>
>>>> match=nongreedy.search(exp); match.group()
>
>'(a*(b+c*(2-x)'
>
>Is there a way to specify a clever regular expression able to match
>the first parenthesized group? What I did was to write a routine
>to extract the first parenthesized group:
>
>def parenthesized_group(exp):
>    nesting_level,out=0,[]
>    for c in exp:
>        out.append(c)
>        if c=='(': nesting_level+=1
>        elif c==')': nesting_level-=1
>        if nesting_level==0: break
>    return ''.join(out)
>
>>>> print parenthesized_group(exp)
>
>(a*(b+c*(2-x))+d)
>
>Still, this does not seem to me the best way to go, and I would like to know
>if this can be done with a regular expression. Notice that I don't need
>to track all the nesting levels of the parentheses; for me it is enough
>to recognize the end of the first parenthesized group.
>
>Obviously, I would like a general recipe valid for more complicated
>expressions: in particular I cannot assume that the first group ends
>right before a mathematical operator (like '+' in this case) since
>these expressions are not necessarily mathematical expressions (as the
>example could wrongly suggest). In general I have expressions of the
>form
>
>( ... contains nested expressions with parenthesis... )...other stuff
>
>where other stuff may contain nested parentheses. I can assume that
>there are no errors, i.e. that all the internal open parentheses are
>matched by closing parentheses.
>
>Is this a problem which can be tackled with regular expressions ?
>
Well, regular expressions don't count, so if you want to count you have to
throw in something extra. E.g., you could do this: insert a delimiter after
the right paren that closes each top-level group, and then split on the
delimiter. Probably not wonderfully efficient, and I am just duplicating
what you did, except that the regex separates the chunks for me.

 >>> import re
 >>> rx = re.compile(r'([()]|[^()]*)')
 >>> class Addelim:
 ...     def __init__(self): self.parens=0
 ...     def __call__(self, m):
 ...         s = m.group(1)
 ...         if s=='(': self.parens+=1
 ...         if self.parens==1 and s==')':
 ...             self.parens=0
 ...             return s+'\x00'
 ...         if s==')': self.parens -=1
 ...         return s
 ...
 >>> for e in rx.sub(Addelim(),exp).split('\x00'): print e
 ...
 (a*(b+c*(2-x))+d)
 +f(s1)

Where exp was
 >>> exp
 '(a*(b+c*(2-x))+d)+f(s1)'
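
If only the first group is wanted, the same counting idea can probably be
written as a plain function that lets the regex find the parens and stops
as soon as the count returns to zero. This is only a rough sketch:
first_group is a made-up name, and print is called as a function, which
newer Python versions require.

```python
import re

def first_group(text):
    """Return the first balanced parenthesized group in text, or None."""
    depth = 0
    start = None
    # The regex only locates the parens; the counting is done by hand.
    for m in re.finditer(r'[()]', text):
        if m.group() == '(':
            if depth == 0:
                start = m.start()   # remember where the group opens
            depth += 1
        else:
            if depth == 0:
                continue            # stray ')' before any '(': skip it
            depth -= 1
            if depth == 0:
                return text[start:m.end()]
    return None                     # no group, or the group never closes

exp = '(a*(b+c*(2-x))+d)+f(s1)'
print(first_group(exp))
```

Unlike the delimiter trick, this does not need a character such as '\x00'
that is assumed never to occur in the input, and it stops scanning at the
end of the first group.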

Regards,
Bengt Richter