parenthesis

Tue Nov 5 13:10:34 EST 2002

> Wondering why I didn't just write:
> 
>  >>> import re
>  >>> rx = re.compile(r'([()]|[^()]+)')
>  >>> class Addelim:
>  ...     def __init__(self, delim):
>  ...         self.parens=0; self.delim=delim
>  ...     def __call__(self, m):
>  ...         s = m.group(1)
>  ...         if s=='(': self.parens+=1
>  ...         if self.parens==1 and s==')':
>  ...             self.parens=0
>  ...             return s+self.delim
>  ...         if s==')': self.parens -=1
>  ...         return s
>  ...
>  >>> exp =  '(a*(b+c*(2-x))+d)+f(s1)'
> 
> It was natural to be able to specify the delimiter. And the + is probably
> better than the * on the non-paren "[^()]+" part of the pattern.

Not really. My benchmark gives essentially the same timing for "[^()]+" and
"[^()]*", with no significant difference.

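For anyone who wants to repeat the comparison, here is a minimal sketch using
timeit. It is not the benchmark I actually ran: it is written in Python 3
syntax, the names rx_plus, rx_star and run are made up for illustration, and
the run count is arbitrary. It first checks that both patterns give the same
output, then prints the timings without claiming either one wins.

```python
import re
import timeit

class Addelim:
    """Append `delim` after each parenthesis that closes a top-level group."""
    def __init__(self, delim):
        self.parens = 0
        self.delim = delim

    def __call__(self, m):
        s = m.group(1)
        if s == '(':
            self.parens += 1
        if self.parens == 1 and s == ')':
            self.parens = 0
            return s + self.delim
        if s == ')':
            self.parens -= 1
        return s

exp = '(a*(b+c*(2-x))+d)+f(s1)'
rx_plus = re.compile(r'([()]|[^()]+)')   # the pattern from the quoted post
rx_star = re.compile(r'([()]|[^()]*)')   # the same pattern with * instead of +

def run(rx):
    # A fresh Addelim per call, because the object carries the nesting counter.
    return rx.sub(Addelim('\n'), exp)

# Both patterns should produce the same text; only the timing may differ.
assert run(rx_plus) == run(rx_star)

for name, rx in (('+', rx_plus), ('*', rx_star)):
    t = timeit.timeit(lambda: run(rx), number=10000)
    print('[^()]%s: %.3f s for 10000 runs' % (name, t))
```

The absolute numbers depend on the machine and the Python version, so only
the ratio between the two lines is worth looking at.
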
> Then using \n as delimiter to break into lines one can just print it.
> 
>  >>> print rx.sub(Addelim('\n'),exp)
>  (a*(b+c*(2-x))+d)
>  +f(s1)
> 
> Which you could also use like:
> 
>  >>> print rx.sub(Addelim('\n'),exp).splitlines()
>  ['(a*(b+c*(2-x))+d)', '+f(s1)']
> 
> Or to get back to your original requirement,
> 
>  >>> print rx.sub(Addelim('\n'),exp).splitlines()[0]
>  (a*(b+c*(2-x))+d)
> 
> But I suspect it would run faster to let a regex split the string and then use
> a loop like yours on the pieces, which would be '(' or ')' or some other string
> that you don't need to look at character by character. E.g.,
> 
>  >>> rx = re.compile(r'([()])')
>  >>> ss = rx.split(exp)
>  >>> ss
>  ['', '(', 'a*', '(', 'b+c*', '(', '2-x', ')', '', ')', '+d', ')', '+f', '(', 's1', ')', '']
> 
> Notice that the splitter matches wind up at the odd indices. I think that's generally true
> when you put parens around the splitting expression, to return the matches as part of the list,
> but I'm not 100% certain. Anyway, you could make use of that, something like:
> 
>  >>>
>  >>> parens = 0
>  >>> endix = []
>  >>> for i in range(1,len(ss),2):
>  ...     if parens==1 and ss[i]==')':
>  ...         parens=0; endix.append(i+1)
>  ...     elif ss[i]=='(': parens += 1
>  ...     else:            parens -= 1
>  ...
>  >>> endix
>  [12, 16]
> 
> You could break the loop like you did if you just want the first expression,
> or you could grab it by
> 
>  >>> print ''.join(ss[:endix[0]])
>  (a*(b+c*(2-x))+d)
> 
> or list the bunch,
> 
>  >>> lo=0
>  >>> for hi in endix:
>  ...     print ''.join(ss[lo:hi])
>  ...     lo = hi
>  ...
>  (a*(b+c*(2-x))+d)
>  +f(s1)
> 
> or whatever. Which is not as slick, but probably faster if you had to do a bi-ig bunch of them.
> 
> I think when the fenceposts are simple, but you are mainly interested in the data between, splitting
> on a fencepost regex and processing the resulting list can be simpler and faster than trying to
> do it all with a complex regex.
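
The split-and-loop idea above can also be packaged as a function. This is a
sketch in Python 3 syntax rather than the exact session quoted above: the name
top_level_groups is hypothetical, and it assumes the parentheses in the input
are balanced. It relies on re.split returning the separators at the odd
indices when the pattern contains exactly one capturing group.

```python
import re

def top_level_groups(text):
    """Cut `text` after each closing parenthesis that returns to depth 0."""
    # With one capturing group, the parentheses land at the odd indices.
    pieces = re.split(r'([()])', text)
    groups, depth, lo = [], 0, 0
    for i in range(1, len(pieces), 2):
        if pieces[i] == '(':
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                # Closing a top-level group: emit everything since the last cut.
                groups.append(''.join(pieces[lo:i + 1]))
                lo = i + 1
    tail = ''.join(pieces[lo:])
    if tail:
        groups.append(tail)  # text left over after the last balanced group
    return groups

print(top_level_groups('(a*(b+c*(2-x))+d)+f(s1)'))
```

Unlike the endix loop, this keeps any trailing text that follows the last
group as a final item instead of dropping it.
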
> 
> Regards,
> Bengt Richter

I strongly suspect that for this simple problem the simple approach is by far
the fastest.
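
To make that checkable rather than a matter of opinion, here is a rough sketch
that times a plain character loop against the split-based loop. Both functions
are reconstructions for illustration, not the code posted earlier in the
thread: the names first_group_loop and first_group_split are made up, the
syntax is Python 3, and both assume balanced parentheses and return only the
first top-level expression.

```python
import re
import timeit

def first_group_loop(text):
    """Scan character by character; stop at the parenthesis closing the first group."""
    depth = 0
    for i, c in enumerate(text):
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
            if depth == 0:
                return text[:i + 1]
    return text  # no balanced group found; hand the input back unchanged

rx = re.compile(r'([()])')

def first_group_split(text):
    """Same result via rx.split, looking only at the odd-index separators."""
    pieces = rx.split(text)
    depth = 0
    for i in range(1, len(pieces), 2):
        depth += 1 if pieces[i] == '(' else -1
        if depth == 0:
            return ''.join(pieces[:i + 1])
    return text

exp = '(a*(b+c*(2-x))+d)+f(s1)'
assert first_group_loop(exp) == first_group_split(exp) == '(a*(b+c*(2-x))+d)'

for fn in (first_group_loop, first_group_split):
    t = timeit.timeit(lambda: fn(exp), number=20000)
    print('%-18s %.3f s' % (fn.__name__, t))
```

Which one comes out ahead will depend on the input: a short expression like
this one favours whatever has the least setup cost, while long runs of text
between parentheses give the regex split more to gain.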

Bye,

                           Michele