Extract sentences in nested parentheses using Python

Tue Dec 3 08:41:18 EST 2019

On Tuesday, 3 December 2019 01:01:25 UTC+8, Peter Otten wrote:
> A S wrote:
> 
> I think I've seen this question before ;)
> 
> > I am trying to extract all strings in nested parentheses (along with the
> > parentheses themselves) in my .txt file. Please see the sample .txt file that
> > I have used in this example here:
> > (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
> > 
> > I have written three different versions of the code, but none of them seems
> > to be able to extract all the nested parentheses. They can only extract a
> > portion of the nested parentheses. Any advice on what I've done wrong
> > could really help!
> > 
> > Here are the three attempts I have made so far:
> > 
> > 1st attempt:
> > 
> > import re
> > from os.path import join
> > 
> > def balanced_braces(args):
> >     parts = []
> >     for arg in args:
> >         if '(' not in arg:
> >             continue
> 
> A line without "(" could still contain a ")" that you miss.
> 
> >         chars = []
> >         n = 0
> >         for c in arg:
> >             if c == '(':
> >                 if n > 0:
> >                     chars.append(c)
> >                 n += 1
> >             elif c == ')':
> >                 n -= 1
> >                 if n > 0:
> >                     chars.append(c)
> >                 elif n == 0:
> >                     parts.append(''.join(chars).strip())
> >                     chars = []
> >             elif n > 0:
> >                 chars.append(c)
> >     return parts
> 
> It's probably easier to understand and implement when you process the 
> complete text at once. Then arbitrary splits don't get in the way of your 
> quest for ( and ). You just have to remember the position of the first 
> opening ( and number of opening parens that have to be closed before you 
> take the complete expression:
> 
> level:  00011112222100
> text:   abc(def(gh))ij
>    when we are here^
>     we need^
> 
> A tentative implementation:
> 
> $ cat parse.py
> import re
> 
> NOT_SET = object()
> 
> def scan(text):
>     level = 0
>     start = NOT_SET
>     for m in re.compile("[()]").finditer(text):
>         if m.group() == ")":
>             level -= 1
>             if level < 0:
>                 raise ValueError("underflow: more closing than opening parens")
>             if level == 0:
>                 # outermost closing parenthesis:
>                 # deliver enclosed string including parens.
>                 yield text[start:m.end()]
>                 start = NOT_SET
>         elif m.group() == "(":
>             if level == 0:
>                 # outermost opening parenthesis: remember position.
>                 assert start is NOT_SET
>                 start = m.start()
>             level += 1
>         else:
>             assert False
>     if level > 0:
>         raise ValueError("unclosed parens remain")
> 
> 
> if __name__ == "__main__":
>     with open("lan sample text file.txt") as instream:
>         text = instream.read()
>     for chunk in scan(text):
>         print(chunk)
> $ python3 parse.py
> ("xE'", PUT(xx.xxxx.),"'")
> ("TRUuuuth")

Hello Peter! I tried this on my actual working files and it raised this error: "unclosed parens remain". In this case, how can I keep parsing my text files, extracting only the groups with balanced parentheses and ignoring those that are incomplete?
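
One possible way to get that behaviour, as a minimal sketch rather than a fix endorsed in the thread: keep a stack of the positions of every "(" still open, record a span each time a ")" closes one, and simply skip anything left unmatched at the end. The name scan_balanced is hypothetical, and the sketch assumes that "ignore those that are incomplete" means dropping stray ")" and unclosed "(" while still returning the largest balanced groups found, including ones nested inside an unclosed "(".

```python
import re


def scan_balanced(text):
    """Return the outermost balanced "(...)" groups in text.

    Unlike a strict scanner, this skips a stray ")" or an unclosed "("
    instead of raising ValueError.
    """
    open_positions = []  # offsets of "(" that have not been closed yet
    spans = []           # (start, end) of balanced groups found so far
    for m in re.finditer(r"[()]", text):
        if m.group() == "(":
            open_positions.append(m.start())
        elif open_positions:
            start = open_positions.pop()
            # Any span recorded after this "(" is nested inside the new
            # group, so it is replaced by the enclosing one.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, m.end()))
        # A ")" with nothing open is a stray closer and is ignored.
    return [text[start:end] for start, end in spans]


if __name__ == "__main__":
    for chunk in scan_balanced("x) (a(b)) (c (d) e"):
        print(chunk)
```

With the sample string above this prints "(a(b))" and "(d)": the leading ")" and the unclosed "(c" are skipped. Note that an unmatched "(" can still pair with a later ")" that was meant for something else, so the result is only as good as the input allows.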