text processing problem

Thu Apr 7 21:00:56 EDT 2005

Maurice LING wrote:
> Matt wrote:
> >
> >
> > Try this:
> > import re
> > my_expr = re.compile(r'(\w+) (\(\1\))')
> > s = "this is (is) a test"
> > print my_expr.sub(r'\1', s)
> > #prints 'this is a test'
> >
> > M@
> >
>
> Thank you Matt. It works out well. The only thing that gives it
> problems is cases such as "there  (there)", where there is more than
> one whitespace character between the word and the same bracketed
> word...
>
> Cheers
> Maurice

Maurice,
I'd HIGHLY suggest purchasing the excellent "Mastering Regular
Expressions" by Jeff Friedl
(http://www.oreilly.com/catalog/regex2/index.html).  Although it's
mostly geared towards Perl, it will answer all your questions about
regular expressions.  If you're going to work with regexes, this is a
must-have.

That being said, here's what the new regular expression should be with
a bit of instruction (in the spirit of teaching someone to fish after
giving them a fish ;-)   )

my_expr = re.compile(r'(\w+)\s*(\(\1\))')

Note the "\s*" in place of the single space " ".  The "\s" means "any
whitespace character" (equivalent to [ \t\n\r\f\v]).  The "*" following
it means "0 or more occurrences".  So this will now match:

"there  (there)"
"there (there)"
"there(there)"
"there                                          (there)"
"there\t(there)" (tab)
"there\t\t\t\t\t\t\t\t\t\t\t\t(there)"
etc.
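
To check the cases above, here is a small sketch that runs the revised
pattern over a few of them.  It assumes Python 3 (print as a function);
the names "samples" and "results" are just illustrative and not part of
the earlier snippet.

```python
import re

# Revised pattern: a word, zero or more whitespace characters,
# then the same word again inside parentheses.
my_expr = re.compile(r'(\w+)\s*(\(\1\))')

# Illustrative inputs taken from the list above, plus the original test string.
samples = [
    "there  (there)",        # two spaces
    "there (there)",         # one space
    "there(there)",          # no whitespace at all
    "there\t(there)",        # a tab
    "this is (is) a test",   # the string from the first example
]

# Replace each "word (word)" match with just the word (group 1).
results = [my_expr.sub(r'\1', s) for s in samples]

for before, after in zip(samples, results):
    print(repr(before), '->', repr(after))
```

One thing to watch for: the pattern has no word-boundary anchor, so it
can also match on the tail of a longer word ("this (is)" becomes
"this").  If that matters for your data, putting "\b" in front of the
first group is one possible way to tighten it.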

Hope that's helpful.  Pick up the book!

M@