text processing problem
Matt
matthew_shomphe at countrywide.com
Fri Apr 8 11:53:33 EDT 2005
Maurice LING wrote:
> Matt wrote:
> > I'd HIGHLY suggest purchasing the excellent Mastering Regular
> > Expressions by Jeff Friedl
> > (http://www.oreilly.com/catalog/regex2/index.html). Although it's
> > mostly geared towards Perl, it will answer all your questions about
> > regular expressions. If you're going to work with regexes, this is a
> > must-have.
> >
> > That being said, here's what the new regular expression should be,
> > with a bit of instruction (in the spirit of teaching someone to fish
> > after giving them a fish ;-) )
> >
> > my_expr = re.compile(r'(\w+)\s*(\(\1\))')
> >
> > Note the "\s*" in place of the single space " ". The "\s" means
> > "any whitespace character" (equivalent to [ \t\n\r\f\v]). The "*"
> > following it means "0 or more occurrences". So this will now match:
> >
> > "there (there)"
> > "there  (there)"
> > "there(there)"
> > "there   (there)"
> > "there\t(there)" (tab)
> > "there\t\t\t\t\t\t\t\t\t\t\t\t(there)"
> > etc.
> >
> > Hope that's helpful. Pick up the book!
> >
> > M@
> >
>
> Thanks again. I've read a number of tutorials on regular expressions,
> but it's something that I hardly used in the past, so I've gotten far
> too rusty.
>
> Before my post, I tried
> my_expr = re.compile(r'(\w+) \s* (\(\1\))') instead, but it doesn't
> work, so I'm a bit stumped......
>
> Thanks again,
> Maurice
Maurice,
Your regex failed because you have literal spaces around the "\s*".
This translates to "one space, followed by zero or more whitespace
characters, followed by one space". So your regex only matches when the
two text elements are separated by at least two whitespace characters,
the first and last of which must be plain spaces.
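If you want to see the difference for yourself, here's a minimal sketch
that runs both patterns over a few test strings. The names "flexible",
"strict" and "samples" are made up for this illustration; only the two
pattern strings come from this thread.

```python
import re

# Pattern suggested earlier in the thread: zero or more whitespace
# characters between the word and its parenthesised repeat.
flexible = re.compile(r'(\w+)\s*(\(\1\))')

# The attempted pattern: a literal space on each side of \s*, so at
# least two separator characters are required.
strict = re.compile(r'(\w+) \s* (\(\1\))')

# Hypothetical test strings with different separators.
samples = [
    "there(there)",      # no separator
    "there (there)",     # one space
    "there  (there)",    # two spaces
    "there\t(there)",    # one tab
    "there \t (there)",  # space, tab, space
]

for text in samples:
    # search() returns a match object or None; bool() turns that into
    # a plain True/False for printing.
    print(repr(text), bool(flexible.search(text)), bool(strict.search(text)))
```

The first pattern should match every sample, while the second should
only match the two-space and space-tab-space cases.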
This kind of demonstrates why regular expressions can drive you nuts.
I still suggest picking up the book; not because Jeff Friedl drove a
dump truck full of money up to my door, but because it specifically has
a use case like yours. So you get to learn & solve your problem at the
same time!
HTH,
M@