[Tutor] Simply Regular Expression or Matching question ?

Sat Jan 18 15:11:01 2003

Many thanks for the help on my 're' problem. I have read your solution
carefully, Magnus, and it was most helpful.

Cheers
Graeme Andrew
Kiwi

----- Original Message -----
From: "Magnus Lycka" <magnus@thinkware.se>
To: "Bob Gailer" <ramrom@earthling.net>; "Graeme Andrew"
<graemea@xtra.co.nz>; <tutor@python.org>
Sent: Sunday, January 19, 2003 4:34 AM
Subject: Re: [Tutor] Simply Regular Expression or Matching question ?

> At 07:08 2003-01-18 -0700, Bob Gailer wrote:
> >At 12:36 AM 1/19/2003 +1300, Graeme Andrew wrote:
> >>parse code that includes some delimited text in the form {abcd ...} .
> >>
> >>Is there a simple 're' that will return all occurrences of the text
> >>between the curly brackets as described above
> >
> > >>> import re
> > >>> s = 'a{b}\nc{d}'
> > >>> re.findall(r'{.*}', s)
> >['{b}', '{d}']
>
> Almost there, but this code has three problems.
>
> First of all, it doesn't find the text *between* the curly
> braces. It finds the curly braces *and* the text between.
>
> There are two more problems. See below:
>  >>> s = ''' bla bla { some text
> ... that spans more than one line }
> ... {this}
> ... {is}
> ... {ok}
> ... {several} blocks {on} one {line}
> ... '''
>  >>> re.findall(r'{.*}', s)
> ['{this}', '{is}', '{ok}', '{several} blocks {on} one {line}']
>
> As you see, it only finds patterns where both { and } are
> on the same line. Secondly, it will match the first { and
> the last } on the line, which is hardly what we want...
>
> Let's fix these things one at a time. The manual is our friend
> http://www.python.org/doc/current/lib/re-syntax.html
>
> We should first remove the curly braces... Look in the manual:
>
> (...)
> Matches whatever regular expression is inside the parentheses, and
> indicates the start and end of a group; the contents of a group can be
> retrieved after a match has been performed, and can be matched later in
> the string with the \number special sequence, described below. To match
> the literals "(" or ")", use \( or \), or enclose them inside a
> character class: [(] [)].
>
> So...
>  >>> re.findall(r'{(.*)}', s)
> ['this', 'is', 'ok', 'several} blocks {on} one {line']
>
> Now we find text *between braces*, not the braces themselves
> *and* text between. The parts outside the () are only there to
> provide context now, not to be part of what we extract.
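A small sketch of the difference between the whole match and the group, using Bob's short string from earlier in the thread. The names `whole` and `inner` are only for illustration, and the braces are escaped here purely for readability; the unescaped pattern in the thread behaves the same way on this input.

```python
import re

s = 'a{b}\nc{d}'

# finditer yields match objects, so both views of each match are available:
# group(0) is the whole match, group(1) is what the parentheses captured.
for m in re.finditer(r'\{(.*)\}', s):
    print(repr(m.group(0)), repr(m.group(1)))

# The whole matches, braces included.
whole = [m.group(0) for m in re.finditer(r'\{(.*)\}', s)]

# With exactly one group in the pattern, findall returns the group contents.
inner = re.findall(r'\{(.*)\}', s)
```

So with one group in the pattern, findall hands back only the captured part, which is why the braces disappear from the result.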
>
> Now, let's look at the problem with {}-blocks spanning several
> lines.
>
> "."
>      (Dot.) In the default mode, this matches any character except a
> newline. If the DOTALL flag has been specified, this matches any
> character including a newline.
>
> So, let's use the DOTALL flag. I think we need to compile a regular
> expression for that. (I always do that anyway.)
>
>  >>> pattern = re.compile(r'{(.*)}', re.DOTALL)
>  >>> pattern.findall(s)
> [' some text\nthat spans more than one line }\n{this}\n{is}\n{ok}\n{several} blocks {on} one {line']
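For readers on Python 3: compiling is one way to apply DOTALL, but not the only one. The sketch below assumes a current Python 3 interpreter, where re.findall also accepts a flags argument and the inline (?s) prefix switches DOTALL on inside the pattern itself. The sample string is made up for the illustration.

```python
import re

# Hypothetical sample text with one block that spans two lines.
s = "x { spans\ntwo lines } y"

# 1. Compile with the flag, as in the thread.
compiled = re.compile(r'\{(.*)\}', re.DOTALL).findall(s)

# 2. Pass the flag directly to the module-level function.
with_flag = re.findall(r'\{(.*)\}', s, re.DOTALL)

# 3. Turn DOTALL on inside the pattern with the inline (?s) prefix.
inline = re.findall(r'(?s)\{(.*)\}', s)
```

All three give the same list here; which one to use is mostly a matter of taste, though compiling once can be tidier when the same pattern is reused.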
>
> There we are. Now we just have to fix the last problem. But what's
> really the problem? Let's read some more. What are we doing?
>
> . means any character (including or excluding \n depending on re.DOTALL)
> * means 0 or more repetitions of whatever was before. ('.' in this case.)
>
> {.*} means start with { and end with }, with anything in between. This
> could be interpreted in two ways. One is to find as much as possible in
> .* that will satisfy the whole re, and that is what happens now. It's
> called a greedy match. The other option is to get as little as possible
> in .* that will still satisfy the requirement of a } after. That is what
> we want, right? Read the manual again.
>
> *?, +?, ??
> The "*", "+", and "?" qualifiers are all greedy; they match as much text
> as possible. Sometimes this behaviour isn't desired; if the RE <.*> is
> matched against '<H1>title</H1>', it will match the entire string, and
> not just '<H1>'. Adding "?" after the qualifier makes it perform the
> match in non-greedy or minimal fashion; as few characters as possible
> will be matched. Using .*? in the previous expression will match only
> '<H1>'.
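The manual's example can be tried directly. This is a minimal sketch; the variable names are only for illustration.

```python
import re

html = '<H1>title</H1>'

# Greedy: .* grabs as much as it can, so the match runs to the last '>'.
greedy = re.match(r'<.*>', html).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r'<.*?>', html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```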
>
> Seems simple...
>
>  >>> pattern = re.compile(r'{(.*?)}', re.DOTALL)
>  >>> pattern.findall(s)
> [' some text\nthat spans more than one line ', 'this', 'is', 'ok',
> 'several', 'on', 'line']
>
> Better like this?
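An alternative worth knowing, not the approach used in this thread: a negated character class, [^{}]*, matches any run of characters that are not braces. Since a newline is not a brace either, this needs neither DOTALL nor the non-greedy "?". A sketch under the assumption that the blocks are not nested:

```python
import re

s = ''' bla bla { some text
that spans more than one line }
{this}
{is}
{ok}
{several} blocks {on} one {line}
'''

# The non-greedy pattern from the thread, written with escaped braces.
lazy = re.findall(r'\{(.*?)\}', s, re.DOTALL)

# Negated character class: anything except braces, newlines included.
negated = re.findall(r'\{([^{}]*)\}', s)

# The two differ only when braces are nested.
nested = '{a{b}c}'
lazy_nested = re.findall(r'\{(.*?)\}', nested, re.DOTALL)
negated_nested = re.findall(r'\{([^{}]*)\}', nested)
```

On the sample text both patterns return the same seven strings. On the nested input the non-greedy pattern returns 'a{b' while the negated class returns the innermost 'b'. Neither one handles nesting properly.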
>
> Do you want more, some removal of whitespace for instance?
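On the whitespace question, one possible follow-up is to clean each extracted block with ordinary string methods after the match. This is only a sketch; the sample string and the names `blocks` and `normalized` are made up for the illustration.

```python
import re

# Hypothetical sample text; the first block has padding and a line break.
s = "x { some text\nthat spans lines } y {ok}"

raw = re.findall(r'\{(.*?)\}', s, re.DOTALL)

# strip() removes leading and trailing whitespace from each block.
blocks = [b.strip() for b in raw]

# split() followed by join() also collapses internal runs of whitespace,
# newlines included, into single spaces.
normalized = [' '.join(b.split()) for b in raw]
```

Doing the cleanup in plain Python keeps the pattern itself short; putting \s* inside the pattern next to the braces is another option.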
>
>
> --
> Magnus Lycka, Thinkware AB
> Alvans vag 99, SE-907 50 UMEA, SWEDEN
> phone: int+46 70 582 80 65, fax: int+46 70 612 80 65
> http://www.thinkware.se/  mailto:magnus@thinkware.se
>