[Tutor] Simply Regular Expression or Matching question ?

Sat Jan 18 11:38:26 2003

Thanks for expanding my solution into a really useful pattern. I've been 
working with re for under a month, and it definitely has a learning curve.

At 04:34 PM 1/18/2003 +0100, Magnus Lycka wrote:

>At 07:08 2003-01-18 -0700, Bob Gailer wrote:
>>At 12:36 AM 1/19/2003 +1300, Graeme Andrew wrote:
>>>parse code that includes some delimited text in the form {abcd ...} .
>>>
>>>Is there a simple 're' that will return all occurrences of the text between
>>>the curly brackets as described above
>>
>> >>> import re
>> >>> s = 'a{b}\nc{d}'
>> >>> re.findall(r'{.*}', s)
>>['{b}', '{d}']
>
>Almost there, but this code has three problems.
>
>First of all, it doesn't find the text *between* the curly
>braces. It finds the curly braces *and* the text between.
>
>There are two more problems. See below:
> >>> s = ''' bla bla { some text
>... that spans more than one line }
>... {this}
>... {is}
>... {ok}
>... {several} blocks {on} one {line}
>... '''
> >>> re.findall(r'{.*}', s)
>['{this}', '{is}', '{ok}', '{several} blocks {on} one {line}']
>
>As you see, it only finds patterns where both { and } are
>on the same line. Secondly, it will match the first { and
>the last } on the line, which is hardly what we want...
>
>Let's fix these things one at a time. The manual is our friend
>http://www.python.org/doc/current/lib/re-syntax.html
>
>We should first remove the curly braces... Look in the manual:
>
>(...)
>Matches whatever regular expression is inside the parentheses, and indicates
>the start and end of a group; the contents of a group can be retrieved after
>a match has been performed, and can be matched later in the string with the
>\number special sequence, described below. To match the literals "(" or ")",
>use \( or \), or enclose them inside a character class: [(] [)].
>
>So...
> >>> re.findall(r'{(.*)}', s)
>['this', 'is', 'ok', 'several} blocks {on} one {line']
>
>Now we find text *between braces*, not the braces themselves
>*and* text between. The things outside () are only there to
>provide context now, not to be part of what we extract.
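
A small sketch of what the parentheses do, using a made-up sample string
(`text` is not from the thread): group 0 is the whole match with the braces,
group 1 is only what the parentheses captured, and findall returns the group
rather than the whole match when the pattern has exactly one group.

```python
import re

text = 'x {this} y'  # made-up sample string, for illustration only

m = re.search(r'{(.*)}', text)
whole = m.group(0)   # the entire match, braces included
inner = m.group(1)   # only the part inside the parentheses

# With exactly one group in the pattern, findall returns that group's text.
found = re.findall(r'{(.*)}', text)

print(whole)
print(inner)
print(found)
```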
>
>Now, let's look at the problem with {}-blocks spanning several
>lines.
>
>"."
>     (Dot.) In the default mode, this matches any character except a newline.
>If the DOTALL flag has been specified, this matches any character including
>a newline.
>
>So, let's use the DOTALL flag. I think we need to compile a regular
>expression for that. (I always do that anyway.)
>
> >>> pattern = re.compile(r'{(.*)}', re.DOTALL)
> >>> pattern.findall(s)
>[' some text\nthat spans more than one line }\n{this}\n{is}\n{ok}\n{several} blocks {on} one {line']
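
Compiling is one way to apply the flag. As a sketch, assuming a Python version
where re.findall accepts a flags argument (Python 3 does), the flag can also be
passed directly, or embedded in the pattern as (?s). The sample string here is
a shortened, made-up stand-in for the one above.

```python
import re

s = 'bla { some text\nover two lines }\n{this}'  # shortened, made-up sample

compiled = re.compile(r'{(.*)}', re.DOTALL).findall(s)
direct = re.findall(r'{(.*)}', s, re.DOTALL)   # flag passed directly
inline = re.findall(r'(?s){(.*)}', s)          # flag embedded in the pattern

# All three spellings should behave the same way.
print(compiled == direct == inline)
print(compiled)
```

The match is still greedy, so it runs from the first { to the last }.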
>
>There we are. Now we just have to fix the last problem. But what's
>really the problem? Let's read some more. What are we doing?
>
>. means any character (including or excluding \n depending on re.DOTALL)
>* means 0 or more repetitions of whatever was before. ('.' in this case.)
>
>{.*} means: start with {, end with }, and match anything in between. This
>could be interpreted in two ways. One is to take as much as possible in .*
>while still satisfying the whole re, and that is what happens now. It's
>called a greedy match. The other is to take as little as possible in .*
>while still satisfying the requirement of a } after it. That is what we
>want, right? Read the manual again.
>
>*?, +?, ??
>The "*", "+", and "?" qualifiers are all greedy; they match as much text as
>possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched
>against '<H1>title</H1>', it will match the entire string, and not just
>'<H1>'. Adding "?" after the qualifier makes it perform the match in
>non-greedy or minimal fashion; as few characters as possible will be
>matched. Using .*? in the previous expression will match only '<H1>'.
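
The manual's example can be tried directly. A minimal sketch (the variable
names are mine, the patterns and the string are the ones quoted above):

```python
import re

html = '<H1>title</H1>'

greedy = re.match(r'<.*>', html).group()   # takes as much as possible
lazy = re.match(r'<.*?>', html).group()    # takes as little as possible

print(greedy)
print(lazy)
```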
>
>Seems simple...
>
> >>> pattern = re.compile(r'{(.*?)}', re.DOTALL)
> >>> pattern.findall(s)
>[' some text\nthat spans more than one line ', 'this', 'is', 'ok', 
>'several', 'on', 'line']
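
A different technique that is often used for this kind of delimiter, shown
here only as an alternative sketch and not as what the thread settled on: a
negated character class. [^}] matches any character except }, newline
included, so this version needs neither DOTALL nor the non-greedy qualifier.

```python
import re

s = ''' bla bla { some text
that spans more than one line }
{this}
{is}
{ok}
{several} blocks {on} one {line}
'''

# [^}]* cannot run past a closing brace, so each match stops at the
# first } after its opening {.
result = re.findall(r'{([^}]*)}', s)
print(result)
```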
>
>Better like this?
>
>Do you want more, some removal of whitespace for instance?
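
One possible way to handle the whitespace, as a sketch with a made-up sample
string: leave the pattern alone and clean up each block afterwards with
ordinary string methods.

```python
import re

s = 'bla { some text\nover two lines }\n{this} and { that }'  # made-up sample

pattern = re.compile(r'{(.*?)}', re.DOTALL)
blocks = pattern.findall(s)

# strip() removes leading and trailing whitespace from each block.
stripped = [block.strip() for block in blocks]

# split() followed by join() also collapses inner runs of whitespace,
# newlines included, into single spaces.
collapsed = [' '.join(block.split()) for block in blocks]

print(stripped)
print(collapsed)
```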
>
>
>--
>Magnus Lycka, Thinkware AB
>Alvans vag 99, SE-907 50 UMEA, SWEDEN
>phone: int+46 70 582 80 65, fax: int+46 70 612 80 65
>http://www.thinkware.se/  mailto:magnus@thinkware.se

Bob Gailer
mailto:ramrom@earthling.net
303 442 2625