How to do this with groupby (or otherwise)? (Was: iterblocks cookbook example)
Gerard Flanagan
grflanagan at yahoo.co.uk
Mon Jun 4 10:42:11 EDT 2007
On Jun 4, 1:52 pm, Gerard Flanagan <grflana... at yahoo.co.uk> wrote:
> On Jun 2, 10:47 pm, Raymond Hettinger <pyt... at rcn.com> wrote:
>
>
>
> > On Jun 2, 10:19 am, Steve Howell <showel... at yahoo.com> wrote:
>
> > > George Sakkis produced the following cookbook recipe,
> > > which addresses a common problem that comes up on this
> > > mailing list:
>
> > ISTM, this is a common mailing list problem because it is fun
> > to solve, not because people actually need it on a day-to-day basis.
>
> > In that spirit, it would be fun to compare several different
> > approaches to the same problem using re.finditer, itertools.groupby,
> > or the tokenize module. To get the ball rolling, here is one variant:
>
> > from itertools import groupby
>
> > def blocks(s, start, end):
> >     def classify(c, ingroup=[0], delim={start:2, end:3}):
> >         result = delim.get(c, ingroup[0])
> >         ingroup[0] = result in (1, 2)
> >         return result
> >     return [tuple(g) for k, g in groupby(s, classify) if k == 1]
>
> > print blocks('the <quick> brown <fox> jumped', start='<', end='>')
>
> > One observation is that groupby() is an enormously flexible tool.
> > Given a well crafted key= function, it makes short work of almost
> > any data partitioning problem.
>
> Can anyone suggest a function that will split text by paragraphs, but
> NOT if the paragraphs are contained within a [quote]...[/quote]
> construct. In other words, the following text should yield 3 blocks
> not 6:
>
> TEXT = '''
> Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
> Pellentesque dolor quam, dignissim ornare, porta et,
> auctor eu, leo. Phasellus malesuada metus id magna.
>
> [quote]
> Only when flight shall soar
> not for its own sake only
> up into heaven's lonely
> silence, and be no more
>
> merely the lightly profiling,
> proudly successful tool,
> playmate of winds, beguiling
> time there, careless and cool:
>
> only when some pure Whither
> outweighs boyish insistence
> on the achieved machine
>
> will who has journeyed thither
> be, in that fading distance,
> all that his flight has been.
> [/quote]
>
> Integer urna nulla, tempus sit amet, ultrices interdum,
> rhoncus eget, ipsum. Cum sociis natoque penatibus et
> magnis dis parturient montes, nascetur ridiculus mus.
> '''
>
> Other info:
>
> * don't worry about nesting
> * the [quote] and [/quote] mustn't be stripped.
>
> Gerard
(Sorry if I ruined the parent thread.) FWIW, I didn't get a groupby
solution but with some help from the Python Cookbook (O'Reilly), I
came up with the following:
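
import re

# A tag on a line of its own, e.g. "[quote]" or "[/quote]".
RE_START_BLOCK = re.compile(r'^\[[\w\s]*\]$')
RE_END_BLOCK = re.compile(r'^\[/[\w\s]*\]$')

def iter_blocks(lines):
    # Expects lines that keep their trailing newline (a file object or
    # text.splitlines(True)), so that blank lines satisfy isspace().
    block = []
    inblock = False
    for line in lines:
        if line.isspace():
            if inblock:
                # blank line inside [tag]...[/tag]: part of the block
                block.append(line)
            elif block:
                # blank line outside a tag: the paragraph is complete
                yield block
                block = []
        else:
            if RE_START_BLOCK.match(line):
                inblock = True
            elif RE_END_BLOCK.match(line):
                inblock = False
            block.append(line.lstrip())
    if block:
        yield block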
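
For comparison, a groupby version in the spirit of Raymond's stateful key function could be sketched roughly as follows. This is only a sketch: the name iter_blocks_groupby and its start/end parameters are made up for illustration, it compares against the literal [quote] and [/quote] tags rather than using the regexes, it does not lstrip the lines, and it assumes no nesting and lines that keep their trailing newline.

```python
from itertools import groupby

def iter_blocks_groupby(lines, start='[quote]', end='[/quote]'):
    # Hypothetical helper: yields one list of lines per paragraph,
    # treating everything between start and end as a single paragraph.
    state = {'inquote': False}

    def is_separator(line):
        # groupby calls this once per line, in order, so it can
        # track whether we are currently inside a quoted section.
        stripped = line.strip()
        if stripped == start:
            state['inquote'] = True
        elif stripped == end:
            state['inquote'] = False
        # Only blank lines outside a quoted section split paragraphs.
        return not stripped and not state['inquote']

    for sep, group in groupby(lines, is_separator):
        if not sep:
            yield list(group)

sample = ("para one\nstill one\n\n"
          "[quote]\nline a\n\nline b\n[/quote]\n\n"
          "para two\n")
blocks = list(iter_blocks_groupby(sample.splitlines(True)))
print(len(blocks))  # 3
```

The key function returns True for separator lines and False for everything else, so groupby alternates between runs of separators (discarded) and runs of paragraph lines (yielded). The tags stay in the output, as asked.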