How to do this with groupby (or otherwise)? (Was: iterblocks cookbook example)

Mon Jun 4 10:42:11 EDT 2007

On Jun 4, 1:52 pm, Gerard Flanagan <grflana... at yahoo.co.uk> wrote:
> On Jun 2, 10:47 pm, Raymond Hettinger <pyt... at rcn.com> wrote:
>
>
>
> > On Jun 2, 10:19 am, Steve Howell <showel... at yahoo.com> wrote:
>
> > > George Sakkis produced the following cookbook recipe,
> > > which addresses a common problem that comes up on this
> > > mailing list:
>
> > ISTM, this is a common mailing list problem because it is fun
> > to solve, not because people actually need it on a day-to-day basis.
>
> > In that spirit, it would be fun to compare several different
> > approaches to the same problem using re.finditer, itertools.groupby,
> > or the tokenize module.  To get the ball rolling, here is one variant:
>
> > from itertools import groupby
>
> > def blocks(s, start, end):
> >     def classify(c, ingroup=[0], delim={start:2, end:3}):
> >         result = delim.get(c, ingroup[0])
> >         ingroup[0] = result in (1, 2)
> >         return result
> >     return [tuple(g) for k, g in groupby(s, classify) if k == 1]
>
> > print blocks('the <quick> brown <fox> jumped', start='<', end='>')
>
> > One observation is that groupby() is an enormously flexible tool.
> > Given a well crafted key= function, it makes short work of almost
> > any data partitioning problem.
>
> Can anyone suggest a function that will split text by paragraphs, but
> NOT if the paragraphs are contained within a [quote]...[/quote]
> construct.  In other words, the following text should yield 3 blocks
> not 6:
>
> TEXT = '''
> Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
> Pellentesque dolor quam, dignissim ornare, porta et,
> auctor eu, leo. Phasellus malesuada metus id magna.
>
> [quote]
> Only when flight shall soar
> not for its own sake only
> up into heaven's lonely
> silence, and be no more
>
> merely the lightly profiling,
> proudly successful tool,
> playmate of winds, beguiling
> time there, careless and cool:
>
> only when some pure Whither
> outweighs boyish insistence
> on the achieved machine
>
> will who has journeyed thither
> be, in that fading distance,
> all that his flight has been.
> [/quote]
>
> Integer urna nulla, tempus sit amet, ultrices interdum,
> rhoncus eget, ipsum. Cum sociis natoque penatibus et
> magnis dis parturient montes, nascetur ridiculus mus.
> '''
>
> Other info:
>
> * don't worry about nesting
> * the [quote] and [/quote] mustn't be stripped.
>
> Gerard

(Sorry if I ruined the parent thread.) FWIW, I didn't get a groupby
solution, but with some help from the Python Cookbook (O'Reilly) I
came up with the following:

import re

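# Raw strings keep the backslashes literal; '|' inside [...] is a plain
# character rather than alternation, so it is left out of the class.
RE_START_BLOCK = re.compile(r'^\[[\w\s]*\]$')
RE_END_BLOCK = re.compile(r'^\[/[\w\s]*\]$')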

def iter_blocks(lines):
    block = []
    inblock = False
    for line in lines:
        if not line.strip():  # blank line; ''.isspace() is False, so test this way
            if inblock:
                block.append(line)
            elif block:
                yield block
                block = []
        else:
            if RE_START_BLOCK.match(line):
                inblock = True
            elif RE_END_BLOCK.match(line):
                inblock = False
            block.append(line.lstrip())
    if block:
        yield block
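
For comparison, here is a rough sketch of how the same split might be
done with itertools.groupby, in the spirit of Raymond's stateful key
function. This is only one possible approach, not something from the
Cookbook: the names iter_blocks_groupby, SAMPLE and the state dict are
made up for illustration, and it assumes tags sit on their own line
and are never nested.

```python
import re
from itertools import groupby

# Same patterns as above: [tag] opens a block, [/tag] closes it.
RE_START_BLOCK = re.compile(r'^\[[\w\s]*\]$')
RE_END_BLOCK = re.compile(r'^\[/[\w\s]*\]$')

def iter_blocks_groupby(lines):
    """Yield one list of lines per paragraph; a [tag]...[/tag] span
    (blank lines included) is kept together as a single paragraph."""
    # Hypothetical helper state shared with the key function.
    state = {'inblock': False, 'group': 0}

    def key(line):
        stripped = line.strip()
        if RE_START_BLOCK.match(stripped):
            state['inblock'] = True
        if not stripped and not state['inblock']:
            # A blank line outside a block is a separator: bump the
            # paragraph counter and mark the line to be discarded.
            state['group'] += 1
            return None
        k = state['group']
        if RE_END_BLOCK.match(stripped):
            # The closing tag still belongs to the current paragraph.
            state['inblock'] = False
        return k

    for k, g in groupby(lines, key):
        if k is not None:
            yield list(g)

SAMPLE = '''
Lorem ipsum dolor sit amet,
consectetuer adipiscing elit.

[quote]
Only when flight shall soar

merely the lightly profiling,
[/quote]

Integer urna nulla, tempus sit amet.
'''

blocks = list(iter_blocks_groupby(SAMPLE.splitlines()))
for block in blocks:
    print(block)
```

The key function numbers the paragraphs as it goes, so groupby sees a
new key whenever a separator has been crossed. Separators themselves
get the key None and are dropped. Like the generator above, it relies
on the key function being called once per line, in order.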