Fate of itertools.dropwhile() and itertools.takewhile()

Sun Dec 30 20:02:34 EST 2007

[Marc 'BlackJack' Rintsch]
> I use both functions from time to time.
> One "recipe" is extracting blocks from text files that are delimited by a
> special start and end line.
>
> def iter_block(lines, start_marker, end_marker):
>     return takewhile(lambda x: not x.startswith(end_marker),
>                      dropwhile(lambda x: not x.startswith(start_marker),
>                                lines))

Glad to hear this came from real code instead of being contrived for
this discussion.  Thanks for the contribution.

Looking at the code fragment, I wondered how that approach compared to
others in terms of ease of writing, self-evident correctness, freedom
from awkward constructs, and speed.  The lambda expressions are not as
fast as straight C calls or in-lined code, and each one needs a 'not'
to invert the startswith condition.  The inversion is awkward, and it
makes it less self-evident whether the lines with the markers are
included in or excluded from the output (the recipe may in fact be
buggy -- the line with the start marker is included and the line with
the end marker is excluded).  Your excellent choice of indentation
helps improve the readability of the nested takewhile/dropwhile calls.
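
To make the include/exclude behavior concrete, here is a small check of
the quoted recipe.  The sample lines and the BEGIN/END marker names are
made up for illustration; only the recipe itself comes from the post
above.

```python
from itertools import dropwhile, takewhile

def iter_block(lines, start_marker, end_marker):
    # The quoted recipe, unchanged apart from the import above.
    return takewhile(lambda x: not x.startswith(end_marker),
                     dropwhile(lambda x: not x.startswith(start_marker),
                               lines))

# Hypothetical sample input; the marker names are assumptions.
sample = ["header", "BEGIN block", "alpha", "beta", "END block", "footer"]

result = list(iter_block(sample, "BEGIN", "END"))
print(result)
```

dropwhile() stops dropping at the first line that starts with the start
marker and passes that line through, while takewhile() consumes the end
marker line without yielding it.  That asymmetry is where the
included-start, excluded-end behavior comes from.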

In contrast, the generator version is clearer about whether the start
and end marker lines get included and is easily modified if you want
to change that choice.  It is easy to write and more self-evident
about how it handles the end cases.  Also, it avoids the expense of
the lambda function calls and the awkwardness of the 'not' to invert
the sense of the test:

    def iter_block(lines, start_marker, end_marker):
        inblock = False
        for line in lines:
            if inblock:
                if line.startswith(end_marker):
                    break
                yield line
            elif line.startswith(start_marker):
                yield line
                inblock = True
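
As a sketch of how that choice could be changed, the variant below adds
two keyword arguments.  The names iter_block_flagged, include_start and
include_end are hypothetical and not part of the original recipe; the
defaults reproduce the behavior of the generator above.

```python
def iter_block_flagged(lines, start_marker, end_marker,
                       include_start=True, include_end=False):
    # Hypothetical variant: the two flags are illustrative assumptions.
    inblock = False
    for line in lines:
        if inblock:
            if line.startswith(end_marker):
                if include_end:
                    yield line      # optionally emit the end marker line
                break
            yield line
        elif line.startswith(start_marker):
            if include_start:
                yield line          # optionally emit the start marker line
            inblock = True

# Hypothetical sample input; the marker names are assumptions.
sample = ["header", "BEGIN block", "alpha", "beta", "END block", "footer"]

default = list(iter_block_flagged(sample, "BEGIN", "END"))
inner = list(iter_block_flagged(sample, "BEGIN", "END", include_start=False))
both = list(iter_block_flagged(sample, "BEGIN", "END", include_end=True))
```

Each choice is a single conditional next to the yield it controls, which
is the kind of local change that is harder to express with nested
takewhile/dropwhile calls.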

And, of course, for this particular application, an approach based on
regular expressions makes short work of the problem and runs very
fast:

    re.search('(^beginmark.*)^endmark', textblock, re.M | re.S).group(1)
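
A runnable sketch of the regular expression approach follows.  The
sample text is made up, and the non-greedy pattern is an alternative I
am adding for comparison, not something from the one-liner above.  Two
things are worth knowing: re.search() returns None when nothing matches,
so calling .group(1) directly would raise AttributeError; and the greedy
'.*' runs to the last endmark line, whereas the generator stops at the
first one.

```python
import re

# Hypothetical sample text containing two endmark lines.
textblock = "header\nbeginmark a\nx\nendmark\ny\nendmark\nfooter\n"

# Pattern from the one-liner above: greedy, so it extends to the LAST endmark.
greedy = re.search('(^beginmark.*)^endmark', textblock, re.M | re.S)
greedy_block = greedy.group(1) if greedy else None

# Assumed alternative: non-greedy '.*?' stops at the FIRST endmark line.
lazy = re.search('(^beginmark.*?)^endmark', textblock, re.M | re.S)
lazy_block = lazy.group(1) if lazy else None

# No start marker at all: search() returns None instead of a match object.
missing = re.search('(^beginmark.*)^endmark', "nothing here\n", re.M | re.S)
```

Unlike the iterator versions, this needs the whole text in memory as one
string rather than a stream of lines.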

Raymond