itertools.groupby

Wolfgang Maier wolfgang.maier at biologie.uni-freiburg.de
Mon Apr 22 09:17:43 EDT 2013


Jason Friedman <jsf80238 <at> gmail.com> writes:

> 
> Thank you for the responses!  Not sure yet which one I will pick.
> 

Hi again,
I was working a bit on my own solution and on the one from Steven/Joshua,
and maybe this helps you decide:

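from itertools import groupby

def separate_on(iterable, separator):
    # based on groupby: consecutive header lines form one group,
    # consecutive body lines form the next
    sep_len = len(separator)
    header_tail = None  # title for a first body that has no header
    pending = False     # True while the latest header has no body yet
    for is_header, item in groupby(
            iterable, lambda line: line[:sep_len] == separator):
        if is_header:
            header_tails = [h[sep_len:].strip() for h in item]
            # every header but the last in a run has an empty body
            for naked_header in header_tails[:-1]:
                yield (naked_header, [])
            header_tail = header_tails[-1]
            pending = True
        else:
            yield (header_tail, [s.strip() for s in item])
            pending = False
    # a header at the very end of the input has an empty body, too
    if pending:
        yield (header_tail, [])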


def group(iterable, separator):
    # Steven's/Joshua's rewritten
    sep_len = len(separator)
    accum = None   # None until the first line has been seen
    header = None  # stays None for a first body without a header
    for item in iterable:
        item = item.strip()
        if item[:sep_len] == separator:
            if accum is not None:
                # A previous group exists (its body may be empty).
                yield (header, accum)
            # strip the tail so titles match separate_on()
            header = item[sep_len:].strip()
            accum = []
        else:
            if accum is None:
                accum = []
            accum.append(item)

    # Don't forget the last group of lines (if there was any input).
    if accum is not None:
        yield (header, accum)

Both versions behave as follows:
- any line that *starts* with the separator is treated as a header line. The
tail of that line is returned as the group's title in a tuple with the
group's content, i.e. (header, [body]). If there's only the separator, the
title is ''. I find this a more useful behaviour as it allows things like:

##Group1
elem1
elem2
elem3
##Group2
a
b
c
...

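- if there are headers without a body, they are reported as (header, []).
- if the first body has no header, that's reported as (None, [body]).

Note that group() strips each line before testing it for the separator, so an
indented header line still counts as a header there, while separate_on()
tests the raw line.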
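
To make the behaviour listed above concrete, here is a small self-contained
sketch. The helper name split_groups and the sample lines are made up for
illustration. It is a condensed restatement of the accumulate-style loop, not
either function verbatim.

```python
def split_groups(lines, separator):
    """Yield (title, body) pairs; hypothetical condensed helper."""
    title, body = None, None  # body is None until a first line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(separator):
            if body is not None:
                # close the previous group, even if its body is empty
                yield (title, body)
            title, body = line[len(separator):].strip(), []
        else:
            if body is None:
                body = []
            body.append(line)
    if body is not None:
        yield (title, body)


# made-up input: a headerless first body, a bare separator, two titled groups
sample = ["orphan", "##Group1", "elem1", "elem2", "##", "##Group2", "a"]
result = list(split_groups(sample, "##"))
for title, body in result:
    print(repr(title), body)
```

The headerless first line comes back under the title None, and the bare "##"
line comes back as a group titled '' with an empty body.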

Advantages & disadvantages of either form:
Steven's/Joshua's: simple and fast.
It's more readable, I'd say, and for small groups the groupby implementation
is about 1.5x slower than this one. The groupby version catches up with
increasing group sizes (because it uses comprehensions instead of
list.append, I think), but it only wins once groups reach ~1000 elements.
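
If you want to check the speed comparison on your own machine, something
along these lines should do. It is only a rough sketch: by_groupby and
by_append are shortened stand-ins for the two functions, and make_lines,
compare, the synthetic input and the chosen group sizes are all my
assumptions. Absolute numbers will depend on the Python version and the data.

```python
from itertools import groupby
from timeit import timeit


def by_groupby(lines, sep):
    # groupby-based variant: bodies are built with a comprehension
    n = len(sep)
    title, pending = None, False
    for is_header, run in groupby(lines, lambda s: s[:n] == sep):
        if is_header:
            tails = [h[n:].strip() for h in run]
            for t in tails[:-1]:
                yield (t, [])
            title, pending = tails[-1], True
        else:
            yield (title, [s.strip() for s in run])
            pending = False
    if pending:
        yield (title, [])


def by_append(lines, sep):
    # accumulate variant: bodies are built with list.append
    n = len(sep)
    title, body = None, None
    for s in lines:
        s = s.strip()
        if s[:n] == sep:
            if body is not None:
                yield (title, body)
            title, body = s[n:].strip(), []
        else:
            if body is None:
                body = []
            body.append(s)
    if body is not None:
        yield (title, body)


def make_lines(n_groups, group_size):
    # synthetic input: n_groups headers, each followed by group_size lines
    lines = []
    for g in range(n_groups):
        lines.append("##Group%d" % g)
        lines.extend("elem%d" % i for i in range(group_size))
    return lines


def compare(group_size, total_lines=20000, repeats=5):
    # time both variants on the same input; returns (groupby, append) seconds
    lines = make_lines(max(1, total_lines // group_size), group_size)
    t_groupby = timeit(lambda: list(by_groupby(lines, "##")), number=repeats)
    t_append = timeit(lambda: list(by_append(lines, "##")), number=repeats)
    return t_groupby, t_append


if __name__ == "__main__":
    for size in (3, 30, 300, 3000):
        t_g, t_a = compare(size)
        print("group size %5d: groupby %.4fs, append %.4fs" % (size, t_g, t_a))
```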

The groupby implementation: more flexible.
Its yield statement deliberately builds a list of the elements, but before
that you just have an iterator, which you could just as well turn into a
tuple, set, string or anything else without constructing the list in memory.
So in terms of code reuse this might be preferable.
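
As a sketch of what that flexibility could look like: the variant below takes
an extra argument that decides what each body is turned into. The name
separate_into and the collect parameter are hypothetical, not part of the
function posted above.

```python
from itertools import groupby


def separate_into(iterable, separator, collect=list):
    """groupby-based splitter; `collect` (a hypothetical parameter)
    decides what each body becomes."""
    sep_len = len(separator)
    title, pending = None, False
    for is_header, run in groupby(
            iterable, lambda line: line[:sep_len] == separator):
        if is_header:
            tails = [h[sep_len:].strip() for h in run]
            for t in tails[:-1]:
                yield (t, collect(()))
            title, pending = tails[-1], True
        else:
            # the generator expression is handed straight to collect,
            # so this function builds no list of its own
            yield (title, collect(s.strip() for s in run))
            pending = False
    if pending:
        yield (title, collect(()))


lines = ["##Group1", "b", "a", "b", "##Group2", "x", "y"]
as_tuples = list(separate_into(lines, "##", tuple))
as_sets = list(separate_into(lines, "##", set))
as_strings = list(separate_into(lines, "##", ",".join))
print(as_tuples)
print(as_strings)
```

Each body has to be consumed before the loop moves on to the next group,
because groupby invalidates the previous group's iterator at that point.
That is why collect is called inside the yield expression.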

Cheers,
Wolfgang



