[Python-Dev] "groupby" iterator

Sat Nov 29 00:34:01 EST 2003

[Guido, on grouping elements of a sequence by key]
> ...
> Or is there a more elegant approach than my original code that I've
> missed all these years?

I've always done it like:

    d = {}
    for x in sequence:
        d.setdefault(key(x), []).append(x)
    # Now d has partitioned sequence by key.  The keys are
    # available as d.keys(), the associated groups as d.values().
    # So, e.g.,
    for key, group in d.iteritems():
        d[key] = sum(group)

There's no code duplication, or warts for an empty sequence, which are the
ugly parts of the non-dict approach.  It doesn't matter here whether the
elements orginally appear with equal keys all adjacent, and input often
isn't sorted that way.  When it isn't, not needing to sort first can be a
major time savings if the sequence is big.  Against it, a dict is a large
data structure.  I don't think it's ever been a real problem that it
requires keys to be hashable.

groupby() looks very nice when it applies.

> ...
>   totals = {}
>   for key, group in groupby(keyfunc, sequence):
>       totals[key] = sum(group)

Or

    totals = dict((key, sum(group))
                  for key, group in groupby(keyfunc, sequence))

exploiting generator expressions too.

[after Raymond wonders about cases where the consumer doesn't
 iterate over the group generators
]

> I don't think those semantics should be implemented.  You should be
> required to iterate through each group.

Brrrr.  Sounds error-prone (hard to explain, and impossible to enforce
unless the implementation does almost all the work it would need to allow
groups to get skipped -- if the implementation can detect that a group
hasn't been fully iterated, then it could almost as easily go on to skip
over remaining equal keys itself instead of whining about it; but if the
implementation can't detect it, accidental violations of the requirement
will be hard to track down).

You're a security guy now.  You've got a log with line records of the form

    month  day  hhmmss  severity_level  threat_id

It's sorted ascending by month then desceding by severity_level.  You want a
report of the top 10 threats seen each month.

    for month, lines in groupby(lamdba s: s.split()[0], input_file):
        print month
        print itertools.islice(lines, 10)

Like array[:10], islice() does the right thing if there are fewer than 10
lines in a month.  It's just not natural to require that an iterator be run
to exhaustion (if it *were* natural, this wouldn't be the first context ever
to require it <wink>).