Candidate for a new itertool

Mon Mar 9 11:26:23 EDT 2009

On Mar 7, 8:47 pm, Raymond Hettinger <pyt... at rcn.com> wrote:
> The existing groupby() itertool works great when every element in a
> group has the same key, but it is not so handy when groups are
> determined by boundary conditions.
>
> For edge-triggered events, we need to convert a boundary-event
> predicate to groupby-style key function.  The code below encapsulates
> that process in a new itertool called split_on().
>
> Would love you guys to experiment with it for a bit and confirm that
> you find it useful.  Suggestions are welcome.
>
> Raymond
>
> -----------------------------------------
>
> from itertools import groupby
>
> def split_on(iterable, event, start=True):
>     'Split iterable on event boundaries (either start events or stop
> events).'
>     # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
>     # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
>     def transition_counter(x, start=start, cnt=[0]):
>         before = cnt[0]
>         if event(x):
>             cnt[0] += 1
>         after = cnt[0]
>         return after if start else before
>     return (g for k, g in groupby(iterable, transition_counter))
>
> if __name__ == '__main__':
>     for start in True, False:
>         for g in split_on('X1X23X456X', 'X'.__eq__, start):
>             print list(g)
>         print
>
>     from pprint import pprint
>     boundary = '--===============2615450625767277916==\n'
>     email = open('email.txt')
>     for mime_section in split_on(email, boundary.__eq__):
>         pprint(list(mime_section, 1, None))
>         print '= = ' * 30

Sorry to hijack the thread but I now that you have a knack for finding
good iterator patterns. I have noticed a pattern lately: Aggregation
using a defaultdict. I quickly found two examples of problems that
could use this:

http://groups.google.com/group/comp.lang.python/browse_frm/thread/c8b3976ec3ceadfd

http://www.willmcgugan.com/blog/tech/2009/1/17/python-coder-test/

To show an example, using data like this:

>>> data=[('red',2,'other data'),('blue',5,'more data'),('yellow',3,'lots of things'),('blue',1,'data'),('red',2,'random data')]

Then
>>> from itertools import groupby
>>> from operator import itemgetter
>>> from collections import defaultdict

We can use groupby to do this:
>>> [(el[0],sum(x[1] for x in el[1])) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', 6), ('red', 4), ('yellow', 3)]

>>> [(el[0],[x[1] for x in el[1]]) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', [5, 1]), ('red', [2, 2]), ('yellow', [3])]

>>> [(el[0],set([x[1] for x in el[1]])) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', set([1, 5])), ('red', set([2])), ('yellow', set([3]))]

But this way seems to be more efficient:

>>> def aggrsum(data,key,agrcol):
	dd=defaultdict(int)
	for el in data:
		dd[key(el)]+=agrcol(el)
	return dd.items()

>>> aggrsum(data,itemgetter(0),itemgetter(1))
[('blue', 6), ('yellow', 3), ('red', 4)]

>>> def aggrlist(data,key,agrcol):
	dd=defaultdict(list)
	for el in data:
		dd[key(el)].append(agrcol(el))
	return dd.items()

>>> aggrlist(data,itemgetter(0),itemgetter(1))
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

>>> def aggrset(data,key,agrcol):
	dd=defaultdict(set)
	for el in data:
		dd[key(el)].add(agrcol(el))
	return dd.items()

>>> aggrset(data,itemgetter(0),itemgetter(1))
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]

The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype to be used also as elements of
the list to be processed.

But I haven't found a nice generalized way for that kind of pattern
that aggregates from a list of one datatype to a list of key plus
output datatype that would make it practical and suitable for
inclusion in the standard library.