Candidate for a new itertool
pruebauno at latinmail.com
Mon Mar 9 11:26:23 EDT 2009
On Mar 7, 8:47 pm, Raymond Hettinger <pyt... at rcn.com> wrote:
> The existing groupby() itertool works great when every element in a
> group has the same key, but it is not so handy when groups are
> determined by boundary conditions.
>
> For edge-triggered events, we need to convert a boundary-event
> predicate to groupby-style key function. The code below encapsulates
> that process in a new itertool called split_on().
>
> Would love you guys to experiment with it for a bit and confirm that
> you find it useful. Suggestions are welcome.
>
> Raymond
>
> -----------------------------------------
>
> from itertools import groupby
>
> def split_on(iterable, event, start=True):
>     'Split iterable on event boundaries (either start events or stop events).'
>     # split_on('X1X23X456X', 'X'.__eq__, True) --> X1 X23 X456 X
>     # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
>     def transition_counter(x, start=start, cnt=[0]):
>         before = cnt[0]
>         if event(x):
>             cnt[0] += 1
>         after = cnt[0]
>         return after if start else before
>     return (g for k, g in groupby(iterable, transition_counter))
>
> if __name__ == '__main__':
>     for start in True, False:
>         for g in split_on('X1X23X456X', 'X'.__eq__, start):
>             print list(g)
>         print
>
>     from pprint import pprint
>     boundary = '--===============2615450625767277916==\n'
>     email = open('email.txt')
>     for mime_section in split_on(email, boundary.__eq__):
>         pprint(list(mime_section))
>         print '= = ' * 30
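Untangled from the mailing-list line wrapping, the recipe above runs on Python 3 as well once the print statements become function calls. A self-contained sketch, using the same sample string as the comments in the recipe:

```python
from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    def transition_counter(x, start=start, cnt=[0]):
        # cnt is a one-element list used as a mutable counter shared
        # across calls; it advances by one at every boundary element.
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    # Consecutive elements that see the same counter value form one group.
    return (g for k, g in groupby(iterable, transition_counter))

start_groups = [''.join(g) for g in split_on('X1X23X456X', 'X'.__eq__, True)]
stop_groups = [''.join(g) for g in split_on('X1X23X456X', 'X'.__eq__, False)]
# start_groups == ['X1', 'X23', 'X456', 'X']
# stop_groups  == ['X', '1X', '23X', '456X']
```

With start=True a boundary element opens a new group; with start=False it closes the current one, which is why the 'X' lands at the front of each group in the first case and at the back in the second.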
Sorry to hijack the thread, but I know you have a knack for finding
good iterator patterns. I have noticed a pattern lately: aggregation
using a defaultdict. I quickly found two examples of problems that
could use it:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/c8b3976ec3ceadfd
http://www.willmcgugan.com/blog/tech/2009/1/17/python-coder-test/
To show an example, take data like this:
>>> data=[('red',2,'other data'),('blue',5,'more data'),('yellow',3,'lots of things'),('blue',1,'data'),('red',2,'random data')]
Then
>>> from itertools import groupby
>>> from operator import itemgetter
>>> from collections import defaultdict
We can use groupby to do this:
>>> [(el[0],sum(x[1] for x in el[1])) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', 6), ('red', 4), ('yellow', 3)]
>>> [(el[0],[x[1] for x in el[1]]) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', [5, 1]), ('red', [2, 2]), ('yellow', [3])]
>>> [(el[0],set([x[1] for x in el[1]])) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', set([1, 5])), ('red', set([2])), ('yellow', set([3]))]
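Worth noting: groupby() only merges *adjacent* equal keys, which is why each one-liner above sorts first. A small sketch of what goes wrong without the sort:

```python
from itertools import groupby
from operator import itemgetter

data = [('red', 2, 'other data'), ('blue', 5, 'more data'),
        ('yellow', 3, 'lots of things'), ('blue', 1, 'data'),
        ('red', 2, 'random data')]

# No sorted() call: equal keys that are not adjacent stay in
# separate groups, so 'red' and 'blue' each appear twice.
unsorted_sums = [(k, sum(x[1] for x in g))
                 for k, g in groupby(data, itemgetter(0))]
# unsorted_sums == [('red', 2), ('blue', 5), ('yellow', 3), ('blue', 1), ('red', 2)]
```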
But this way seems to be more efficient:
>>> def aggrsum(data,key,agrcol):
...     dd=defaultdict(int)
...     for el in data:
...         dd[key(el)]+=agrcol(el)
...     return dd.items()
>>> aggrsum(data,itemgetter(0),itemgetter(1))
[('blue', 6), ('yellow', 3), ('red', 4)]
>>> def aggrlist(data,key,agrcol):
...     dd=defaultdict(list)
...     for el in data:
...         dd[key(el)].append(agrcol(el))
...     return dd.items()
>>> aggrlist(data,itemgetter(0),itemgetter(1))
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]
>>> def aggrset(data,key,agrcol):
...     dd=defaultdict(set)
...     for el in data:
...         dd[key(el)].add(agrcol(el))
...     return dd.items()
>>> aggrset(data,itemgetter(0),itemgetter(1))
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]
The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype will also turn up as the element
type of the lists being processed.
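For the attribute case, attrgetter plays the role itemgetter plays above. A sketch with hypothetical field names (Row, color, amount are made up for illustration):

```python
from collections import defaultdict, namedtuple
from operator import attrgetter

Row = namedtuple('Row', 'color amount info')
rows = [Row('red', 2, 'other data'), Row('blue', 5, 'more data'),
        Row('blue', 1, 'data')]

def aggrsum(data, key, agrcol):
    # Same shape as the defaultdict version above; key/agrcol are
    # callables, so tuples and attribute-bearing objects both work.
    dd = defaultdict(int)
    for el in data:
        dd[key(el)] += agrcol(el)
    return dd.items()

result = dict(aggrsum(rows, attrgetter('color'), attrgetter('amount')))
# result == {'red': 2, 'blue': 6}
```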
But I haven't found a generalization of this pattern (aggregating a
list of one datatype into a list of key plus output-datatype pairs)
that is clean enough to be practical and suitable for inclusion in
the standard library.
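For concreteness, one shape such a generalization could take, written as a per-key fold (an illustrative sketch, not something proposed in the post): parametrize on the combining function and its identity value.

```python
from operator import itemgetter, add

def aggregate(data, key, value, combine, initial):
    """Fold value(el) into a per-key accumulator using combine.

    A hypothetical generalization of aggrsum/aggrlist/aggrset above.
    """
    dd = {}
    for el in data:
        k = key(el)
        # Start from `initial` on first sight of a key; combine must
        # return a fresh accumulator rather than mutate in place.
        dd[k] = combine(dd.get(k, initial), value(el))
    return dd

data = [('red', 2, 'other data'), ('blue', 5, 'more data'),
        ('yellow', 3, 'lots of things'), ('blue', 1, 'data'),
        ('red', 2, 'random data')]

sums = aggregate(data, itemgetter(0), itemgetter(1), add, 0)
lists = aggregate(data, itemgetter(0), itemgetter(1),
                  lambda acc, v: acc + [v], [])
# sums  == {'red': 4, 'blue': 6, 'yellow': 3}
# lists == {'red': [2, 2], 'blue': [5, 1], 'yellow': [3]}
```

The price of the generality is that the list and set cases allocate a new accumulator per element, which is exactly the inefficiency the specialized defaultdict versions avoid.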
More information about the Python-list mailing list