[Python-ideas] Proposal : allowing grouping by relation

Andrew Barnert abarnert at yahoo.com
Sat Aug 23 23:57:46 CEST 2014


On Aug 23, 2014, at 12:23, Yotam Vaknin <tomirendo at gmail.com> wrote:

> Hi,
> 
> I am using groupby (from itertools) to group objects by a key. It would be very useful for me to be able to group objects by the relation between two consecutive objects, or by an object's relation to the first object in the current group.
> 
> I think it should be done by adding a "relation" keyword to the function that accepts a two-argument function returning true or false.

This _should be_ easy to write as a wrapper around groupby with a key that checks your relation.

But there's one problem: groupby checks the _first_ key in a group against each new key, instead of the most recent one.
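
To make that concrete, here is a minimal sketch of the problem. The Near class is a made-up key wrapper for illustration only (it is not part of itertools), and the result shown is what I'd expect from the CPython implementation, which compares the group's first key against each new key:

```python
from itertools import groupby


class Near:
    """Hypothetical key wrapper: two keys are "equal" if they differ by at most 1."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if not isinstance(other, Near):
            return NotImplemented
        return abs(self.value - other.value) <= 1


# Every adjacent pair in 1..5 is related, so one might hope for [[1, 2, 3, 4, 5], [7]].
groups = [list(group) for _, group in groupby([1, 2, 3, 4, 5, 7], key=Near)]

# Each new key is checked against the key that opened the group, not the previous one,
# so the run is cut as soon as a value is too far from the group's first value.
print(groups)  # [[1, 2], [3, 4], [5], [7]]
```

That is fine if you want "relation to the first object in the group", but not if you want "relation between consecutive objects".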

I wrote a blog post last year about this (http://stupidpythonideas.blogpost.com/2014/01/grouping-into-runs-of-adjacent-values.html). It turns out to be pretty easy if your relation is symmetric, but only one of the obvious ways to do it actually works.

Anyway, it might be worth changing groupby so it always compares keys in the same order (never x==y where it should be y==x), and making the C implementation and the Python equivalent in the docs actually equivalent.

Beyond that, I think it might make sense to add a relation_to_key function and/or to change cmp_to_key so it's directly usable with groupby.

Then, it should be possible to make groupby_relation into a 3-line wrapper around groupby, in which case I think it might be better as a recipe (and submitted to more_itertools on PyPI) than to add it to itertools itself.
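
In the meantime, here is a rough sketch of such a recipe. It does not use a relation_to_key function (which doesn't exist yet); instead it swaps in a different trick, a stateful key function that returns a counter and bumps it whenever the relation between the previous and current item fails. The names groupby_relation and difference are hypothetical, and the strict "less than" in difference is an assumption inferred from the example output in the original message:

```python
from itertools import groupby


def difference(limit, key=lambda x: x):
    """Hypothetical helper: relate two items whose keys differ by less than limit."""
    def relation(a, b):
        return abs(key(a) - key(b)) < limit
    return relation


def groupby_relation(iterable, relation):
    """Yield lists of consecutive items for which relation(previous, current) holds."""
    unset = object()   # sentinel: no previous item seen yet
    previous = unset
    group_id = 0

    def key(item):
        nonlocal previous, group_id
        # Start a new group whenever the relation to the previous item fails.
        if previous is not unset and not relation(previous, item):
            group_id += 1
        previous = item
        return group_id

    for _, group in groupby(iterable, key):
        yield list(group)


runs = list(groupby_relation("aaabcdddefgjklm", difference(3, key=ord)))
print(runs)
# [['a', 'a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'f', 'g'], ['j', 'k', 'l', 'm']]
```

Because the key is a plain integer, it doesn't matter which order groupby compares keys in, and the relation doesn't have to be symmetric. The cost is that the key function carries state, so it relies on groupby calling it once per item, in order.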

> It would also be useful to create a function that makes it easy to create relation functions. (Like attrgetter does for keys)
> ls = "aaabcdddefgjklm"
> groupby(ls, relation=difference(3, key=ord))
> #[['a', 'a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'f', 'g'], ['j', 'k', 'l', 'm']]
> 
> I think in this case the function won't return a key-group tuple, but just a group iterable.

The key actually can be useful here. You can use it as a label for the "column". Especially if you've written your key function so it keeps track of both the first and most recent values, instead of just the most recent, so you can label it "a-_", where that _ is the current value at any given point, and the last value once you've consumed the group iterator. Sure, you _could_ recover that information from the group itself if you need it, but isn't it even easier to discard it if you don't need it?

> 
> This is already very useful for me, to group event objects in a list if they are close enough in time.
> 
> I wrote most of what I had in mind here:
> https://github.com/tomirendo/Grouper