itertools.groupby usage to get structured data

Fri Feb 4 21:27:43 EST 2011

On Fri, 04 Feb 2011 15:14:24 -0800, Slafs wrote:

> Hi there!
> 
> I'm having trouble to wrap my brain around this kind of problem:

Perhaps you should consider backing up and starting from somewhere else
with different input data, or changing the requirements. Just a thought.

> What I have :
>   1) list of dicts
>   2) list of keys that i would like to be my grouping arguments of
> elements from 1)
>   3) list of keys that i would like do "aggregation" on the elements
> of 1) with some function e.g. sum

You start with data:

dicts = [ {'g1': 1, 'g2': 8, 's_v1': 5.0, 's_v2': 3.5},
          {'g1': 1, 'g2': 9, 's_v1': 2.0, 's_v2': 3.0}, 
          {'g1': 2, 'g2': 8, 's_v1': 6.0, 's_v2': 8.0} ]

It sometimes helps me to think about data structures by drawing them out. 
In this case, you have what is effectively a two-dimensional table:

g1  g2  s_v1  s_v2
=== === ===== ====
1   8   5.0   3.5
1   9   2.0   3.0 
2   8   6.0   8.0

Nice and simple. But the result you want is a bit more complex -- it's a 
dict of dicts of dicts:

{1: {'s_v1': 7.0, 's_v2': 6.5,
     'g2': {8: {'s_v1': 5.0, 's_v2': 3.5}, 
            9: {'s_v1': 2.0, 's_v2': 3.0}
           }},
 2: {'s_v1': 6.0, 's_v2': 8.0, 
    'g2': {8: {'s_v1' : 6.0, 's_v2': 8.0}
          }}}

(I quote from the Zen of Python: "Flat is better than nested." Hmmm.)

which is equivalent to a *four*-dimensional table, which is a bit hard to
write out :)

Here's a two-dimensional projection of a single slice with key = 1:

s_v1  s_v2  g2
===== ===== =====
7.0   6.5     | s_v1  s_v2
            ---------------
            8 | 5.0   3.5
            9 | 2.0   3.0

Does this help you to either (1) redesign your data structures, or (2) 
work out how to go from there?
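
If you go with option (1), one possible flatter layout is a single dict
keyed by tuples of grouping values, with one entry per prefix of the
grouping keys. This is only a sketch of one redesign, not something your
requirements ask for: the function name flat_totals and the parameter
names group_keys and sum_keys are made up for illustration, and it assumes
that summing is the only aggregation you need.

```python
from collections import defaultdict

dicts = [{'g1': 1, 'g2': 8, 's_v1': 5.0, 's_v2': 3.5},
         {'g1': 1, 'g2': 9, 's_v1': 2.0, 's_v2': 3.0},
         {'g1': 2, 'g2': 8, 's_v1': 6.0, 's_v2': 8.0}]


def flat_totals(dicts, group_keys, sum_keys):
    """Hypothetical helper: sum sum_keys for every prefix of group_keys."""
    # Each new tuple key starts with a zero for every summed field.
    totals = defaultdict(lambda: dict.fromkeys(sum_keys, 0.0))
    for d in dicts:
        # With group_keys ['g1', 'g2'] the prefixes are (g1,) and (g1, g2).
        for depth in range(1, len(group_keys) + 1):
            prefix = tuple(d[k] for k in group_keys[:depth])
            for k in sum_keys:
                totals[prefix][k] += d[k]
    return dict(totals)


flat = flat_totals(dicts, ['g1', 'g2'], ['s_v1', 's_v2'])
```

Here flat[(1,)] holds the totals for g1 == 1, and flat[(1, 8)] holds the
totals for g1 == 1, g2 == 8. Adding a 'g3' just makes the tuples longer
rather than the nesting deeper.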

[...]
> I was looking for a solution that would let me do that kind of grouping
> with variable lists of 2) and 3) i.e. having also 'g3' as grouping
> element so the 'g2' dicts could also have their own "subgroup" and be
> even more nested then. I was trying something with itertools.groupby and
> updating nested dicts, but as i was writing the code it started to feel
> too verbose to me :/

I don't think groupby is the tool you want. It groups *consecutive* items 
in sequences:

>>> from itertools import groupby
>>> for key, it in groupby([1,1,1,2,3,4,3,3,3,5,1]):
...     print(key, list(it))
...
1 [1, 1, 1]
2 [2]
3 [3]
4 [4]
3 [3, 3, 3]
5 [5]
1 [1]

Except for the name, I don't see any connection between this and what you 
want to do.
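
To see the "consecutive" behaviour on your own data, here is a small
illustration that groups the three dicts by 'g2'. The variable names are
mine. Because the two rows with g2 == 8 are not adjacent, groupby reports
8 twice unless the input is sorted by the same key first.

```python
from itertools import groupby
from operator import itemgetter

dicts = [{'g1': 1, 'g2': 8, 's_v1': 5.0, 's_v2': 3.5},
         {'g1': 1, 'g2': 9, 's_v1': 2.0, 's_v2': 3.0},
         {'g1': 2, 'g2': 8, 's_v1': 6.0, 's_v2': 8.0}]

by_g2 = itemgetter('g2')

# Input order as given: the two g2 == 8 rows end up in separate groups.
unsorted_keys = [key for key, _ in groupby(dicts, key=by_g2)]

# Sorted by the grouping key first: equal keys are now adjacent.
sorted_keys = [key for key, _ in groupby(sorted(dicts, key=by_g2), key=by_g2)]

print(unsorted_keys)
print(sorted_keys)
```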

The approach I would take is a top-down approach:

dicts = [ ... ]  # list of dicts, as above.
result = {}
for d in dicts:
    # process each dict in isolation
    temp = process(d)
    merge(result, temp)

merge() should hopefully be straightforward, and process() only needs to
look at one dict at a time.
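
For what it's worth, here is a minimal sketch of what process() and
merge() might look like. Treat it as one possible reading of the outline
above rather than a finished solution. The extra parameters group_keys and
sum_keys are my own additions, and it assumes that the aggregation is
always a sum, that every dict has all the listed keys, and that no
grouping key shares a name with a summed key.

```python
def process(d, group_keys, sum_keys):
    """Turn one flat dict into a nested dict holding just that record."""
    values = {k: d[k] for k in sum_keys}
    if not group_keys:
        return values
    first, rest = group_keys[0], group_keys[1:]
    node = dict(values)  # this level's own copy of the summed fields
    if rest:
        # The next grouping key's *name* labels the sub-table.
        node[rest[0]] = process(d, rest, sum_keys)
    return {d[first]: node}


def merge(result, temp):
    """Fold temp into result in place, adding numbers and recursing into dicts."""
    for key, value in temp.items():
        if key not in result:
            result[key] = value
        elif isinstance(value, dict):
            merge(result[key], value)
        else:
            result[key] += value


dicts = [{'g1': 1, 'g2': 8, 's_v1': 5.0, 's_v2': 3.5},
         {'g1': 1, 'g2': 9, 's_v1': 2.0, 's_v2': 3.0},
         {'g1': 2, 'g2': 8, 's_v1': 6.0, 's_v2': 8.0}]

result = {}
for d in dicts:
    temp = process(d, ['g1', 'g2'], ['s_v1', 's_v2'])
    merge(result, temp)
```

With the sample data this builds the same dict of dicts of dicts shown
earlier. A longer group_keys list such as ['g1', 'g2', 'g3'] nests one
level deeper without any change to merge().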

-- 
Steven