[Python-ideas] Fwd: grouping / dict of lists

Michael Selik mike at selik.org
Fri Jul 13 13:38:44 EDT 2018


Thanks for linking to these. I looked at many of them in my own research,
but for some reason didn't think to write down the links. I'll respond to
each one separately.

Throughout, I'm going to use my proposed ``grouped`` builtin to demonstrate
possible revisions. Note that I am *not* suggesting a replacement for
defaultdict. The proposal is to make a common task easier and more
reliable. It does not satisfy all uses of defaultdict.

https://github.com/selik/peps/blob/master/pep-9999.rst
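
Roughly, the behavior I have in mind is equivalent to this pure-Python sketch
(the draft PEP has the exact signature and details):

    def grouped(iterable, key):
        # Map key(element) -> list of elements, preserving encounter order.
        groups = {}
        for element in iterable:
            groups.setdefault(key(element), []).append(element)
        return groups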


On Fri, Jul 13, 2018 at 6:17 AM Nicolas Rolin <nicolas.rolin at tiime.fr>
wrote:

> I noticed recently that *all* examples for collection.defaultdict (
> https://docs.python.org/3.7/library/collections.html#collections.defaultdict)
> are cases of grouping (for an int, a list and a set) from an iterator with
> a key, value output.
>
> I wondered how common those constructions were, and what else defaultdict
> is used for. So I took a little dive into a few libs (std lib,
> pypy, pandas, tensorflow, ...), and I saw essentially:
>


> A) basic cases of "grouping" with a simple for loop and a
> default_dict[key].append(value). I saw many kinds of default factory
> utilized, with list, int, set, dict, and even defaultdict(list), e.g.:
> https://frama.link/UtNqvpvb,
>

    facesizes = collections.defaultdict(int)
    for face in cubofaces:
        facesizes[len(face)] += 1

This is more specifically a Counter. It might look better as:

    facesizes = collections.Counter(len(face) for face in cubofaces)



> https://frama.link/o3Hb3-4U,
>

    accum = defaultdict(list)
    garbageitems = []

    for item in root:
        filename = findfile(opts.fileroot, item.attrib['classname'])
        accum[filename].append(float(item.attrib['time']))
        if filename is None:
            garbageitems.append(item)


This might be clearer if separated into two parts.

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])
    groups = grouped(root, keyfunc)
    groups = {k: [float(v.attrib['time']) for v in g]
              for k, g in groups.items()}
    garbage = groups.pop(None, [])


Interestingly, this code later makes a distinction between ``garbage`` and
``garbageitems`` even though they appear to be the same. The programmer
might not have made that apparent error if using the ``grouped`` function,
because the act of grouping becomes single-purpose.




> https://frama.link/dw92yJ1q
>

In this case, the grouping is being performed by Pandas, not by the
defaultdict.

    expected = defaultdict(dict)
    for n1, gp1 in data.groupby('A'):
        for n2, gp2 in gp1.groupby('B'):
            expected[n1][n2] = op(gp2.loc[:, ['C', 'D']])


However, this is still relevant to our discussion, because the defaultdict
had to be converted to a regular dict afterward, presumably to ensure that
missing keys raise KeyError.

    expected = dict((k, DataFrame(v))
                    for k, v in compat.iteritems(expected))


If you felt like one-lining this, it might look like:

    expected = {n1: DataFrame({n2: op(gp2.loc[:, ['C', 'D']])
                               for n2, gp2 in gp1.groupby('B')})
                for n1, gp1 in data.groupby('A')}

I'm not saying that's clearer, but the fact that it's possible emphasizes
that the grouping is performed by Pandas. Grouping can't reasonably be
one-lined using only the Python standard library.
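
For comparison, the nearest standard-library tool, ``itertools.groupby``, only
merges *consecutive* equal keys, so it needs a sort first. A rough sketch,
where ``iterable`` and ``keyfunc`` stand in for whatever is being grouped:

    from itertools import groupby

    # groupby only merges adjacent equal keys, so a sort comes first
    ordered = sorted(iterable, key=keyfunc)
    groups = {k: list(g) for k, g in groupby(ordered, key=keyfunc)}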




> https://frama.link/1Gqoa7WM
>

    results = defaultdict(dict)
    for i, k1 in enumerate(arg1.columns):
        for j, k2 in enumerate(arg2.columns):
            if j < i and arg2 is arg1:
                # Symmetric case
                results[i][j] = results[j][i]
            else:
                results[i][j] = f(*_prep_binary(arg1.iloc[:, i],
                                                arg2.iloc[:, j]))


The nested loop looks like a cross-product. It might be clearer using
itertools, but I wouldn't use ``grouped`` here.

    cross = itertools.product(enumerate(arg1.columns),
                               enumerate(arg2.columns))




> https://frama.link/bWswbHsU,
>

    self.mapping = collections.defaultdict(set)
    for op in (op for op in graph.get_operations()):
      if op.name.startswith(common.SKIPPED_PREFIXES):
        continue
      for op_input in op.inputs:
        self.mapping[op_input].add(op)


This is a case of a single element being added to multiple groups, which is
your section B, below. The loop and filter could be better. It looks like
someone intended to convert if/continue to a comprehension, but stopped
partway through the revision.

    ops = (op for op in graph.get_operations()
           if not op.name.startswith(common.SKIPPED_PREFIXES))


I'm not sure how much I like it, but here's a revision using ``grouped``:

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = grouped(inputs, key=itemgetter(0))
    groups = {k: set(v for _, v in g) for k, g in groups.items()}


> https://frama.link/SZh2q8pS
>

    handler_results = collections.defaultdict(list)
    for handler in per_handler_ops.keys():
        for op in per_handler_ops[handler]:
            handler_results[handler].append(op_results[op])
    return handler_results


Here it looks like the results are already grouped by handler and this loop
is constructing a parallel dict of lists for the results instead of the ops
themselves.

    return {handler: [op_results[op] for op in group]
            for handler, group in per_handler_ops.items()}


> B) cases of grouping, but where the for loop was feeding more than
> one "grouper". Pretty annoying if we want to group something, e.g.:
> https://frama.link/Db-Ny49a,
>

    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary
            of head node type --> fixer list.  """
        head_nodes = collections.defaultdict(list)
        every = []
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads = _get_head_types(fixer.pattern)
                except _EveryNode:
                    every.append(fixer)
                else:
                    for node_type in heads:
                        head_nodes[node_type].append(fixer)
            else:
                if fixer._accept_type is not None:
                    head_nodes[fixer._accept_type].append(fixer)
                else:
                    every.append(fixer)
        for node_type in chain(pygram.python_grammar.symbol2number.values(),
                               pygram.python_grammar.tokens):
            head_nodes[node_type].extend(every)
        return dict(head_nodes)


Again this ends with a conversion from defaultdict to basic dict: ``return
dict(head_nodes)``. This is somewhat common in usage of defaultdict,
because after the creation of groups, the later usage wants missing keys to
raise KeyError, not insert a new, empty group.
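
To illustrate the pitfall with a contrived sketch (not from the code above):

    import collections

    groups = collections.defaultdict(list)
    groups['typo']          # merely reading a missing key inserts {'typo': []}
    regular = dict(groups)  # convert so later lookups of missing keys raise
    # regular['another typo'] would now raise KeyError instead of inserting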

The association of a single element to multiple groups makes this a little
awkward again.

    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary
            of head node type --> fixer list.  """

        nodetypes = set(chain(pygram.python_grammar.symbol2number.values(),
                              pygram.python_grammar.tokens))

        heads = {}
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads[fixer] = _get_head_types(fixer.pattern)
                except _EveryNode:
                    heads[fixer] = nodetypes
            elif fixer._accept_type is not None:
                heads[fixer] = [fixer._accept_type]
            else:
                heads[fixer] = nodetypes

        pairs = ((head, fixer)
                 for fixer, types in heads.items()
                 for head in types)
        groups = grouped(pairs, key=itemgetter(0))
        return {head: [fixer for _, fixer in g]
                for head, g in groups.items()}



> https://frama.link/bZakUR33,
>

    directories = defaultdict(set)
    cache_directories = defaultdict(set)
    groups = defaultdict(list)
    for source, target, group, disk_id, condition in files:
        target = PureWindowsPath(target)
        groups[group].append((source, target, disk_id, condition))

        if target.suffix.lower() in {".py", ".pyw"}:
            cache_directories[group].add(target.parent)

        for dirname in target.parents:
            parent = make_id(dirname.parent)
            if parent and parent != '.':
                directories[parent].add(dirname.name)


This one has 3 different groupings created simultaneously. We could
separate the 3 tasks, but they all rely on a converted ``target`` path, so
it'd really end up being 4 loops. At that point, perhaps it's best to do
just 1 loop as written.



> https://frama.link/MwJFqh5o
>

    paramdicts = {}
    testers = collections.defaultdict(list)
    for name, attr in cls.__dict__.items():
        if name.endswith('_params'):
            if not hasattr(attr, 'keys'):
                d = {}
                for x in attr:
                    if not hasattr(x, '__iter__'):
                        x = (x,)
                    n = '_'.join(str(v) for v in x).replace(' ', '_')
                    d[n] = x
                attr = d
            paramdicts[name[:-7] + '_as_'] = attr
        if '_as_' in name:
            testers[name.split('_as_')[0] + '_as_'].append(name)


This one isn't a multiple groups issue. We can separate the tasks of
paramdicts and testers.

    as_names = (name for name in cls.__dict__ if '_as_' in name)
    testers = grouped(as_names,
                      key=lambda name: name.split('_as_')[0] + '_as_')

The paramdicts isn't a dict of lists, so it's not really appropriate for
``grouped``.




>
> C) class attribute initialization (grouping is done by repeatedly
> calling a function, so any grouping constructor will be useless here), e.g.:
> https://frama.link/GoGWuQwR,
>

Instead of ``defaultdict(lambda: 0)`` it seems better to use ``Counter()``.
It's called "invocation_counts" after all.



> https://frama.link/BugcS8wU
>

This seems like one of the ideal uses for defaultdict. It even has bits of
code to handle possibly accidental inserts of new keys by popping them off
afterwards.



> D) Sometimes you just want a defaultdict inside a defaultdict inside a dict
> and just have fun: https://frama.link/asBNLr1g,
>

I only have a tenuous grasp of git, so the fact that this file isn't in the
master branch is stretching my mind. But again, this doesn't feel like a
grouping task to me. It probably could be rewritten as such, but not
trivially.



>  https://frama.link/8j7gzfA5
>

It's hard to tell at a glance if this should truly be a dict of dicts of
sets or a dict of sets where the keys are 2-tuples. I'll assume the former
is correct.
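
For concreteness, the two shapes would look roughly like this (illustrative
values only):

    # dict of dicts of sets
    nested = {'device0': {'node0': {'slot_a', 'slot_b'}}}
    # dict of sets where the keys are 2-tuples
    flat = {('device0', 'node0'): {'slot_a', 'slot_b'}}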

On a side note, it looks like they're doing some silly things to fit into
the 80 characters per line restriction, like looping over keys and looking
up values inside the loop instead of using ``.items()``. I understand that
the character limit is strict, but...

https://github.com/tensorflow/tensorflow/blob/b202db076ecdc8a0fc80a49252d4a8dd0c8015b3/tensorflow/python/debug/lib/debug_data.py#L1371
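
That is, a pattern roughly like this (illustrative only, not the exact
TensorFlow code):

    mapping = {'a': 1, 'b': 2}   # stand-in for the real dict

    # looping over keys and looking up each value inside the loop ...
    for key in mapping:
        value = mapping[key]

    # ... versus unpacking the pairs directly
    for key, value in mapping.items():
        pass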




> From what I saw, the most useful would be to add a method to defaultdict
> to fill it from an iterable, using a grouping method adapted to the
> default_factory (so __add__ for list, int and str, add for set, update for
> dict, and probably __add__ for anything else).
>
> A sample code would be :
>
> from collections import defaultdict
> class groupingdict(defaultdict):
>     def group_by_iterator(self, iterator):
>         empty_element = self.default_factory()
>         if hasattr(empty_element, "__add__"):
>             for key, element in iterator:
>                 self[key] += element
>         elif hasattr(empty_element, "update"):
>             for key, element in iterator:
>                 self[key].update(element)
>         elif hasattr(empty_element, "add"):
>             for key, element in iterator:
>                 self[key].add(element)
>         else:
>             raise TypeError('default factory does not support iteration')
>         return self
>
> So that for example :
> >groupingdict(dict).group_by_iterator(
>     (grouping_key, a_dict) for grouping_key, a_dict in [
>         (1, {'a': 'c'}),
>         (1, {'b': 'f'}),
>         (1, {'a': 'e'}),
>         (2, {'a': 'e'})
>     ]
> )
> returns
>
> >groupingdict(dict, {1: {'a': 'e', 'b': 'f'}, 2: {'a': 'e'}})
>
>
> My implementation is garbage and there should be two methods, one returning
> the object and one modifying it, but I think it gives more leeway than just
> a function returning a dict.
>


Let's take a look at these two examples:

    grouped(contacts, key=itemgetter('city'))
    grouped(os.listdir('.'),
            key=lambda filepath: os.path.splitext(filepath)[1])

How would those be rewritten using your proposal?