filter()ing a dict

Thu Aug 7 10:57:32 EDT 2003

Robin Cull wrote:

> Imagine I have a dict looking something like this:
> 
> myDict = {"key 1": ["value 1", "value 2", "value 3", "value 4"], "key
> 2": ["value 1", "value 2"], "key 3": ["value2", "value 3", "value 4"],
> "key 4": ["value 1", "value 2", "value 3", "value 4"]}
> 
> That is, a set of keys which have a variable length list of associated
> values after them.  What I want to do is filter out a subset of this
> dict to produce another dict that satisfies a set of criteria (in this
> case whether it contains all four values) to end up with something
> like this:
> 
> {"key 1": ["value 1", "value 2", "value 3", "value 4"], "key 4":
> ["value 1", "value 2", "value 3", "value 4"]}
> 
> I've tried using the filter() function so I wrote a small function
> that returned True or False depending on whether the length of the
> list was what I wanted e.g.
> 
> def f(item):
>     if len(item) == 4:
>         return True
>     else:
>         return False

This function expects to be passed the value (not the item).

> I then called filter(f, myDict).

While this passes the keys (not the items, but not the values either).

> I was expecting to be returned a dict which was the subset of myDict
> that satisfied f.
> 
> This did not happen; the item passed to f was each key of the dict,
> not each value list.  The call to filter returned an empty list (as
> obviously none of the keys satisfied the criteria).  If anything I
> would have expected an empty dict back as according to the docs:
> 
>     ""filter(function, sequence)" returns a sequence (of the same
> type, if possible) consisting of those items from the sequence for
> which function(item) is true."
> 
> The important bit here is "of the same type, if possible" which I
> understood to mean if passed a dict it would return a dict.

Please check the current docs at
http://www.python.org/doc/current/lib/built-in-funcs.html
and let us know if you consider them clearer now.  Specifically,
the sentence:
"""
If <i>list</i> is a string or a tuple, the result also has that type;
otherwise it is always a list.
"""
should help (despite the unfortunate name chosen for the argument).

> I've tried calling filter(f, myDict.values()) which works and I get a
> list like this returned:
> 
> [["value 1", "value 2", "value 3", "value 4"], ["value 1", "value 2",
> "value 3", "value 4"]]
> 
> Sort of what I want but I've lost the associativity with the key which
> was kind of the point in using a dict in the first place.

Exactly.  It's impossible to reconstruct the right dict from this one.

> I'm guessing that filter() does not work on a dict in the way I think
> it should.

Doesn't and never will.

> I can write something to get around the behaviour in this particular
> case (although suggestions would be welcome).

You need to filter on items, which are key-value pairs; therefore:

def f(item):
    return len(item[1]) == 4

the value is the 1-th element of the 'item' pair (tuple) while the
key is the 0-th element.  I've also simplified the logical structure,
since the == already returns True or False just as you wish, there
is no reason to keep an if/else (though it's just redundant, not
directly harmful).

Then, you need to provide to filter the items, explicitly:

    filter(f, myDict.items())

this gives you a list of items (pairs of key-value).  Finally you
can pass that to dict to obtain what you want:

    dict(filter(f, myDict.items()))

You could equivalently use myDict.iteritems() [may be faster, but
I doubt that matters for a reasonably small dictionary anyway].

Note that filter, these days, is often shunned in favour of a
list-comprehension, which saves the need to write a function as
the predicate for the filtering; with a list comprehension you
might write it all on one line as

dict([ (key,myDict[key]) for key in myDict if len(myDict[key])==4 ])

or, if you prefer:

dict([ (key,value) for key,value in myDict.items() if len(value)==4 ])

and other equivalent ways yet.

> My question is, does anybody else think that when calling filter(f,
> dict), it should return a dict of the keys and values of the former
> dict where the values satisfy the conditions defined by f()?

According to "Ugol's Law", on Usenet, if you ever ask whether you
are the only one who "X" (where "X" is "thinks this", "does that",
"likes the other", etc, etc), the answer is invariably "no".

However, the Ugolian certainty that there must be another person
who shares your opinion in this matter isn't all that relevant.
More interesting is the next question you ask:

> Are there reasons for it not to work this way?  I must admit I've not
> looked into this carefully, just in the context of my script it'd be
> useful to.

'filter' unfortunately carries around the complication (for backwards
compatibility) of returning a string when passed a string, a tuple
when passed a tuple, a list in all other cases.  It would obviously
be much simpler if it always returned a list.  You, on the other hand,
would want to make it even _more_ complicated -- and for reasons that
escape me, but perhaps connected at looking just at the context of
this one script, would want the filtering function to be passed just
the values, rather than (as happens now if you pass filter a dict)
just the keys.  Since you CAN recover the value from the key (if you
have access to the dictionary, e.g. as a global variable) but not
vice-versa, I think it's undisputable that such a hypothetical change
(hypothetical because it would break backwards compatibility, and
thus it's mathematically certain that it will not happen) would be
worse than the current situation.  Passing a (key,value) pair, i.e
a whole item, would on the other hand not be just as unambiguously
bad -- it would "just" be a complication whose cost (complicating
things) needs to be weighed against its advantage (making it a tad
handier to perform dictionary-filtering in certain cases) - again
hypothetically speaking, of course.

> What are the side effects of making filter() work this way?

Presumably that filter then would need to know about one more
special case (besides the two unfortunate, historical ones of
strings and tuples) rather than just being able to iterate on
its second argument.  The incompatibility of the filtering function
needing to be passed the whole item [ (key,value) pair ] is the
definite reason this will never happen; whether it should have
been so in the first place is therefore moot (I opine filter's
very existence is unjustified in today's Python, except for
backwards compatibility, but others do differ, of course).

Python 3.0's main thrust (when it finally happens) will be to
*SIMPLIFY* Python by removing a lot of currently existing
basically redundant ways to do things -- the BDFL has said so
at this year's Python conferences.  Thus, I expect the filter
function to go away, rather than gaining bells and whistles.

If you can make substantial use cases for the "filtering a
dictionary" functionality, for very large dictionaries where
dict([ (k,d[k]) for k in d if len(d[k])==4 ]) has intolerable
overhead, then one addition that _might_ have some (small)
chance of making it into future releases might be to add to
the dict class a filter method (taking just the predicate as
its argument).  I'm still rather dubious that this would
actually stand any real chance, but can't be certain of that.

Alex