[Baypiggies] Notes on defaultdict
Brian Harring
ferringb at gmail.com
Fri Aug 27 13:00:44 CEST 2010
On Thu, Aug 26, 2010 at 11:16:22PM -0700, Glen Jarvis wrote:
> >>> for color, value in input:
>
> ... if output.has_key(color):
>
> ... output[color].append(value)
>
> ... else:
>
> ... output[color] = [value]
>
> ...
Use 'color in output', avoid the has_key... if this is targeted as an
intro to python, has_key should *not* be mentioned- it's slower than
doing 'in' due to method lookup and it was punted from py3k. Been
frowned upon for a long while.
Teaching someone new to python, either you can tell them about
list.index, dict.has_key, or you can tell them about the 'in'
operator. Expose new folk to just the latter, they have one rule to
know- specifically knowing about the containment protocol.
This is generally the pythonic way- who cares if it's a dict or a list
or a tuple, if it quacks like a duck (does containment), then treat it
like a duck (it's sequence like).
> What a default dictionary type is, it's simply a subclassed version of
> the dictionary that does a few special things.
>
> You give it a basic 'factory function' -- in our case a list -- and
> that's what's initialized for a key the first time it is accessed. We
> don't have to worry if the key was previously initialized, etc..
Instantiated if the key is missing, rather than 'first time access'.
It's pedantic correction, but the wording you're using here conflicts
with-
d = defaultdict(list)
d[1].append(2)
del d[1]
print d[1]
> Some will argue that this is less explicit -- and I would probably have
> been one of them when I first started preparing this newbie nugget.
> But, after doing over this a few times to explain it, I've finally
> started to come on board to the idea. We're explicitly stating the
> default type when we initialize the defaultdict variable.
>
> I especially like it that the default dict gives a default type for any
> key -- not for a specific key as setdefault seems to do (correct me if
> I misspeak here).
The phrasing/selling point here is questionable; how it reads to me is
claiming that all key/vals wind up using that factory function
(trying to sell the notion of 'default type' specifically), which
isn't the case. Might want to reword that.
As for setdefault, the phrasing there is further confusing.
setdefault has no notion of default type... it's purely "if the key
doesn't exist, use this default".
> But, more than anything, it's *much faster* than doing it the other
> ways.. But, I want you to prove that to yourself and to me… I have
> created a 1 GB file with colors and numbers. I'll give you this file so
> that you can read it in and create the dictionary of your choice, with
> the method of your choice.
You really shouldn't be trying to sell defaultdict on it's speed since
it's not much of a gurantee.
Consider the following:
python -m timeit -n 1000 -s 'from collections import defaultdict;' \
$'d=defaultdict(lambda :[0])\nfor x in xrange(1000):d[x].append(1)'
1000 loops, best of 3: 794 usec per loop
# optimized variant.
python -m timeit -n 1000 -s 'from collections import defaultdict;from functools import partial' \
$'d=defaultdict(partial(list, [0]))\nfor x in xrange(1000):d[x].append(1)'
1000 loops, best of 3: 723 usec per loop
python -m timeit -n 1000 -s 'pass' \
$'d={}\nfor x in xrange(1000):d.setdefault(x, [0]).append(1)'
1000 loops, best of 3: 546 usec per loop
python -m timeit -n 1000 -s 'pass' \
$'d={}\nfor x in xrange(1000):\n if x in d:d[x].append(1)\n else:d[x]=[0,1]'
1000 loops, best of 3: 282 usec per loop
To be clear, for the scenario of defaultdict(list), defaultdict *is*
around 20% faster than the optimized dict equivalent (py2.6 timings
specifically).
Where defaultdict breaks down is when instantiation is semi complex-
if you have to rely on lambda for example, the frame overhead is going
to add up pretty quick. In addition, reliance on defaultdict means
you lose the ability to do site specific faster versions ([0,1] for
example).
Pretty much, if you're going to sell folk on defaultdict, sell them on
code cleanliness:
# consider this:
d = {}
for key, (subkey, val) in source_items:
if key in d:
if subkey in d[key]:
d[key][subkey].extend(subval)
else:
d[key][subkey] = list(subval)
else:
d[key] = {subkey:list(subval)}
# simplified...
d = {}
for key, (subkey, val) in source_items:
d.setdefault(key, {}).setdefault(subkey, []).extend(val)
# or using defaultdict
d = defaultdict(partial(defaultdict, list))
for key, (subkey, val) in source_items:
d[key][subkey].extend(val)
While that's an abbreviated example, if one imagines the logic being
a fair bit more complex for assignment/mangling of subkey/val, the
slightly tricky defaultdict setup used in the last example is offset
in maintenance costs pretty quickly via not having to dink around
with setdefault or equivalent logic.
Hell, consider the following-
def f():
return defaultdict(f)
my_tree = f()
my_tree[1][2][3][4][5][6][7][8][9]
my_tree[1][3]
Stuff like that really is the power of defaultdict (more the power of
__missing__, but you seem to be skipping that detail in your nugget
thingy-mo-bob).
~harring
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/baypiggies/attachments/20100827/deaddabe/attachment.pgp>
More information about the Baypiggies
mailing list