[Baypiggies] Notes on defaultdict

Fri Aug 27 13:00:44 CEST 2010

On Thu, Aug 26, 2010 at 11:16:22PM -0700, Glen Jarvis wrote:
>    >>> for color, value in input:
> 
>    ...     if output.has_key(color):
> 
>    ...         output[color].append(value)
> 
>    ...     else:
> 
>    ...         output[color] = [value]
> 
>    ...

Use 'color in output', avoid the has_key... if this is targeted as an 
intro to python, has_key should *not* be mentioned- it's slower than 
doing 'in' due to method lookup and it was punted from py3k.  Been 
frowned upon for a long while.

Teaching someone new to python, either you can tell them about 
list.index, dict.has_key, or you can tell them about the 'in' 
operator.  Expose new folk to just the latter, they have one rule to 
know- specifically knowing about the containment protocol.

This is generally the pythonic way- who cares if it's a dict or a list 
or a tuple, if it quacks like a duck (does containment), then treat it 
like a duck (it's sequence like).

>    What a default dictionary type is, it's simply a subclassed version of
>    the dictionary that does a few special things.
> 
>    You give it a basic 'factory function' -- in our case a list -- and
>    that's what's initialized for a key the first time it is accessed. We
>    don't have to worry if the key was previously initialized, etc..

Instantiated if the key is missing, rather than 'first time access'.  
It's pedantic correction, but the wording you're using here conflicts 
with-

d = defaultdict(list)
d[1].append(2)
del d[1]
print d[1]

>    Some will argue that this is less explicit -- and I would probably have
>    been one of them when I first started preparing this newbie nugget.
>    But, after doing over this a few times to explain it, I've finally
>    started to come on board to the idea. We're explicitly stating the
>    default type when we initialize the defaultdict variable.
> 
>    I especially like it that the default dict gives a default type for any
>    key -- not for a specific key as setdefault seems to do (correct me if
>    I misspeak here).

The phrasing/selling point here is questionable; how it reads to me is 
claiming that all key/vals wind up using that factory function 
(trying to sell the notion of 'default type' specifically), which 
isn't the case.  Might want to reword that.

As for setdefault, the phrasing there is further confusing.  
setdefault has no notion of default type... it's purely "if the key 
doesn't exist, use this default".

>    But, more than anything, it's *much faster* than doing it the other
>    ways.. But, I want you to prove that to yourself and to me… I have
>    created a 1 GB file with colors and numbers. I'll give you this file so
>    that you can read it in and create the dictionary of your choice, with
>    the method of your choice.

You really shouldn't be trying to sell defaultdict on it's speed since 
it's not much of a gurantee.

Consider the following:

python -m timeit -n 1000 -s 'from collections import defaultdict;' \
 $'d=defaultdict(lambda :[0])\nfor x in xrange(1000):d[x].append(1)'
1000 loops, best of 3: 794 usec per loop

# optimized variant.
python -m timeit -n 1000 -s 'from collections import defaultdict;from functools import partial' \
 $'d=defaultdict(partial(list, [0]))\nfor x in xrange(1000):d[x].append(1)'
1000 loops, best of 3: 723 usec per loop

python -m timeit -n 1000 -s 'pass' \
  $'d={}\nfor x in xrange(1000):d.setdefault(x, [0]).append(1)'
1000 loops, best of 3: 546 usec per loop

python -m timeit -n 1000 -s 'pass' \
 $'d={}\nfor x in xrange(1000):\n if x in d:d[x].append(1)\n else:d[x]=[0,1]'
1000 loops, best of 3: 282 usec per loop

To be clear, for the scenario of defaultdict(list), defaultdict *is* 
around 20% faster than the optimized dict equivalent (py2.6 timings 
specifically).

Where defaultdict breaks down is when instantiation is semi complex- 
if you have to rely on lambda for example, the frame overhead is going 
to add up pretty quick.   In addition, reliance on defaultdict means 
you lose the ability to do site specific faster versions ([0,1] for 
example).

Pretty much, if you're going to sell folk on defaultdict, sell them on 
code cleanliness:

# consider this:
d = {}
for key, (subkey, val) in source_items:
 if key in d:
  if subkey in d[key]:
   d[key][subkey].extend(subval)
  else:
   d[key][subkey] = list(subval)
 else:
  d[key] = {subkey:list(subval)}

# simplified...
d = {}
for key, (subkey, val) in source_items:
 d.setdefault(key, {}).setdefault(subkey, []).extend(val)

# or using defaultdict
d = defaultdict(partial(defaultdict, list))
for key, (subkey, val) in source_items:
 d[key][subkey].extend(val)

While that's an abbreviated example, if one imagines the logic being 
a fair bit more complex for assignment/mangling of subkey/val, the 
slightly tricky defaultdict setup used in the last example is offset 
in maintenance costs pretty quickly via not having to dink around 
with setdefault or equivalent logic.

Hell, consider the following-

def f():
 return defaultdict(f)

my_tree = f()
my_tree[1][2][3][4][5][6][7][8][9]
my_tree[1][3]

Stuff like that really is the power of defaultdict (more the power of 
__missing__, but you seem to be skipping that detail in your nugget 
thingy-mo-bob).

~harring
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/baypiggies/attachments/20100827/deaddabe/attachment.pgp>