Adding tuples to a dictionary

bvukov at teletrader.com
Thu May 31 16:06:07 EDT 2007


On May 31, 8:30 pm, Maciej Bliziński <maciej.blizin... at gmail.com>
wrote:
> Hi Pythonistas!
>
> I've got a question about storing tuples in a dictionary. First, a
> small test case which creates a list of dictionaries:
>
> import time
>
> list_of_dicts = []
> keys = [str(x) for x in range(200000)]
> prev_clk = time.clock()
> for i in range(20):
>     my_dict = {}
>     for key in keys:
>         my_dict[key] = key
>         list_of_dicts.append(my_dict)
>     new_clk = time.clock()
>     print i, new_clk - prev_clk
>     prev_clk = new_clk
>
> It creates dictionaries and stores them in a list, printing out
> execution times. The size of each dictionary is constant, and so is
> the execution time for each iteration.
>
> 0 0.1
> 1 0.1
> 2 0.1
> 3 0.08
> 4 0.09
>
> ...and so on.
>
> Then, just one line is changed:
>         my_dict[key] = key
> into:
>         my_dict[key] = (key, key)
>
> Full code:
>
> list_of_dicts = []
> keys = [str(x) for x in range(200000)]
> prev_clk = time.clock()
> for i in range(20):
>     my_dict = {}
>     for key in keys:
>         my_dict[key] = (key, key)
>     list_of_dicts.append(my_dict)
>     new_clk = time.clock()
>     print i, new_clk - prev_clk
>     prev_clk = new_clk
>
> The difference is that tuples, instead of single values, are added
> to the dictionary. When the program is run again:
>
> 0 0.27
> 1 0.37
> 2 0.49
> 3 0.6
> ...
> 16 2.32
> 17 2.45
> 18 2.54
> 19 2.68
>
> The execution time is rising linearly with every new dictionary
> created.
>
> Next experiment: dictionaries are not stored in a list; they are
> simply discarded when an iteration has finished. It's done by
> removing two lines:
>
> list_of_dicts = []
>
> and
>
> list_of_dicts.append(my_dict)
>
> Full code:
>
> keys = [str(x) for x in range(200000)]
> prev_clk = time.clock()
> for i in range(20):
>     my_dict = {}
>     for key in keys:
>         my_dict[key] = (key, key)
>     new_clk = time.clock()
>     print i, new_clk - prev_clk
>     prev_clk = new_clk
>
> The time is constant again:
>
> 0 0.28
> 1 0.28
> 2 0.28
> 3 0.26
> 4 0.26
>
> I see no reason for this kind of performance problem, really. It
> happens when both conditions hold: the dictionaries are kept in a
> list (or, more generally, in memory) and they store tuples.
>
> As this goes beyond my understanding of Python internals, I would
> like to kindly ask if anyone has an idea about how to create this
> data structure (a list of dictionaries of tuples, assuming the size
> of all dictionaries is the same) in constant time?
>
> Regards,
> Maciej

Let me comment on what happens in your code.
The places where you create new objects are:

keys = [str(x) for x in range(200000)]  # creates 200000 strings,
                                        # which are then reused by reference

and

my_dict[key] = (key, key)  # creates a new 2-element tuple; both elements
                           # are references to the existing key object
The tricky part is where you wrote:

    for key in keys:
        my_dict[key] = (key, key)
        list_of_dicts.append(my_dict)  # note that the append is inside
                                       # the inner loop! check upstairs!

This means that a reference to my_dict is stored 200000 times per
iteration, so the dictionary is never released.
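If the append really belongs outside the inner loop, the fix is a
one-line re-indent. A minimal sketch of the corrected program (the same
as your second listing, with the import included):

import time

list_of_dicts = []
keys = [str(x) for x in range(200000)]
prev_clk = time.clock()
for i in range(20):
    my_dict = {}
    for key in keys:
        my_dict[key] = (key, key)
    list_of_dicts.append(my_dict)  # appended once per dictionary,
                                   # not once per key
    new_clk = time.clock()
    print i, new_clk - prev_clk
    prev_clk = new_clk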
The statement

my_dict = {}

always creates a new dictionary (20 iterations means 20 new
dictionaries) and starts over. Python keeps freed dictionaries on an
internal free list (after deletion; they're used everywhere), but
because the old dictionaries here are still referenced, that reuse
can't happen and memory has to be allocated again.
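You can often see that free list at work in CPython; a small,
implementation-dependent sketch (the id comparison is not guaranteed,
it just frequently holds):

d = {}
addr = id(d)
del d                  # the dictionary goes back on CPython's free list
d2 = {}                # ...and is frequently handed right back out
print id(d2) == addr   # often True, but an implementation detail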
Lists are internally arrays of pointers; when there is no room for the
next element, the pointer array is over-allocated (roughly doubled), so
there is no huge penalty in the append function. Lists are likewise
reused from an internal cache. Dictionaries have a similar growth
strategy: when there is no room for the next key, the internal hash
table is resized.
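In Pythons that have sys.getsizeof (2.6 and later), you can watch both
growth patterns directly; a rough sketch:

import sys

lst = []
for i in range(20):
    lst.append(i)
    print len(lst), sys.getsizeof(lst)  # size stays flat, then jumps when
                                        # the pointer array is reallocated

d = {}
for i in range(10):
    d[i] = i
    print len(d), sys.getsizeof(d)      # jumps when the hash table resizes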
The reason you see a growing time is that memory allocation takes place
instead of objects being reused by reference. Check the memory usage,
and you'll see that the test time is pretty much proportional to
overall memory usage.
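One way to check that on a Unix system (where resource.getrusage
reports the peak resident set size; the units are platform-dependent)
is to print the memory next to each timing; a rough sketch:

import time
import resource

list_of_dicts = []
keys = [str(x) for x in range(200000)]
prev_clk = time.clock()
for i in range(20):
    my_dict = {}
    for key in keys:
        my_dict[key] = (key, key)
    list_of_dicts.append(my_dict)
    new_clk = time.clock()
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # peak RSS
    print i, new_clk - prev_clk, rss  # time and memory should grow together
    prev_clk = new_clk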

Regards, Bosko



