[Python-Dev] shrinking dicts

M.-A. Lemburg mal@lemburg.com
Tue, 10 Aug 1999 19:39:29 +0200


Vladimir Marangozov wrote:
> 
> Currently, dictionaries always grow until they are deallocated from
> memory. This happens in PyDict_SetItem according to the following
> code (before inserting the new item into the dict):
> 
>         /* if fill >= 2/3 size, double in size */
>         if (mp->ma_fill*3 >= mp->ma_size*2) {
>                 if (dictresize(mp, mp->ma_used*2) != 0) {
>                         if (mp->ma_fill+1 > mp->ma_size)
>                                 return -1;
>                 }
>         }
> 
> The symmetric case is missing and this has intrigued me for a long time,
> but I've never had the courage to look deeply into this portion of code
> and try to propose a solution. Which is: reduce the size of the dict by
> half when the nb of used items <= 1/6 the size.
> 
> This situation occurs far less frequently than dict growing, but anyways,
> it seems useful for the degenerate cases where a dict has a peek usage,
> then most of the items are deleted. This is usually the case for global
> dicts holding dynamic object collections, etc.
> 
> A bonus effect of shrinking big dicts with deleted items is that
> the lookup speed may be improved, because of the cleaned <dummy> entries
> and the reduced overall size (resulting in a better hit ratio).
> 
> The (only) solution I could came with for this pb is the appended patch.
> It is not immediately obvious, but in practice, it seems to work fine.
> (inserting a print statement after the condition, showing the dict size
>  and current usage helps in monitoring what's going on).
> 
> Any other ideas on how to deal with this? Thoughts, comments?

I think that integrating this into the C code is not really that
effective since the situation will not occur that often and then
it often better to let the programmer decide rather than integrate
an automatic downsize.

You can call dict.update({}) to force an internal
resize (the empty dictionary can be made global since it is not
manipulated in any way and thus does not cause creation overhead).

Perhaps a new method .resize(approx_size) would make this even
clearer. This would also have the benefit of allowing a programmer
to force allocation of the wanted size, e.g.

d = {}
d.resize(10000)
# Insert 10000 items in a batch insert

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                   143 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/