[Python-Dev] Reducing memory overhead for dictionaries by removing me_hash

"Martin v. Löwis" martin at v.loewis.de
Sun Apr 23 08:45:41 CEST 2006


Kirat Singh wrote:
> Hi, this is my first python dev post, so please forgive me if this topic
> has already been discussed.

To my knowledge, this hasn't been discussed before.

> It seemed to me that removing me_hash from a dict entry would save 2/3
> of the space used by dictionaries and also improve alignment of the
> entries since they'd be 8 bytes instead of 12. And sets end up having
> just 4 byte entries.

How did you compute the saving of 2/3? If one field goes and two fields
stay, it saves 1/3, no? Also, the "improved" alignment isn't that much
of an improvement on a 32-bit system, AFAICT.

Likewise, on a 64-bit system, me_hash consumes 8 bytes (regardless
of whether a long is 4 or 8 bytes, because of pointer alignment),
so the saving would again be 1/3: each entry would shrink from 24
to 16 bytes, and the alignment would change (not necessarily
improve) from 8 to 16.
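
For concreteness, here is the layout under discussion, sketched
from Include/dictobject.h of that era (field types vary a bit
between releases, and "ProposedEntry" is just an illustrative name;
sizes assume a 32-bit platform with 4-byte longs and pointers):

/* assumes #include <Python.h> for PyObject */

/* The current entry: 12 bytes per slot. */
typedef struct {
    long      me_hash;   /* cached hash of me_key  (4 bytes) */
    PyObject *me_key;    /*                        (4 bytes) */
    PyObject *me_value;  /*                        (4 bytes) */
} PyDictEntry;

/* The proposed entry: 8 bytes, i.e. a saving of 1/3 per slot,
 * not 2/3.  (A set entry, which today holds a key plus a cached
 * hash, would indeed shrink to a single 4-byte pointer.) */
typedef struct {
    PyObject *me_key;
    PyObject *me_value;
} ProposedEntry;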

> I'm guessing that string dicts are the most common (hence the
> specialized lookupdict_string routine), and since strings already
> contain their hash, this would probably mitigate the performance impact.
> One could also add a hash to Tuples since they are immutable.

You implicitly give the reason for the me_hash field: performance.
This is a classical space-vs-time trade-off, currently leaning
towards more-space-less-time. You want the trade-off to lean the
other way: less-space-more-time.
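
The "strings already contain their hash" point is a memoization
pattern: CPython's string objects carry an ob_shash field that
starts out as -1 and is filled in by the first hash call. A
standalone sketch of that pattern (simplified from
Objects/stringobject.c; the struct here is illustrative, not the
real PyStringObject):

#include <stddef.h>

/* Illustrative stand-in for PyStringObject: the object carries
 * its own hash cache, so hashing costs O(len) once and O(1)
 * afterwards. */
typedef struct {
    long        ob_shash;   /* -1 means "not computed yet" */
    size_t      ob_size;
    const char *ob_sval;    /* NUL-terminated, as in CPython */
} Str;

static long
str_hash(Str *a)
{
    const unsigned char *p = (const unsigned char *)a->ob_sval;
    size_t len = a->ob_size;
    long x;

    if (a->ob_shash != -1)          /* cache hit: no work at all */
        return a->ob_shash;
    x = *p << 7;                    /* the CPython 2.x string hash */
    while (len-- > 0)
        x = (1000003L * x) ^ *p++;
    x ^= (long)a->ob_size;
    if (x == -1)                    /* -1 is reserved for "unset" */
        x = -2;
    a->ob_shash = x;                /* memoize for the next call */
    return x;
}

So lookdict_string can fetch a string key's hash essentially for
free, while an arbitrary key type without such a cache would pay
the full PyObject_Hash cost each time.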

As you might not be aware of how much more time it would take,
consider these uses of the me_hash field:
- this is open addressing, so a lookup
  1. computes the hash of the key;
  2. looks at the slot addressed by that hash;
  3. if there is no key in that slot, the lookup fails
     (and a free slot has been found);
  4. if the stored hash of the entry in that slot equals
     the computed hash, the keys are compared; if they
     are equal, the lookup succeeds;
  5. if the hashes are not equal, or the keys are not
     equal, an alternative slot is computed, and the
     search continues with step 3.
  Under your algorithm, step 4 loses its cheap guard:
  every occupied slot visited while probing would need
  a full object comparison, even when two keys merely
  collide on the same slot. This might noticeably slow
  down dictionaries that are close to full (see the
  sketch after this list).
- when the dictionary is resized, all entries need to
  be reinserted into the new table. Today the stored
  me_hash is reused for this; if it is removed, every
  entry costs an extra PyObject_Hash call on each
  resize, which would likely slow down resizing
  significantly.
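
To make the guard in step 4 concrete, here is a condensed sketch
of the probe loop in lookdict() (Objects/dictobject.c). It is not
the real function (dummy slots left by deletions, comparison
errors, and the free-slot bookkeeping are all omitted), but the
role of me_hash is as shown:

/* assumes #include <Python.h>; a sketch, not the real lookdict() */
static PyDictEntry *
lookdict_sketch(PyDictObject *mp, PyObject *key, long hash)
{
    size_t mask = (size_t)mp->ma_mask;
    size_t i = (size_t)hash & mask;     /* step 2: first slot */
    size_t perturb;
    PyDictEntry *ep = &mp->ma_table[i];

    for (perturb = (size_t)hash; ; perturb >>= 5) {
        if (ep->me_key == NULL)
            return ep;                  /* step 3: empty slot, miss */
        if (ep->me_key == key)
            return ep;                  /* same object: certain hit */
        /* Step 4: the cheap guard.  Two longs are compared first;
         * only when the stored hash matches do we pay for a full
         * (possibly arbitrarily expensive) key comparison.  Without
         * me_hash, every occupied slot probed would go straight to
         * the object comparison. */
        if (ep->me_hash == hash &&
            PyObject_RichCompareBool(ep->me_key, key, Py_EQ) > 0)
            return ep;                  /* step 4: genuine hit */
        /* step 5: collision; next slot in the probe sequence */
        i = (i << 2) + i + perturb + 1;
        ep = &mp->ma_table[i & mask];
    }
}

The resize cost is visible in the same file: dictresize() reinserts
every entry into the new table passing the stored ep->me_hash
along, so today no PyObject_Hash call is needed there at all.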

> If this isn't a totally stupid idea, I'd be happy to volunteer to try
> the experiment and run any suggested tests.

I don't see much value in saving a few bytes per entry, and I
expect that performance would go down noticeably. Still, only an
experiment can show what the real impact is.

Regards,
Martin
