[Python-Dev] PEP for new dictionary implementation

Jim Jewett jimjjewett at gmail.com
Fri Feb 17 04:32:51 CET 2012


On Thu, Feb 16, 2012 at 4:34 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Am 16.02.2012 19:24, schrieb Jim J. Jewett:

>> PEP author Mark Shannon wrote
>> (in http://mail.python.org/pipermail/python-dev/attachments/20120208/05be469a/attachment.txt):

>>> ... allows ... (the ``__dict__`` attribute of an object) to share
>>> keys with other attribute dictionaries of instances of the same class.

>> Is "the same class" a deliberate restriction, or just a convenience
>> of implementation?

> It's about the implementation: the class keeps a pointer to the key set.
> A subclass has a separate pointer for that.

I would prefer to see that reason in the PEP; after a few years, I
have trouble finding the relevant email, even when I remember reading
the conversation.

>> Have you timed not storing the hash (in the dict) at all, at least for
>> (unicode) str-only dicts?  Going to the string for its own cached hash
>> breaks locality a bit more, but saves 1/3 of the memory for combined
>> tables, and may make a big difference for classes that have
>> relatively few instances.

> I'd be in favor of that, but it is actually an unrelated change: whether
> or not you share key sets is unrelated to whether or not str-only dicts
> drop the cached hash.

Except that the biggest arguments against dropping the stored hash are
that it hurts cache locality and that it changes the dictentry struct
-- and this patch already changes that struct anyway.
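
To make that concrete, here is a rough standalone sketch (my own
simplified stand-ins, not CPython's actual PyDictEntry or
PyUnicodeObject layouts): the str-only entry drops the hash field, and
the probe code reads the hash that the string object already caches.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical, simplified stand-ins for CPython's real structs. */
    typedef struct {
        long cached_hash;          /* like the hash a str object caches */
        const char *data;
    } StrObject;

    /* Current-style entry: the hash is duplicated into the dict. */
    typedef struct {
        long hash;
        StrObject *key;
        void *value;
    } EntryWithHash;               /* 3 words per slot */

    /* Proposed str-only entry: no hash field. */
    typedef struct {
        StrObject *key;
        void *value;
    } EntryNoHash;                 /* 2 words per slot */

    /* The probe loop fetches the hash from the key object instead,
     * at the cost of one extra pointer dereference. */
    static long entry_hash(const EntryNoHash *ep)
    {
        return ep->key->cached_hash;
    }

    int main(void)
    {
        StrObject name = { 12345L, "spam" };
        EntryNoHash ep = { &name, NULL };
        printf("with hash: %zu bytes/slot, without: %zu bytes/slot\n",
               sizeof(EntryWithHash), sizeof(EntryNoHash));
        printf("hash via key object: %ld\n", entry_hash(&ep));
        return 0;
    }

That extra dereference is exactly the locality cost mentioned above;
the saving is the third word per slot.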

> Given a dict, it may be tricky to determine
> whether or not it is str-only, i.e. what layout to use.

Isn't that exactly the same determination needed when deciding whether
or not to use lookdict_unicode?  (It would make the switch to the more
general lookdict more expensive, as that would involve a new
allocation.)
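
Roughly the pattern I have in mind, sketched standalone with plain
function pointers (the names and signatures here are made up; CPython's
real lookdict/lookdict_unicode differ): the dict carries a lookup
function pointer, and the first non-str key swaps in the general
routine -- with a hash-less str-only layout, that swap is where the new
allocation would have to happen.

    #include <stdio.h>

    struct dict;
    typedef int (*lookup_func)(struct dict *, const void *key, int key_is_str);

    /* Hypothetical dict carrying a per-dict lookup function pointer. */
    struct dict {
        lookup_func lookup;
        /* ... key/value table(s) ... */
    };

    static int lookup_general(struct dict *d, const void *key, int key_is_str)
    {
        (void)d; (void)key; (void)key_is_str;
        puts("general lookup (hashes stored per entry)");
        return 0;
    }

    static int lookup_str_only(struct dict *d, const void *key, int key_is_str)
    {
        if (!key_is_str) {
            /* First non-str key: downgrade.  With no stored hashes this is
             * where a hash column would have to be allocated and filled. */
            puts("downgrading to general lookup");
            d->lookup = lookup_general;
            return d->lookup(d, key, key_is_str);
        }
        puts("str-only lookup (hash read from the key object)");
        return 0;
    }

    int main(void)
    {
        struct dict d = { lookup_str_only };
        d.lookup(&d, "name", 1);   /* stays specialized             */
        d.lookup(&d, &d, 0);       /* non-str key forces the switch */
        d.lookup(&d, "name", 1);   /* now takes the general path    */
        return 0;
    }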

>>> Reduction in memory use is directly related to the number of dictionaries
>>> with shared keys in existence at any time. These dictionaries are typically
>>> half the size of the current dictionary implementation.

>> How do you measure that?  The limit for huge N across huge numbers
>> of dicts should be 1/3 (because both hashes and keys are shared); I
>> assume that gets swamped by object overhead in typical small dicts.

> It's more difficult than that. He also drops the smalltable (which I
> think is a good idea), so accounting for how this all plays together is tricky.

All the more reason to explain in the PEP how he measured or approximated it.
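
For what it's worth, the 1/3 I quoted comes from arithmetic along these
lines -- my own back-of-the-envelope estimate, not a measurement of the
patch, and it assumes 8-byte hash/key/value slots while ignoring object
headers, the smalltable question, and any rounding of the values array:

    #include <stdio.h>

    int main(void)
    {
        const long word = 8;       /* bytes per hash, key, or value slot  */
        long slots  = 8;           /* table size of a small instance dict */
        long ndicts = 1000;        /* instances sharing one key set       */

        /* Combined tables: each dict stores hash, key and value per slot. */
        long combined = ndicts * slots * 3 * word;

        /* Shared keys: one hash+key table, plus a values array per dict. */
        long shared = slots * 2 * word + ndicts * slots * 1 * word;

        printf("combined: %ld bytes, shared: %ld bytes, ratio: %.2f\n",
               combined, shared, (double)shared / combined);
        return 0;
    }

As the number of sharing dicts grows, the ratio approaches 1/3; with
real object overhead and small dicts the measured number will land
elsewhere, which is why the methodology belongs in the PEP.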

>>> If a table is split the values in the keys table are ignored,
>>> instead the values are held in a separate array.

>> If they're just dead weight, then why not use them to hold indices
>> into the array, so that values arrays only have to be as long as
>> the number of keys, rather than rounding them up to a large-enough
>> power-of-two?  (On average, this should save half the slots.)

> Good idea. However, how do you track per-dict how large the table is?

Why would you want to?

The per-instance array needs to be at least as large as the highest
index used by any key for which it has a value; if the keys table gets
far larger (or even shrinks), that doesn't really matter to the
instance.  What does matter to the instance is getting a value of its
own for a new (to it) key -- and then the keys table can tell it which
index to use, which in turn tells it whether or not it needs to grow
the array.
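
Something like this standalone sketch is what I mean (the names, the
fixed-size table, and the linear scan are all my own shorthand, just to
show the bookkeeping): the shared table hands out a dense index per
key, and an instance grows its values array only when it first stores a
value at an index it has not used before.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Shared key table: the otherwise-dead value slot holds a dense index. */
    struct shared_keys {
        const char *keys[16];
        size_t      index[16];
        size_t      nkeys;
    };

    /* Per-instance values array, only as long as the highest index used. */
    struct instance {
        void  **values;
        size_t  nvalues;
    };

    /* Find or assign the dense index for a key (a real table would probe). */
    static size_t key_index(struct shared_keys *sk, const char *key)
    {
        for (size_t i = 0; i < sk->nkeys; i++)
            if (strcmp(sk->keys[i], key) == 0)
                return sk->index[i];
        sk->keys[sk->nkeys] = key;
        sk->index[sk->nkeys] = sk->nkeys;
        return sk->nkeys++;
    }

    /* Store a value, growing the array only if this index is new to us. */
    static void set_value(struct instance *ins, struct shared_keys *sk,
                          const char *key, void *value)
    {
        size_t ix = key_index(sk, key);
        if (ix >= ins->nvalues) {
            ins->values = realloc(ins->values, (ix + 1) * sizeof *ins->values);
            memset(ins->values + ins->nvalues, 0,
                   (ix + 1 - ins->nvalues) * sizeof *ins->values);
            ins->nvalues = ix + 1;
        }
        ins->values[ix] = value;
    }

    int main(void)
    {
        struct shared_keys sk = {0};
        struct instance a = {0}, b = {0};
        set_value(&a, &sk, "x", "1");
        set_value(&b, &sk, "x", "2");   /* b stays at one slot       */
        set_value(&a, &sk, "y", "3");   /* only a grows to two slots */
        printf("a: %zu slots, b: %zu slots, %zu shared keys\n",
               a.nvalues, b.nvalues, sk.nkeys);
        free(a.values); free(b.values);
        return 0;
    }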

Or are you thinking of len(o.__dict__), which will indeed be a bit
slower?  That cost comes from split dicts having potentially missing
values, and it applies regardless of how much memory is (or is not)
set aside for those missing slots.

-jJ

