[Python-Dev] I think my set module is ready for prime time; comments?

Wed, 24 Jan 2001 12:17:09 -0500

> I meant that the dictionary would keep a slot for "the one and only
> value". First time someone puts a value in the dict, it puts it
> in the "one and only value" slot, and doesn't initalize the value
> array. The second time someone puts a value, it checks for pointer
> equality with that "one and only value". If it is the same, it
> it still doesn't initalize the value array. The only time when
> the dictionary initalizes the value array is when two pointer-different
> values are put in.
> 
> This would let me code
> 
> a[key] = None
> 
> For my sets (but consistent in the same set!)
> 
> a[key] = 1
> 
> When the timbot codes (again, consistent in the same set)
> 
> and
> 
> a[key] = 'present'
> 
> If you're really weird.
> 
> (identifier-like strings get interned)
> 
> That's not *semantics*, that's *optimization* for a commonly
> used (I think) idiom with dictionaries -- you can't predict
> the value, but it will probably remain the same.

This I like!

But note that a dict currently uses 12 bytes per slot in the hash
table (on a 32-bit platform: long me_hash; PyObject *me_key,
*me_value).  The hash table's fill factor is typically between 50 and
67%.

I think removing the hashes would slow down lookups too much, so
optimizing identical values out would only save 6-8 bytes per existing
key on average.  Not clear if it's worth enough.  I think I have to
agree with Tim's expectation that two (or three) separate parallel
arrays will reduce the cache locality and thus slow things down.  Once
you start probing, you jump through the hashtable at large random
strides, causing bad cache performance (for largeish hash tables); but
since often enough the first slot tried is right, you have the hash,
key and value right next together, typically on the same cache line.

--Guido van Rossum (home page: http://www.python.org/~guido/)