Planning a Python Course for Beginners

Thu Aug 10 10:45:25 EDT 2017

On Thu, 10 Aug 2017 07:00 pm, Peter Otten wrote:

> Steven D'Aprano wrote:
> 
>> On Wed, 09 Aug 2017 20:07:48 +0300, Marko Rauhamaa wrote:
>> 
>>> Good point! A very good __hash__() implementation is:
>>> 
>>>     def __hash__(self):
>>>         return id(self)
>>> 
>>> In fact, I didn't know Python (kinda) did this by default already. I
>>> can't find that information in the definition of object.__hash__():
>> 
>> 
>> Hmmm... using id() as the hash would be a terrible hash function. Objects
> 
> It's actually id(self) >> 4 (almost, see C code below), to account for
> memory alignment.

Thanks for tracking that down. As you show, the default hash isn't id() itself.

The C code says:

>     /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
>        excessive hash collisions for dicts and sets */

which I think agrees with my comment: using the id() itself would put too many
objects in the same bucket (i.e. too many collisions).

>>>> obj = object()
>>>> hex(id(obj))
> '0x7f1f058070b0'
>>>> hex(hash(obj))
> '0x7f1f058070b'
> 
>>>> sample = (object() for _ in range(10))
>>>> all(id(obj) >> 4 == hash(obj) for obj in sample)
> True
> 
>> would fall into similar buckets if they were created at similar times,
>> regardless of their value, rather than being well distributed.
> 
> If that were the problem it wouldn't be solved by the current approach:
> 
>>>> sample = [object() for _ in range(10)]
>>>> [hash(b) - hash(a) for a, b in zip(sample, sample[1:])]
> [1, 1, 1, 1, 1, 1, 1, 1, 1]

Arguably that's a flaw with the current approach that (maybe?) makes object()'s
hash too closely. But:

- perhaps it doesn't matter in practice, since the hash is taken modulo 
  the size of the hash table;

- or maybe Python's dicts and sets are good enough that a difference 
  of 1 is sufficient to give a good distribution of objects in the
  hash table;

- or maybe it does matter, but since people hardly ever use object() 
  itself as the keys in dicts, it doesn't come up.

Here's your example with a class that inherits from object, rather than object
itself:

py> class X(object):
...     pass
...
py> sample = [X() for x in range(10)]
py> [hash(b) - hash(a) for a, b in zip(sample, sample[1:])]
[-5338, -10910, -2976, -2284, -21326, 4, -8, 2, -4]

So my money is on object() being anomalous: because it is so small, the hashes
end up so similar. For "typical" classes, the hash function does a much better
job of mixing the hash values up.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.