Planning a Python Course for Beginners

Thu Aug 10 18:01:14 EDT 2017

Chris Angelico <rosuav at gmail.com>:

> On Fri, Aug 11, 2017 at 7:17 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> That's interesting, but suggests there's something weird (~ suboptimal)
>> going on with CPython's scrambling algorithm. Also, your performance
>> test might yield completely different results on other Python
>> implementations.
>>
>> Apart from uniqueness, there's no particular reason to prefer one
>> __hash__() value over another as long as the interesting bits are inside
>> the CPU's simple integer range.
>
> Not true. Every time you probe a new location [1], you have to fetch
> more data from RAM. That delays you significantly. An ideal hashing
> system is going to give a high probability of giving you an empty
> bucket on the first try, to minimize the number of main memory
> accesses required.
>
> CPython's scrambling algorithm means that, even when its first try
> doesn't succeed, there's a good chance that its second will succeed.
> But that doesn't change the fact that you want the first one to
> succeed as often as possible.

What does all that have to do with where the unique bits are in the hash
value?

      0x10000000
      0x20000000
      0x30000000
      0x40000000

should be no worse as __hash__() values than

      0x1000
      0x2000
      0x3000
      0x4000

or

      1
      2
      3
      4

Of course, some algorithms can (and, we have learned, do) prefer some
bits over others, but that's inside the implementation black box. I
would think every bit should carry an approximately equal weight.

Marko