[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu Jun 7 14:24:32 CEST 2012


On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote:
> On 06/07/2012 12:26 AM, Robert Bradshaw wrote:
>> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote:
>>>>
>>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>
>>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote:
>>>>>>
>>>>>>
>>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote:
>>>>>>>
>>>>>>>
>>>>>>> I just found http://cmph.sourceforge.net/ which looks quite
>>>>>>> interesting. Though the resulting hash functions are supposedly
>>>>>>> cheap,
>>>>>>> I have the feeling that branching is considered cheap in this
>>>>>>> context.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Actually, this lead was *very* promising. I believe the very first
>>>>>> reference I actually read through (and didn't eliminate after the
>>>>>> abstract) totally swept away our home-grown solutions!
>>>>>>
>>>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to
>>>>>>
>>>>>> understand, and fast both for generation and (the branch-free)
>>>>>> lookup:
>>>>>>
>>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf
>>>>>>
>>>>>> The idea is:
>>>>>>
>>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper
>>>>>> requires b > 2n, though I think in practice you can often get away
>>>>>> with less)
>>>>>>
>>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is
>>>>>> easily achieved since each group only has a few elements)
>>>>>>
>>>>>> - For each group, from largest to smallest: Find a displacement
>>>>>> `d[group]` so that `f(x) ^ d[group]` doesn't cause collisions.
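
For concreteness, here are those three steps in a rough Python sketch (this
is only an illustration, not the actual pagh99.py prototype; the particular
g(x) = k & (b-1) and f(x) = k >> 8 below are placeholder hashes):

    def build_table(keys, n, b):
        """Toy hash-and-displace construction; n and b are powers of two."""
        table = [None] * n           # the vtable slots
        d = [0] * b                  # one displacement per group
        groups = {}
        for k in keys:
            groups.setdefault(k & (b - 1), []).append(k)   # g(x): split into b groups
        # Place the largest groups first, while the table is still mostly empty.
        for gi, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
            for disp in range(n):    # try displacements until nothing collides
                slots = [((k >> 8) ^ disp) & (n - 1) for k in members]   # f(x) ^ d[g(x)]
                if len(set(slots)) == len(members) and all(table[s] is None for s in slots):
                    d[gi] = disp
                    for k, s in zip(members, slots):
                        table[s] = k
                    break
            else:
                return None          # this group can't be placed; grow b or n and retry
        return table, d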
>>>>>>
>>>>>> It requires extra storage for the displacement table. However, I think
>>>>>> 8 bits per element might suffice even for vtables of 512 or 1024 in
>>>>>> size. Even with 16 bits it's rather negligible compared to the
>>>>>> minimum-128-bit entries of the table.
>>>>>>
>>>>>> I benchmarked these hash functions:
>>>>>>
>>>>>> displace1: ((h >> r1) ^ d[h & 63]) & m1
>>>>>> displace2: ((h >> r1) ^ d[h & m2]) & m1
>>>>>> displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1
>>>>>>
>>>>>>
>>>>>> Only the third one is truly in the spirit of the algorithm, but I think
>>>>>> the first two should work well too (and when h is known at compile time,
>>>>>> looking up d[h & 63] isn't harder than looking up r1 or m1).
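
Written out as Python (purely illustrative -- r1, r2, m1, m2 and d would be
per-table constants, passed explicitly here only to show which values each
variant needs):

    def displace1(h, r1, m1, d):
        # d has a fixed 64 entries, indexed directly by the low bits of h
        return ((h >> r1) ^ d[h & 63]) & m1

    def displace2(h, r1, m1, m2, d):
        # like displace1, but the size of d (m2 + 1) is a per-table constant
        return ((h >> r1) ^ d[h & m2]) & m1

    def displace3(h, r1, r2, m1, m2, d):
        # the variant truest to the paper: d is indexed through a second shift
        return ((h >> r1) ^ d[(h >> r2) & m2]) & m1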
>>>>>>
>>>>>>
>>>>>> My computer is acting up and all my numbers today are slower than the
>>>>>> earlier ones (yes, I disabled turbo-mode in the BIOS a year ago, and
>>>>>> yes, I've pinned the CPU speed). But here are today's numbers, compiled
>>>>>> with -DIMHASH:
>>>>>>
>>>>>> direct:     min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000
>>>>>> index:      min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000
>>>>>> twoshift:   min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000
>>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000
>>>>>> displace1:  min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000
>>>>>> displace2:  min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000
>>>>>> displace3:  min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000
>>>>>>
>>>>>>
>>>>>> I did a dirty prototype of the table-finder as well and it works:
>>>>>>
>>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>>>
>>>>>
>>>>>
>>>>> The paper obviously puts more effort into minimizing table size than
>>>>> into a fast lookup. My hunch is that our choice should be
>>>>>
>>>>> ((h >> table.r) ^ table.d[h & m2]) & m1
>>>>>
>>>>>
>>>>> and use an 8-bit d (because even if you have 1024 methods, you'd rather
>>>>> double the number of bins than have those 2 extra bits available for
>>>>> displacement options).
>>>>>
>>>>> Then keep incrementing the size of d and the number of table slots (in
>>>>> such an order that the total vtable size is minimized) until success.
>>>>> In practice this should almost always just increase the size of d, and
>>>>> keep the table size at the lowest 2**k that fits the slots (even for 64
>>>>> or 128 methods :-))
>>>>>
>>>>> Essentially we avoid the shift in the argument to d[] by making d
>>>>> larger.
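
A rough Python sketch of that strategy (illustrative only; build_table
stands for a construction routine like the one sketched earlier or the
pagh99.py prototype, and the lookup would of course be inlined C in
practice):

    def lookup(table, h):
        # the proposed branch-free lookup; r, d, m1, m2 live in the table struct
        return ((h >> table.r) ^ table.d[h & table.m2]) & table.m1

    def find_layout(keys):
        # Start with the smallest power-of-two table that fits the methods and
        # a displacement array of the same size; grow d first (it's only 8 or
        # 16 bits per entry) and the table itself only as a last resort.
        n = 1
        while n < len(keys):
            n *= 2
        b = n
        while True:
            result = build_table(keys, n, b)
            if result is not None:
                return n, b, result
            if b < 2 * n:
                b *= 2
            else:
                n *= 2
                b = n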
>>>>
>>>>
>>>> Nice. I'm surprised that the indirection on d doesn't cost us much;
>>>
>>>
>>> Well, table->d[const & const] compiles down to the same kind of code as
>>> table->m1. I guess I'm surprised too that displace2 doesn't penalize.
>>>
>>>
>>>> hopefully its size wouldn't be a big issue either. What kinds of
>>>> densities were you achieving?
>
> OK, simulation results just in (for the displace2 hash), and they
> exceeded my expectations.
>
> I always fill the table with n=2^k keys, and fix b = n (b means |d|).
> Then the failure rates are (top two are 100,000 simulations, the rest
> are 1000 simulations):
>
> n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65
> n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65
> n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25
> n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29
> n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22
> n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37
> n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26
> n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34
>
> Try-mean and try-max are how many r's needed to be tried before success,
> so they give an indication of how much margin is left before failure.
>
> For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to
> let b=2*n (100,000 simulations):
>
> n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65
> n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65
>
> NOTE: The 512...2048 results were with 16-bit displacements; with 8-bit
> displacements they mostly failed. So we either need to make each element
> of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which
> succeeded most of the time with 8-bit displacements). I'm +1 on 16-bit
> displacements.
>
> The algorithm is rather fast and concise:
>
> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>
>>> The algorithm is designed for 100% density in the table itself. (We can
>>> lift that to compensate for a small space of possible hash functions, I
>>> guess.)
>>>
>>> I haven't done proper simulations yet, but I just tried |vtable|=128,
>>> |d|=128 from the command line and I had 15 successes or so before the
>>> first failure. That's with a 100% density in the vtable itself! (And
>>> when it fails, you increase |d| to get your success.)
>>>
>>> The caveat is the space spent on d (it's small in comparison, but that's
>>> why this isn't too good to be true).
>>>
>>> A disadvantage might be that we may no longer have the option of a
>>> non-power-of-two table size (i.e. replacing the mask with "if
>>> (likely(slot < n))"). I think for that to work one would need to replace
>>> the xor group with addition on Z_d.
>>>
>>>
>>>> Going back to the idea of linear probing on a cache miss, this has the
>>>> advantage that one can write a brain-dead provider that sets m=0 and
>>>> simply lists the methods instead of requiring a table optimizer. (Most
>>>> tools, of course, would do the table optimization.) It also lets you
>>>> get away with a "kind-of good" hash rather than requiring you to search
>>>> until you find a (larger?) perfect one.
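
Such a fallback could look roughly like this (a Python stand-in with
invented names; the point is only that a provider with a degenerate hash
still works, just slower):

    def lookup_or_probe(table, method_id, h):
        # Fast path: the hashed slot. A brain-dead provider that set m1 = 0
        # (and just listed its methods in order) sends every hash to slot 0,
        # so this will usually miss ...
        slot = ((h >> table.r) ^ table.d[h & table.m2]) & table.m1
        if table.entries[slot].id == method_id:
            return table.entries[slot].func
        # ... and the caller falls back to a linear scan of the entries.
        for entry in table.entries:
            if entry.id == method_id:
                return entry.func
        return None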
>>>
>>>
>>> Well, given that we can have 100% density, and generating the table is
>>> lightning fast, and the C code to generate the table is likely a
>>> 300-line utility... I'm not convinced.
>>
>> It goes from an extraordinarily simple spec (the table is, at minimum, a
>> func[2^k] with a couple of extra zero fields, whose struct can be
>> statically defined in the source by hand) to one that is, well, not
>> complicated in the absolute sense, but much more so than the definition
>> above. It is also variable-size, which makes allocating it globally/on a
>> stack a pain (I suppose one can choose an upper bound for |d| and
>> |vtable|).
>>
>> I am playing devil's advocate a bit here; it's probably just a (minor)
>> con, but worth noting at least.
>
> If you were willing to go the interning route, so that you didn't need
> to fill the table with md5 hashes anyway, I'd say you'd have a stronger
> point :-)

Here's a good reason to demand perfect hashing in the callee:

Suppose you want to check the interface once up front, then keep using the
vtable -- e.g., *if* we want Cython to raise TypeError on the interface
coercion *and* we decide we don't want to mess with constructing C++-style
vtables on the fly, then code like this:

cdef f(SomeInterface obj):
     return obj.some_method(1.0)

would simply expect that the vtable contains the method, and skip the
ID comparison entirely. No comparison at all is faster than either a 64-bit
hash comparison or an interned comparison. :-)
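
Schematically (a Python stand-in for the generated C, with invented names),
the difference is whether the call site keeps any ID comparison at all:

    def call_checked(table, method_id, h, obj, arg):
        # Without a prior interface check, the entry's ID must be compared on
        # every call before the function pointer can be trusted.
        slot = ((h >> table.r) ^ table.d[h & table.m2]) & table.m1
        entry = table.entries[slot]
        if entry.id != method_id:
            raise TypeError("object does not provide some_method")
        return entry.func(obj, arg)

    def call_trusted(table, h, obj, arg):
        # Interface checked once at coercion time + perfect hashing guaranteed
        # by the callee: the comparison disappears from the call entirely.
        slot = ((h >> table.r) ^ table.d[h & table.m2]) & table.m1
        return table.entries[slot].func(obj, arg)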

I'm not saying the above decisions must be made, but the possibility 
seems reason enough to demand perfect hashing.

Dag

