[Cython] Hash-based vtables

Tue Jun 5 00:30:38 CEST 2012

On Mon, Jun 4, 2012 at 3:07 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
>
>
> Robert Bradshaw <robertwb at gmail.com> wrote:
>
>>On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn
>><d.s.seljebotn at astro.uio.no> wrote:
>>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
>>>>
>>>> Me and Robert had a long discussion on the NumFOCUS list about this
>>>> already, but I figured it was better to continue it and provide more
>>>> in-depth benchmark results here.
>>>>
>>>> It's basically a new idea of how to provide a vtable based on
>>perfect
>>>> hashing, which should be a lot simpler to implement than what I
>>first
>>>> imagined.
>>>>
>>>> I'll write down some context first, if you're familiar with this
>>>> skip ahead a bit..
>>>>
>>>> This means that you can do fast dispatches *without* the messy
>>>> business of binding vtable slots at compile time. To be concrete,
>>this
>>>> might e.g. take the form
>>>>
>>>> def f(obj):
>>>> obj.method(3.4) # try to find a vtable with "void method(double)" in
>>it
>>>>
>>>> or, a more typed approach,
>>>>
>>>> # File A
>>>> cdef class MyImpl:
>>>> def double method(double x): return x * x
>>>>
>>>> # File B
>>>> # Here we never know about MyImpl, hence "duck-typed"
>>>> @cython.interface
>>>> class MyIntf:
>>>> def double method(double x): pass
>>>>
>>>> def f(MyIntf obj):
>>>> # obj *can* be MyImpl instance, or whatever else that supports
>>>> # that interface
>>>> obj.method(3.4)
>>>>
>>>>
>>>> Now, the idea to implement this is:
>>>>
>>>> a) Both caller and callee pre-hash name/argument string
>>>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of
>>>> md5)
>>>>
>>>> b) Callee (MyImpl) generates a vtable of its methods by *perfect*
>>>> hashing. What you do is define a final hash fh as a function
>>>> of the pre-hash ph, for instance
>>>>
>>>> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) &
>>>> vtable.m
>>>>
>>>> (Me and Robert are benchmarking different functions to use here.) By
>>>> playing with r1, r2, r3, you have 64**3 choices of hash function,
>>and
>>>> will be able to pick a combination which gives *no* (or very few)
>>>> collisions.
>>>>
>>>> c) Caller then combines the pre-hash generated at compile-time, with
>>>> r1, r2, r3, m stored in the vtable header, in order to find the
>>>> final location in the hash-table.
>>>>
>>>> The exciting thing is that in benchmark, the performance penalty is
>>>> actually very slight over a C++-style v-table. (Of course you can
>>>> cache a proper vtable, but the fact that you get so close without
>>>> caring about caching means that this can be done much faster.)
>>
>>One advantage about caching a vtable is that one can possibly put in
>>adapters for non-exact matches. It also opens up the possibility of
>>putting in stubs to call def methods if they exist. This needs to be
>>fleshed out more, (another CEP :) but could provide for a
>>backwards-compatible easy first implementation.
>>
>>>> Back to my and Robert's discussion on benchmarks:
>>>>
>>>> I've uploaded benchmarks here:
>>>>
>>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench
>>>>
>>>> I've changed the benchmark taking to give more robust numbers (at
>>>> least for me), you want to look at the 'min' column.
>>>>
>>>> I changed the benchmark a bit so that it benchmarks a *callsite*.
>>>> So we don't pass 'h' on the stack, but either a) looks it up in a
>>global
>>>> variable (default), or b) it's a compile-time constant (immediate in
>>>> assembly) (compile with -DIMHASH).
>>>>
>>>> Similarly, the ID is either an "interned" global variable, or an
>>>> immediate (-DIMID).
>>>>
>>>> The results are very different on my machine depending on this
>>aspect.
>>>> My conclusions:
>>>>
>>>> - Both three shifts with masking, two shifts with a "fallback slot"
>>>> (allowing for a single collision), three shifts, two shifts with
>>>> two masks allows for constructing good vtables. In the case of only
>>>> two shifts, one colliding method gets the twoshift+fback
>>>> performance and the rest gets the twoshift performance.
>>>>
>>>> - Performance is really more affected by whether hashes are
>>>> immediates or global variables than the hash function. This is in
>>>> contrast to the interning vs. key benchmarks -- so I think that if
>>>> we looked up the vtable through PyTypeObject, rather than getting
>>>> the vtable directly, the loads of the global variables could
>>>> potentially be masked by that.
>>>>
>>>> - My conclusion: Just use lower bits of md5 *both* for the hashing
>>>> and the ID-ing (don't bother with any interning), and compile the
>>>> thing as a 64-bit immediate. This can cause crashes/stack smashes
>>>> etc. if there's lower-64bit-of-md5 collisions, but a) the
>>>> probability is incredibly small, b) it would only matter in
>>>> situations that should cause an AttributeError anyway, c) if we
>>>> really care, we can always use an interning-like mechanism to
>>>> validate on module loading that its hashes doesn't collide with
>>>> other hashes (and raise an exception "Congratulations, you've
>>>> discovered a phenomenal md5 collision, get in touch with cython
>>>> devs and we'll work around it right away").
>>
>>Due to the birthday paradox, this seems a bit risky. Maybe it's
>>because I regularly work with collections much bigger than 2^32, and I
>>suppose we're talking about unique method names and signatures here,
>>but still... I wonder what the penalty would be for checking the full
>>128 bit hash. (Storing it could allow for greater entropy in the
>>optimal hash table search as well).
>>
>>> What I forgot to mention:
>>>
>>>  - I really want to avoid linear probing just because of the code
>>bloat in
>>> call sites.
>>
>>That's a good point. What about flags--are we throwing out the idea of
>>masking?
>>
>>> With two shifts, when there was a failure to find a perfect hash
>>> it was always possible to find one with a single collision.
>>>
>>>  - Probing for the hash with two shifts is lightning fast, it can
>>take a
>>> while with three shifts (though you can always spend more memory on a
>>bigger
>>> table to make it fast again). However, it makes me uneasy to penalize
>>the
>>> performance of calling one of the random methods, so I'm really in
>>favour of
>>> three-shifts or double-mask (to be decided when investigating the
>>> performance of probing for parameters in more detail).
>>>
>>>  - I tried using SSE to do shifts in parallel and failed (miserable
>>> performance). The problem is quickly moving things between general
>>purpose
>>> registers and SSE registers, and the lack of SSE immediates/constants
>>in the
>>> instruction stream. At least, what my gcc 4.6 generates appeared to
>>use the
>>> stack to communicate between SSE registers and general purpose
>>registers
>>> (but I can't have been doing the right thing..).
>>>
>>>
>>>
>>>>
>>>> The RTTI (i.e. the char*) is also put in there, but is not used for
>>>> comparison and is not interned.
>>>>
>>>> At least, that's what I think we should do for duck-style vtables.
>>>>
>>>> Do we then go to all the pain of defining key-encoding, interning
>>>> etc. just for SEP 201? Isn't it easier to just mandate a md5
>>dependency
>>>> and be done with it? (After all, md5 usually comes with Python in
>>the
>>>> md5 and hashlib modules)
>>>>
>>>> direct: Early-binding
>>>> index: Call slot 0 (C++-style vtable/function pointer)
>>>> noshift: h & m1
>>>> oneshift: (h >> r1) & m1
>>>> twoshift: ((h >> r1) ^ (h >> r2)) & m1
>>>> twoshift+fback: hash doesn't
>>>
>>>
>>> I meant: Hash collision and then, after a branch miss, look up the
>>one
>>> fallback slot in the vtable header.
>>
>>We could also do a fallback table. Usually it'd be empty, Occasionally
>>it'd have one element in it. It'd always be possible to make this big
>>enough to avoid collisions in a worst-case scenario.
>>
>>BTW, this is a general static char* -> void* dictionary, I bet it
>>could possibly have other uses. (It may also be a well-studied
>>problem, though a bit hard to search for...) I suppose we could reduce
>>it to read-optimized int -> int mappings.
>
>
> The C FAQ says 'if you know the contents of your hash table up front you can devise a perfect hash', but no details, probably just hand-waving.

I just found http://cmph.sourceforge.net/ which looks quite
interesting. Though the resulting hash functions are supposedly cheap,
I have the feeling that branching is considered cheap in this context.

> 128 bits gives more entropy for perfect hashing: some but not much since each shift r is hardwired to one 64 bit subset.

True. I don't have a good way to quantify the correlation between
different shifts of the same value (vs. truly random values) but it
didn't seem to be very significant in the experiments.

> From the interning/key benchmarks, checking the full 128 bits would probably not be noticeable in microbenchmarks, it's more about using an extra register and bloating the instruction cache and data cache a bit etc, stuff that can only be measured in production.

One could make the check optionally omitted at compile time. It would
still bloat the table, but not by much (or at all if we share with
flag bits as suggested below).

> The alternative is having a collision detection registry. If it complains, you're told where to edit Cython (perhaps a datafile) so that the pre-hash function changes:
>
> if signature equals 'foo:ddffi'
>   # known collision with 'bar:ii'
>    Use high 64 bits of md5
> Else:
>   Use low 64 bits of md5
>
> Each such collision is documented in the cep/sep.
>
> But 128 bit and then relying on luck is perhaps simpler...

Much.

> If we need flags, lets say that 92 bits suffice. for hash and use 16 for flags...
>
> But i was thinking that you'd have separate tables for nogil callers and gil-holding callers so that you didn't need to scan for matching flags. We really want this to be branch-miss-free. Still, flags are good for error return codes etc.

Duplicate tables works as long as there aren't too many orthogonal
considerations. Is the GIL the only one? What about "I can propagate
errors?" Now we're up to 4 tables...

> Do you agree on forgetting about the encoded keys/interning even for SEP 201? There's only so much effort to go around and I'd much rather use md5 and these hash tables everywhere.

Yes, for sure!

- Robert