[Cython] Hash-based vtables

Robert Bradshaw robertwb at gmail.com
Wed Jun 6 23:00:58 CEST 2012


On Tue, Jun 5, 2012 at 2:41 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 06/05/2012 10:50 PM, Robert Bradshaw wrote:
>>
>> On Tue, Jun 5, 2012 at 1:10 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no>  wrote:
>>>
>>> On 06/04/2012 11:43 PM, Robert Bradshaw wrote:
>>>>
>>>>
>>>> On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no>    wrote:
>>>>>
>>>>>
>>>>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Robert and I had a long discussion on the NumFOCUS list about this
>>>>>> already, but I figured it was better to continue it here and provide
>>>>>> more in-depth benchmark results.
>>>>>>
>>>>>> It's basically a new idea of how to provide a vtable based on perfect
>>>>>> hashing, which should be a lot simpler to implement than what I first
>>>>>> imagined.
>>>>>>
>>>>>> I'll write down some context first; if you're familiar with this,
>>>>>> skip ahead a bit.
>>>>>>
>>>>>> This means that you can do fast dispatches *without* the messy
>>>>>> business of binding vtable slots at compile time. To be concrete, this
>>>>>> might e.g. take the form
>>>>>>
>>>>>> def f(obj):
>>>>>>     obj.method(3.4)  # try to find a vtable with "void method(double)" in it
>>>>>>
>>>>>> or, a more typed approach,
>>>>>>
>>>>>> # File A
>>>>>> cdef class MyImpl:
>>>>>>     def double method(double x): return x * x
>>>>>>
>>>>>> # File B
>>>>>> # Here we never know about MyImpl, hence "duck-typed"
>>>>>> @cython.interface
>>>>>> class MyIntf:
>>>>>>     def double method(double x): pass
>>>>>>
>>>>>> def f(MyIntf obj):
>>>>>>     # obj *can* be a MyImpl instance, or whatever else supports
>>>>>>     # that interface
>>>>>>     obj.method(3.4)
>>>>>>
>>>>>>
>>>>>> Now, the idea to implement this is:
>>>>>>
>>>>>> a) Both caller and callee pre-hash the name/argument string
>>>>>> "mymethod:iidd" to 64 bits of hash data (probably the lower 64 bits
>>>>>> of its md5)
>>>>>>
>>>>>> b) Callee (MyImpl) generates a vtable of its methods by *perfect*
>>>>>> hashing. What you do is define a final hash fh as a function
>>>>>> of the pre-hash ph, for instance
>>>>>>
>>>>>> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^
>>>>>>       (ph >> vtable.r3)) & vtable.m
>>>>>>
>>>>>> (Robert and I are benchmarking different functions to use here.) By
>>>>>> playing with r1, r2, r3, you get 64**3 choices of hash function and
>>>>>> will be able to pick a combination which gives *no* (or very few)
>>>>>> collisions.
>>>>>>
>>>>>> c) The caller then combines the pre-hash, generated at compile time,
>>>>>> with the r1, r2, r3, m stored in the vtable header, to find the
>>>>>> final location in the hash table.
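>>>>>>
>>>>>> Putting a)-c) together, here is a minimal C sketch with hypothetical
>>>>>> struct and field names -- just to illustrate the scheme, not a
>>>>>> proposed ABI:
>>>>>>
>>>>>> #include <stdint.h>
>>>>>>
>>>>>> typedef struct {
>>>>>>     uint64_t id;         /* pre-hash of "name:signature" */
>>>>>>     void *funcptr;
>>>>>> } entry_t;
>>>>>>
>>>>>> typedef struct {
>>>>>>     uint8_t r1, r2, r3;  /* shift amounts chosen by the callee */
>>>>>>     uint64_t m;          /* mask; table size minus one */
>>>>>>     entry_t entries[64]; /* really table_size entries */
>>>>>> } vtable_t;
>>>>>>
>>>>>> /* Caller side; ph is a compile-time constant. */
>>>>>> static void *lookup(const vtable_t *vtable, uint64_t ph)
>>>>>> {
>>>>>>     uint64_t fh = ((ph >> vtable->r1) ^ (ph >> vtable->r2) ^
>>>>>>                    (ph >> vtable->r3)) & vtable->m;
>>>>>>     const entry_t *e = &vtable->entries[fh];
>>>>>>     return (e->id == ph) ? e->funcptr : NULL;  /* NULL => miss */
>>>>>> }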
>>>>>>
>>>>>> The exciting thing is that in benchmarks, the performance penalty
>>>>>> over a C++-style vtable is actually very slight. (Of course you can
>>>>>> cache a proper vtable, but the fact that you get so close without
>>>>>> caring about caching means that this can be done much faster.)
>>>>
>>>>
>>>>
>>>> One advantage of caching a vtable is that one can possibly put in
>>>> adapters for non-exact matches. It also opens up the possibility of
>>>> putting in stubs to call def methods if they exist. This needs to be
>>>> fleshed out more (another CEP :), but could provide for a
>>>> backwards-compatible, easy first implementation.
>>>>
>>>>>> Back to my and Robert's discussion on benchmarks:
>>>>>>
>>>>>> I've uploaded benchmarks here:
>>>>>>
>>>>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench
>>>>>>
>>>>>> I've changed how the benchmark numbers are taken to give more robust
>>>>>> results (at least for me); you want to look at the 'min' column.
>>>>>>
>>>>>> I changed the benchmark a bit so that it benchmarks a *call site*.
>>>>>> So we don't pass 'h' on the stack; instead we either a) look it up
>>>>>> in a global variable (the default), or b) treat it as a compile-time
>>>>>> constant (an immediate in assembly; compile with -DIMHASH).
>>>>>>
>>>>>> Similarly, the ID is either an "interned" global variable, or an
>>>>>> immediate (-DIMID).
>>>>>>
>>>>>> The results are very different on my machine depending on this aspect.
>>>>>> My conclusions:
>>>>>>
>>>>>> - All of three shifts with masking, two shifts with a "fallback
>>>>>> slot" (allowing for a single collision), three shifts, and two
>>>>>> shifts with two masks allow for constructing good vtables. In the
>>>>>> case of only two shifts, one colliding method gets the
>>>>>> twoshift+fback performance and the rest get the twoshift
>>>>>> performance.
>>>>>>
>>>>>> - Performance is really more affected by whether hashes are
>>>>>> immediates or global variables than by the hash function. This is in
>>>>>> contrast to the interning vs. key benchmarks -- so I think that if
>>>>>> we looked up the vtable through PyTypeObject, rather than getting
>>>>>> the vtable directly, the loads of the global variables could
>>>>>> potentially be masked by that.
>>>>>>
>>>>>> - My conclusion: Just use the lower bits of md5 *both* for the
>>>>>> hashing and the ID-ing (don't bother with any interning), and
>>>>>> compile the thing as a 64-bit immediate. This can cause
>>>>>> crashes/stack smashes etc. if there are lower-64-bits-of-md5
>>>>>> collisions, but a) the probability is incredibly small, b) it would
>>>>>> only matter in situations that should cause an AttributeError
>>>>>> anyway, c) if we really care, we can always use an interning-like
>>>>>> mechanism to validate on module loading that its hashes don't
>>>>>> collide with other hashes (and raise an exception "Congratulations,
>>>>>> you've discovered a phenomenal md5 collision, get in touch with the
>>>>>> Cython devs and we'll work around it right away").
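>>>>>>
>>>>>> As an illustration of how the 64-bit ID could be baked in at
>>>>>> code-generation time, here is a sketch using OpenSSL's one-shot MD5
>>>>>> (the function name is hypothetical):
>>>>>>
>>>>>> #include <stdint.h>
>>>>>> #include <string.h>
>>>>>> #include <openssl/md5.h>
>>>>>>
>>>>>> /* e.g. prehash("mymethod:iidd") */
>>>>>> static uint64_t prehash(const char *sig)
>>>>>> {
>>>>>>     unsigned char md[MD5_DIGEST_LENGTH];
>>>>>>     uint64_t ph;
>>>>>>     MD5((const unsigned char *)sig, strlen(sig), md);
>>>>>>     memcpy(&ph, md, sizeof(ph));  /* lower 64 bits of the digest */
>>>>>>     return ph;
>>>>>> }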
>>>>
>>>>
>>>>
>>>> Due to the birthday paradox, this seems a bit risky. Maybe it's
>>>> because I regularly work with collections much bigger than 2^32, and I
>>>> suppose we're talking about unique method names and signatures here,
>>>> but still... I wonder what the penalty would be for checking the full
>>>> 128-bit hash. (Storing it could allow for greater entropy in the
>>>> optimal hash table search as well.)
>>>
>>>
>>>
>>> Wonder no more. Here's the penalty for different bit-lengths, all
>>> compile-time constants:
>>>
>>>     threeshift: min=6.08e-09  mean=6.11e-09  std=2.81e-11  val=1200000000.000000
>>>   threeshift96: min=7.53e-09  mean=7.55e-09  std=1.96e-11  val=1200000000.000000
>>>  threeshift128: min=6.95e-09  mean=6.97e-09  std=2.57e-11  val=1200000000.000000
>>>  threeshift160: min=8.17e-09  mean=8.23e-09  std=4.06e-11  val=1200000000.000000
>>>
>>> And for comparison, when loading the comparison IDs from global variable:
>>>
>>>     threeshift: min=6.46e-09  mean=6.52e-09  std=4.95e-11  val=1200000000.000000
>>>   threeshift96: min=8.07e-09  mean=8.16e-09  std=4.55e-11  val=1200000000.000000
>>>  threeshift128: min=8.06e-09  mean=8.18e-09  std=6.71e-11  val=1200000000.000000
>>>  threeshift160: min=9.71e-09  mean=9.83e-09  std=5.12e-11  val=1200000000.000000
>>>
>>> So indeed,
>>>
>>> 64-bit hash < interning < 128-bit hash
>>>
>>> (At least on my Intel Nehalem Core i7 at 1.87 GHz.)
>>>
>>> And the load of the global variable may in real life be hidden by other
>>> things going on in the function.
>>>
>>> And you save vtable memory by having an interned char* and not
>>> saving the hash in the vtable.
>>
>>
>> I'm OK with using the 64-bit hash with a macro to enable further
>> checking. If it becomes an issue, we can partition the vtable into two
>> separate structures (hash64/pointer/flags? + hash160/char*/metadata).
>> That's probably overkill. With an eye to security, perhaps the spec
>> should be sha1 (or sha2?, not sure if that ships with Python).
>
>
> No, I like splitting up the table, I was assuming we'd stick the char* in a
> different table anyway. Cache is precious, and the second table would be
> completely cold in most situations.
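>
> (A hypothetical layout for such a split -- the hot table is all the
> dispatch path touches; the cold table, indexed in parallel, is only
> consulted for verification and introspection:)
>
> #include <stdint.h>
>
> typedef struct {
>     uint64_t id;           /* lower 64 bits of md5("name:signature") */
>     void *funcptr;
>     uint64_t flags;
> } hot_entry_t;             /* touched on every dispatch */
>
> typedef struct {
>     uint8_t full_hash[20]; /* e.g. the full sha1, per the suggestion above */
>     const char *signature; /* interned "mymethod:iidd" */
> } cold_entry_t;            /* same index as the hot table; cold in cache */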
>
> Is the goal then to avoid having to have an interning registry?

Yes, and to avoid invoking an expensive hash function at runtime in
order to achieve good distribution.

> Something that hasn't come up so far is that Cython doesn't know the exact
> types of external typedefs, so it can't generate the hash at Cythonize time.
> I guess some support for build systems to probe for type sizes and compute
> the signature hashes in a separate header file would solve this -- with a
> fallback to computing them at runtime at module loading, if you're not
> using a supported build system. (But suddenly an interning registry doesn't
> look so horrible..)

It all depends on how strict you want to be. It may be acceptable to
let f(int) and f(long) not hash to the same value even if sizeof(int)
== sizeof(long). We could also promote all int types to long or long
long, including extern types (assuming, with a C-compile-time check,
that external types declared up to "long" are <= sizeof(long)). Another
option is to let the hash be md5(sig) + hashN(sizeof(extern_arg1), ...,
sizeof(extern_argN)) where hashN is a macro.
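
A sketch of that last option (mixing with xor and arbitrary odd
constants rather than addition; the point is only that the C compiler
folds sizeof() into the hash at C compile time, so external typedefs
need not be resolved at Cythonize time):

#include <stdint.h>

/* hashN for N == 2; one such macro per arity */
#define HASH2(md5lo64, s1, s2) \
    ((uint64_t)(md5lo64) ^ ((uint64_t)(s1) * 0x9E3779B97F4A7C15ULL) \
                         ^ ((uint64_t)(s2) * 0xC2B2AE3D27D4EB4FULL))

/* usage: uint64_t h = HASH2(<md5 lower 64 bits>,
                             sizeof(extern_t1), sizeof(extern_t2)); */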

> Really, I think a micro-benchmark is rather pessimistic about the
> performance of loading a global variable -- if more stuff happens around
> the call site then the load will likely be moved ahead and the latency
> hidden. Perhaps this would even be the case just for the route through
> extensibletypeobject.
>
>
>>> They should be made more easily runnable so that we could run them on
>>> various systems, but it makes sense to first read up on and figure out
>>> which
>>> hash functions are really viable, to keep the number of numbers down.
>>>
>>> I just realized that I never pushed the changes I did to introduce
>>> -DIMHASH/-DIMID etc., but the benchmarks are pushed now.
>>>
>>>
>>>
>>>> We could also do a fallback table. Usually it'd be empty; occasionally
>>>> it'd have one element in it. It'd always be possible to make this big
>>>> enough to avoid collisions in a worst-case scenario.
>>>
>>>
>>>
>>> If you do a fallback table, it's as much code at the call site as
>>> linear probing...
>>
>>
>> Is linear probing that bad? It's an extra increment and compare in the
>> miss case.
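>>
>> (For reference, reusing the hypothetical entry_t/vtable_t/fh names from
>> the sketch further up, the body of lookup() with linear probing would be
>> roughly:)
>>
>> entry_t *e = &vtable->entries[fh];
>> while (e->id != ph) {
>>     if (e->id == 0)
>>         return NULL;  /* empty slot: miss */
>>     ++e;              /* assumes the table is padded so this can't overrun */
>> }
>> return e->funcptr;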
>>
>>> But when I played with the generation side, a failure to create a table
>>> at a
>>> given size would *always* be due to a single collision. This is what I
>>> did
>>> in the twoshift+fback benchmark.
>>
>>
>> But it won't always be. One can always increase the size of the main
>> table, however, if two collisions are rare enough.
>
>
> Yes of course, I didn't test 100% fill of a 64-entry table. I was more
> concerned with making the table 128 or 256 rather than having to go to 512
> :-)
>
>
>>>> Duplicate tables work as long as there aren't too many orthogonal
>>>> considerations. Is the GIL the only one? What about "I can propagate
>>>> errors"? Now we're up to 4 tables...
>>>
>>>
>>> Would your decision of whether or not to dispatch to a function depend on
>>> whether or not it propagates errors?
>>>
>>> I'm thinking of the "with gil" function case, i.e. callee has:
>>>
>>>  a) Function to call if you have the GIL
>>>  b) GIL-acquiring wrapper
>>>
>>> and you want GIL-holding code to call a) and nogil code to call b).
>>>
>>> But one could just make the caller acquire the GIL if needed (which in
>>> that
>>> case is so expensive anyway that it can be made the unlikely() path).
>>
>>
>> Are you saying you'd add code to the call site to determine if it
>> needs (and conditionally acquire) the GIL?
>
>
> Well, I'm saying it's an alternative, I'm not sure if it has merit.
> Basically shift the "with gil" responsibility to the caller in this case.
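>
> (Roughly, in nogil caller code -- NEEDS_GIL is a hypothetical flag
> stored with the entry; PyGILState_Ensure/Release are the regular
> CPython APIs:)
>
>     if (unlikely(e->flags & NEEDS_GIL)) {
>         PyGILState_STATE st = PyGILState_Ensure();
>         ret = ((double (*)(double))e->funcptr)(3.4);
>         PyGILState_Release(st);
>     } else {
>         ret = ((double (*)(double))e->funcptr)(3.4);
>     }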
>
>
>>
>>> I can't think of other situations where you would pick which function to
>>> call based on flags.
>>
>>
>> If the caller doesn't propagate errors, it may want to have different
>> codepaths depending on whether the callee propagates them.
>
>
> Not sure if I understand. Would you call a *different* incarnation of the
> callee depending on this, and need different function pointers for different
> callers?
>
> Otherwise you just check flags after the call and take the appropriate
> action, with a likely() around the likely one. You need flags, but not a
> different table.
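>
> (I.e., with hypothetical flag and sentinel names, something like:)
>
>     ret = ((double (*)(double))e->funcptr)(3.4);
>     if ((e->flags & PROPAGATES_ERRORS) && unlikely(ret == ERROR_SENTINEL))
>         goto error;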

Fair enough.

- Robert

