[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Tue Jun 5 00:07:30 CEST 2012



Robert Bradshaw <robertwb at gmail.com> wrote:

>On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn
><d.s.seljebotn at astro.uio.no> wrote:
>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
>>>
>>> Me and Robert had a long discussion on the NumFOCUS list about this
>>> already, but I figured it was better to continue it and provide more
>>> in-depth benchmark results here.
>>>
>>> It's basically a new idea of how to provide a vtable based on
>>> perfect hashing, which should be a lot simpler to implement than
>>> what I first imagined.
>>>
>>> I'll write down some context first; if you're familiar with this,
>>> skip ahead a bit.
>>>
>>> This means that you can do fast dispatches *without* the messy
>>> business of binding vtable slots at compile time. To be concrete,
>>> this might e.g. take the form
>>>
>>> def f(obj):
>>> obj.method(3.4) # try to find a vtable with "void method(double)" in it
>>>
>>> or, a more typed approach,
>>>
>>> # File A
>>> cdef class MyImpl:
>>> def double method(double x): return x * x
>>>
>>> # File B
>>> # Here we never know about MyImpl, hence "duck-typed"
>>> @cython.interface
>>> class MyIntf:
>>> def double method(double x): pass
>>>
>>> def f(MyIntf obj):
>>> # obj *can* be MyImpl instance, or whatever else that supports
>>> # that interface
>>> obj.method(3.4)
>>>
>>>
>>> Now, the idea to implement this is:
>>>
>>> a) Both caller and callee pre-hash name/argument string
>>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of
>>> md5)
>>>
>>> b) Callee (MyImpl) generates a vtable of its methods by *perfect*
>>> hashing. What you do is define a final hash fh as a function
>>> of the pre-hash ph, for instance
>>>
>>> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) &
>>> vtable.m
>>>
>>> (Me and Robert are benchmarking different functions to use here.) By
>>> playing with r1, r2, r3, you have 64**3 choices of hash function,
>>> and will be able to pick a combination which gives *no* (or very
>>> few) collisions.
>>>
>>> c) Caller then combines the pre-hash generated at compile-time, with
>>> r1, r2, r3, m stored in the vtable header, in order to find the
>>> final location in the hash-table.
>>>
>>> The exciting thing is that in benchmark, the performance penalty is
>>> actually very slight over a C++-style v-table. (Of course you can
>>> cache a proper vtable, but the fact that you get so close without
>>> caring about caching means that this can be done much faster.)
>
>One advantage about caching a vtable is that one can possibly put in
>adapters for non-exact matches. It also opens up the possibility of
>putting in stubs to call def methods if they exist. This needs to be
>fleshed out more (another CEP :), but could provide for a
>backwards-compatible, easy first implementation.
>
>>> Back to my and Robert's discussion on benchmarks:
>>>
>>> I've uploaded benchmarks here:
>>>
>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench
>>>
>>> I've changed how the benchmarks are taken to give more robust
>>> numbers (at least for me); you want to look at the 'min' column.
>>>
>>> I changed the benchmark a bit so that it benchmarks a *callsite*.
>>> So we don't pass 'h' on the stack, but either a) look it up in a
>>> global variable (default), or b) it's a compile-time constant
>>> (immediate in assembly) (compile with -DIMHASH).
>>>
>>> Similarly, the ID is either an "interned" global variable, or an
>>> immediate (-DIMID).
>>>
>>> The results are very different on my machine depending on this
>>> aspect.
>>> My conclusions:
>>>
>>> - All of three shifts with masking, two shifts with a "fallback
>>> slot" (allowing for a single collision), three shifts, and two
>>> shifts with two masks allow for constructing good vtables. In the
>>> case of only two shifts, the one colliding method gets the
>>> twoshift+fback performance and the rest get the twoshift
>>> performance.
>>>
>>> - Performance is really affected more by whether hashes are
>>> immediates or global variables than by the hash function. This is in
>>> contrast to the interning vs. key benchmarks -- so I think that if
>>> we looked up the vtable through PyTypeObject, rather than getting
>>> the vtable directly, the loads of the global variables could
>>> potentially be masked by that.
>>>
>>> - My conclusion: Just use lower bits of md5 *both* for the hashing
>>> and the ID-ing (don't bother with any interning), and compile the
>>> thing as a 64-bit immediate. This can cause crashes/stack smashes
>>> etc. if there's lower-64bit-of-md5 collisions, but a) the
>>> probability is incredibly small, b) it would only matter in
>>> situations that should cause an AttributeError anyway, c) if we
>>> really care, we can always use an interning-like mechanism to
>>> validate on module loading that its hashes don't collide with
>>> other hashes (and raise an exception "Congratulations, you've
>>> discovered a phenomenal md5 collision, get in touch with cython
>>> devs and we'll work around it right away").
>
>Due to the birthday paradox, this seems a bit risky. Maybe it's
>because I regularly work with collections much bigger than 2^32, and I
>suppose we're talking about unique method names and signatures here,
>but still... I wonder what the penalty would be for checking the full
>128 bit hash. (Storing it could allow for greater entropy in the
>optimal hash table search as well).
>
>> What I forgot to mention:
>>
>>  - I really want to avoid linear probing just because of the code
>> bloat in call sites.
>
>That's a good point. What about flags--are we throwing out the idea of
>masking?
>
>> With two shifts, when there was a failure to find a perfect hash
>> it was always possible to find one with a single collision.
>>
>>  - Probing for the hash with two shifts is lightning fast; it can
>> take a while with three shifts (though you can always spend more
>> memory on a bigger table to make it fast again). However, it makes
>> me uneasy to penalize the performance of calling one of the random
>> methods, so I'm really in favour of three-shifts or double-mask (to
>> be decided when investigating the performance of probing for
>> parameters in more detail).
>>
>>  - I tried using SSE to do shifts in parallel and failed (miserable
>> performance). The problem is quickly moving things between general
>> purpose registers and SSE registers, and the lack of SSE
>> immediates/constants in the instruction stream. At least, what my
>> gcc 4.6 generates appeared to use the stack to communicate between
>> SSE registers and general purpose registers (but I can't have been
>> doing the right thing..).
>>
>>
>>
>>>
>>> The RTTI (i.e. the char*) is also put in there, but is not used for
>>> comparison and is not interned.
>>>
>>> At least, that's what I think we should do for duck-style vtables.
>>>
>>> Do we then go to all the pain of defining key-encoding, interning
>>> etc. just for SEP 201? Isn't it easier to just mandate an md5
>>> dependency and be done with it? (After all, md5 usually comes with
>>> Python in the md5 and hashlib modules)
>>>
>>> direct: Early-binding
>>> index: Call slot 0 (C++-style vtable/function pointer)
>>> noshift: h & m1
>>> oneshift: (h >> r1) & m1
>>> twoshift: ((h >> r1) ^ (h >> r2)) & m1
>>> twoshift+fback: hash doesn't
>>
>>
>> I meant: Hash collision and then, after a branch miss, look up the
>> one fallback slot in the vtable header.
>
>We could also do a fallback table. Usually it'd be empty; occasionally
>it'd have one element in it. It'd always be possible to make this big
>enough to avoid collisions in a worst-case scenario.
>
>BTW, this is a general static char* -> void* dictionary; I bet it
>could have other uses. (It may also be a well-studied problem, though
>a bit hard to search for...) I suppose we could reduce it to
>read-optimized int -> int mappings.


The C FAQ says 'if you know the contents of your hash table up front
you can devise a perfect hash', but gives no details; it's probably
just hand-waving.
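For what it's worth, the up-front construction the FAQ hand-waves
about is, in this scheme, just a brute-force search over the shift
parameters. A minimal Python sketch (function names and table size are
mine; hashlib supplies the md5 pre-hash as proposed above):

```python
import hashlib
import itertools

def md5_prehash(signature):
    # Lower 64 bits of md5 of the name/argument string, as proposed above.
    return int.from_bytes(hashlib.md5(signature).digest()[:8], "little")

def find_perfect_hash(prehashes, table_bits=4):
    """Brute-force (r1, r2, r3) until the final hashes of all pre-hashes
    land in distinct slots, i.e. the table is collision-free ("perfect")."""
    m = (1 << table_bits) - 1          # mask selecting the table slot
    for r1, r2, r3 in itertools.product(range(64), repeat=3):
        slots = {((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m
                 for ph in prehashes}
        if len(slots) == len(prehashes):
            return r1, r2, r3, m
    return None                        # grow the table and retry
```

With a handful of methods and a 16-slot table this typically succeeds
within the first few parameter choices; failing that, you spend more
memory on a bigger table and retry, which is the trade-off noted
above.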

128 bits gives more entropy for perfect hashing: some, but not much,
since each shift r is hardwired to one 64-bit subset.

From the interning/key benchmarks, checking the full 128 bits would
probably not be noticeable in microbenchmarks; it's more a matter of
using an extra register and bloating the instruction and data caches a
bit, the kind of cost that can only be measured in production.
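Validating the full 128 bits at the call site is just one extra 64-bit
compare once the slot is found. A sketch of the lookup path, assuming
a hypothetical table layout (dicts stand in for the C vtable struct):

```python
def lookup(vtable, prehash_lo, prehash_hi):
    """Hash the low 64 bits into a slot, then validate the full 128-bit
    ID before trusting the function pointer (hypothetical layout)."""
    r1, r2, r3, m = vtable["params"]
    idx = ((prehash_lo >> r1) ^ (prehash_lo >> r2) ^ (prehash_lo >> r3)) & m
    slot = vtable["slots"][idx]
    if slot is not None and slot["id"] == (prehash_lo, prehash_hi):
        return slot["fn"]
    return None  # no match: this is the AttributeError path

# Tiny one-slot table: r1 = r2 = r3 = 0 and m = 0 map everything to slot 0.
vt = {"params": (0, 0, 0, 0),
      "slots": [{"id": (123, 456), "fn": "method_impl"}]}
```

A mismatch in either half rejects the slot, so a lower-64-bit
collision can no longer smash the stack; it just falls through to the
error path.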

The alternative is having a collision detection registry. If it complains, you're told where to edit Cython (perhaps a datafile) so that the pre-hash function changes:

if signature == 'foo:ddffi':
    # known collision with 'bar:ii'
    use high 64 bits of md5
else:
    use low 64 bits of md5

Each such collision would be documented in the CEP/SEP.
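That registry idea could be prototyped like this (a sketch:
'foo:ddffi' and 'bar:ii' are the hypothetical colliding pair from the
pseudocode above, and the registry would really be a data file shipped
with Cython):

```python
import hashlib

# Hypothetical collision registry, edited whenever the registry
# complains about a new lower-64-bit md5 collision.
KNOWN_COLLISIONS = {b"foo:ddffi"}  # assumed to collide with b"bar:ii"

def prehash(signature):
    """Low 64 bits of md5 normally; high 64 bits for signatures whose
    low halves are documented to collide."""
    digest = hashlib.md5(signature).digest()
    half = digest[8:] if signature in KNOWN_COLLISIONS else digest[:8]
    return int.from_bytes(half, "little")
```

Callers and callees both consult the same registry at compile time, so
they still agree on every pre-hash.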

But using 128 bits and relying on luck is perhaps simpler...

If we need flags, let's say that 92 bits suffice for the hash and use
16 for flags...

But I was thinking that you'd have separate tables for nogil callers
and GIL-holding callers, so that you don't need to scan for matching
flags. We really want this to be branch-miss-free. Still, flags are
good for error return codes etc.

Do you agree on forgetting about the encoded keys/interning even for SEP 201? There's only so much effort to go around and I'd much rather use md5 and these hash tables everywhere.

Dag

>
>- Robert
>_______________________________________________
>cython-devel mailing list
>cython-devel at python.org
>http://mail.python.org/mailman/listinfo/cython-devel

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

