[Cython] CEP1000: Native dispatch through callables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu Apr 19 12:56:58 CEST 2012


On 04/19/2012 11:07 AM, Nathaniel Smith wrote:
> On Wed, Apr 18, 2012 at 10:58 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> On 04/18/2012 11:35 PM, Dag Sverre Seljebotn wrote:
>>>
>>> On 04/17/2012 02:24 PM, Dag Sverre Seljebotn wrote:
>>>>
>>>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>>>>
>>>>> Travis Oliphant recently raised the issue on the NumPy list of what
>>>>> mechanisms to use to box native functions produced by his Numba so
>>>>> that SciPy functions can call them, e.g. (I'm making the numba part
>>>>> up):
>>>>>
>>>>> @numba  # Compiles function using LLVM
>>>>> def f(x): return 3 * x
>>>>>
>>>>> print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!
>>>>>
>>>>> Obviously, we want something standard, so that Cython functions can
>>>>> also be called in a fast way.
>>>>
>>>>
>>>> OK, here's the benchmark code I've written:
>>>>
>>>> https://github.com/dagss/cep1000
>>>>
>>>> Assumptions etc.:
>>>>
>>>> - (Very) warm cache case is tested
>>>>
>>>> - I compile and link libmycallable.so, libmycaller.so and ./bench with
>>>> -fPIC, to emulate the Python environment
>>>>
>>>> - I use mostly pure C but use PyTypeObject in order to get the offsets
>>>> to tp_flags etc right (I emulate the checking that would happen on a
>>>> PyObject* according to CEP1000).
>>>>
>>>> - The test function is "double f(double x) { return x * x; }"
>>>>
>>>> - The benchmark is run in a loop J=1000000 times (and the time divided by
>>>> J). This is repeated K=10000 times and the minimum walltime of the K runs
>>>> is used. This gave very stable readings on my system.
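A minimal sketch of that timing scheme, for concreteness -- walltime(), the
volatile sink and the exact structure here are my own scaffolding, not code
from the cep1000 repo:

#include <float.h>
#include <time.h>

#define J 1000000
#define K 10000

static double f(double x) { return x * x; }

static double walltime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

/* Run the call J times, repeat K times, keep the minimum per-call time. */
static double bench(void)
{
    volatile double sink = 0.0;
    double best = DBL_MAX;
    for (int k = 0; k < K; ++k) {
        double t0 = walltime();
        for (int j = 0; j < J; ++j)
            sink += f(1.5);   /* keeps the call from being optimized away */
        double per_call = (walltime() - t0) / J;
        if (per_call < best)
            best = per_call;
    }
    return best;   /* seconds per call; ~4.8e-9 for the direct case below */
}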
>>>>
>>>> Fixing loop iterations:
>>>>
>>>> In the initial results I just scanned the overload list until
>>>> NULL-termination. It seemed to me that the code generated for this
>>>> scanning was the most important factor.
>>>>
>>>> Therefore I fixed the number of overloads as a known compile-time macro
>>>> N *in the caller*. This is somewhat optimistic; however I didn't want to
>>>> play with figuring out loop unrolling etc. at the same time, and
>>>> hardcoding the length of the overload list sort of took that part out of
>>>> the equation.
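The fixed-N scan is then roughly the following (the table layout and names
are hypothetical; the real structs are in the cep1000 repo):

#include <stddef.h>

typedef struct { size_t key; void *funcptr; } overload_t;

#define N 10   /* known at compile time in the caller */

static void *find_overload(const overload_t *table, size_t target)
{
    /* With N a compile-time constant the compiler can fully unroll this,
       instead of scanning until a NULL terminator. */
    for (int i = 0; i < N; ++i)
        if (table[i].key == target)
            return table[i].funcptr;
    return 0;
}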
>>>>
>>>>
>>>> Table explanation:
>>>>
>>>> - N: Number of overloads in the list. For N=10, there are 9 non-matching
>>>> overloads in the list before the matching 10th (but the caller doesn't
>>>> know this). For N=1, the caller knows this and optimizes for a hit in the
>>>> first entry.
>>>>
>>>> - MISMATCHES: If set, the caller tries 4 non-matching signatures before
>>>> hitting the final one. If not set, only the correct signature is tried.
>>>>
>>>> - LIKELY: If set, a GCC likely() macro is used to expect that the
>>>> signature matches.
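For reference, the LIKELY toggle refers to the usual GCC branch-hint idiom;
the exact wrapper in the benchmark may differ:

#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static void *check_entry(size_t entry_key, size_t target, void *funcptr)
{
    if (likely(entry_key == target))   /* hint: a match is the common case */
        return funcptr;
    return 0;
}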
>>>>
>>>>
>>>> RESULTS:
>>>>
>>>> A direct call to (and execution of!) the function in the benchmark loop
>>>> took 4.8 ns.
>>>>
>>>> An indirect dispatch through a function pointer of known type took 5.4 ns.
>>>>
>>>> Notation below is (intern, key), in ns:
>>>>
>>>> N=1:
>>>>   MISMATCHES=False:
>>>>     LIKELY=True: 6.44 6.44
>>>>     LIKELY=False: 7.52 8.06
>>>>   MISMATCHES=True: 8.59 8.59
>>>> N=10:
>>>>   MISMATCHES=False: 17.19 19.20
>>>>   MISMATCHES=True: 36.52 37.59
>>>>
>>>> To be clear, "intern" is an interned "char*" (comparison with a 64-bit
>>>> global variable), while "key" is comparison of a size_t (comparison of a
>>>> 64-bit immediate in the instruction stream).
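To make the two flavours concrete (my sketch; the interned pointer and the
key encoding below are hypothetical):

#include <stddef.h>

/* "intern": the signature is an interned char*, so a match is a pointer
   comparison against a global that must first be loaded. */
extern const char *interned_sig_dd;   /* set once by the interning machinery */

static int match_intern(const char *entry_sig)
{
    return entry_sig == interned_sig_dd;
}

/* "key": the signature is a size_t, so a match compares against a 64-bit
   immediate encoded directly in the instruction stream. */
#define SIG_KEY_DD ((size_t)0x6464)   /* hypothetical key for double(double) */

static int match_key(size_t entry_key)
{
    return entry_key == SIG_KEY_DD;
}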
>>>
>>>
>>> First: My benchmarks today are a little inconsistent with earlier
>>> results. I think I have converged now in terms of number of iterations
>>> (higher than last time), but that doesn't explain why indirect dispatch
>>> through a function pointer is now *higher*:
>>>
>>> Direct took 4.83 ns
>>> Dispatch took 5.91 ns
>>>
>>> Anyway, even if crude, hopefully this will tell us something. The order
>>> of the benchmark numbers is:
>>>
>>> intern key get_func_intern get_func_key
>>>
>>> where the get_func_XX versions retrieve a function pointer taking either
>>> a single interned signature or a single key as argument (just see
>>> mycallable.c).
>>>
>>> In the MISMATCHES case, the get_func_XX is called 4 times with a miss
>>> and then with the match.
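For readers without the repo at hand, the get_func_XX variants have roughly
this shape (hypothetical prototypes; the real code is in mycallable.c):

#include <stddef.h>

typedef double (*dd_func)(double);

/* Ask the callable for a pointer matching one signature; NULL on a miss. */
extern void *get_func_intern(void *callable, const char *interned_sig);
extern void *get_func_key(void *callable, size_t key);

static double call_dd(void *callable, size_t key, double x)
{
    dd_func f = (dd_func)get_func_key(callable, key);
    return f != 0 ? f(x) : 0.0;   /* a miss means trying the next signature */
}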
>>>
>>> N=1
>>> - MISMATCHES=False:
>>> --- LIKELY=True: 5.91 6.44 8.59 9.13
>>> --- LIKELY=False: 7.52 7.52 9.13 9.13
>>> - MISMATCHES=True: 11.28 11.28 22.56 22.56
>>>
>>> N=10
>>> - MISMATCHES=False: 17.18 18.80 29.75 10.74(*)
>>> - MISMATCHES=True: 36.06 38.13 105.00 36.52
>>>
>>> Benchmark comments:
>>>
>>> The one marked (*) is implemented as a switch statement with keys known
>>> at compile time. I tried shifting around the case label values a bit but
>>> the result persists; it could just be that the compiler does a very good
>>> job of the switch as well.
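The (*) variant is essentially the following, with the keys as compile-time
case labels (the values here are made up):

#include <stddef.h>

static void *funcs[10];

static void *lookup_switch(size_t key)
{
    switch (key) {
        case 1000: return funcs[0];
        case 2000: return funcs[1];
        /* ... 10 cases in total ... */
        default:   return 0;
    }
}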
>>
>>
>> I should make this clearer: The issue is that the compiler may have
>> reordered the labels so that the hit came close to first; in the intern case
>> the code is written so that the hit is always after 9 mismatches.
>>
>> So I redid the (*) test using 10 cases with very different numeric values,
>> and then tried each of the 10 as the matching case. Timings were stable for
>> each choice of label (so this is not noise), with values:
>>
>> 13.4 11.8 11.8 12.3 10.7 11.2 12.3
>>
>> Guess this is the binary decision tree Mark talked about...
>
> Yes, if you look at the ASM (this is easier to keep track of if you
> make the switch cases into round decimal numbers, like 1000, 2000,
> 3000...), then you can see that gcc is generating a fully unrolled
> binary search, as basically a set of nested if/else's, like:
>
> if (value < 5000) {
>    if (value == 2000) {
>      return &ptr_2000;
>    } else if (value == 4000) {
>      return &ptr_4000;
>    }
> } else {
>    if (value == 6000) {
>      return &ptr_6000;
>    } else if (value == 8000) {
>      return &ptr_8000;
>    }
> }
>
> I suppose if we're ambitious we could do the same with the intern
> table for Cython compile-time variants (we don't know the values ahead
> of time, but we know how many there will be, so we'd generate the list
> of intern values, sort it, and then replace the comparison values
> above with table[middle], etc.).
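Concretely, the generated lookup could load its pivots from a table sorted
at module-init time; a sketch with hypothetical names (the interned pointers
aren't known until runtime, but their count is):

#include <stdint.h>

static const char *interned_sigs[4];   /* filled and sorted at module init */
static void *overload_funcs[4];        /* permuted to match */

static void *lookup(const char *sig)
{
    /* Fully unrolled binary search, mirroring what gcc emits for the
       switch, but with table loads instead of immediates. */
    if ((uintptr_t)sig < (uintptr_t)interned_sigs[2]) {
        if (sig == interned_sigs[0]) return overload_funcs[0];
        if (sig == interned_sigs[1]) return overload_funcs[1];
    } else {
        if (sig == interned_sigs[2]) return overload_funcs[2];
        if (sig == interned_sigs[3]) return overload_funcs[3];
    }
    return 0;
}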

Right. With everything being essentially equal, this isn't getting easier.

I thought of some drawbacks of getfuncptr:

  - Important: It doesn't allow you to actually inspect the supported 
signatures, which is needed (or at least convenient) if you want to use 
an FFI library or do some JIT-ing. So an iteration mechanism is still 
needed in addition, meaning the number of things for the object to 
implement grows a bit large (see the sketch after this list). Default 
implementations help -- OTOH there really wasn't a major drawback with 
the table approach as long as JITs can just replace it?

  - Minor: I've read that on Intel's Sandy Bridge, the micro-ops are 
cached after instruction decoding, and that the micro-op cache is so 
precious (and decoding so expensive) that the recommendation is no loop 
unrolling at all! So essentially sticking the table in unrolled 
instructions may not continue to be a good idea. (Of course, getfuncptr 
doesn't have this problem.)
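Roughly what the combined interface would then have to look like
(hypothetical names and shapes; none of this is fixed by the CEP yet):

#include <stddef.h>

typedef struct {
    const char *signature;   /* interned signature string */
    void *funcptr;
} native_overload_t;

/* getfuncptr style: ask for one signature at a time. */
extern void *getfuncptr(void *callable, const char *interned_sig);

/* The extra iteration mechanism: expose the supported signatures so an
   FFI library or JIT can inspect them -- or, in the table approach,
   replace them. */
extern const native_overload_t *get_overload_table(void *callable, size_t *n);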

Dag

