[Cython] CEP1000: Native dispatch through callables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Tue Apr 17 14:36:45 CEST 2012


On 04/17/2012 02:24 PM, Dag Sverre Seljebotn wrote:
> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>> Travis Oliphant recently raised the issue on the NumPy list of what
>> mechanisms to use to box native functions produced by his Numba so
>> that SciPy functions can call it, e.g. (I'm making the numba part
>> up):
>>
>> @numba  # Compiles function using LLVM
>> def f(x):
>>     return 3 * x
>>
>> print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!
>>
>> Obviously, we want something standard, so that Cython functions can
>> also be called in a fast way.
>
> OK, here's the benchmark code I've written:
>
> https://github.com/dagss/cep1000
>
> Assumptions etc.:
>
> - (Very) warm cache case is tested
>
> - I compile and link libmycallable.so, libmycaller.so and ./bench with
> -fPIC, to emulate the Python environment
>
> - The code is mostly pure C, but I use PyTypeObject in order to get the
> offsets to tp_flags etc. right (I emulate the checking that would happen
> on a PyObject* according to CEP1000).
>
> - The test function is "double f(double x) { return x * x; }"
>
> - The benchmark is run in a loop J=1000000 times (and the time divided by
> J). This is repeated K=10000 times and the minimum walltime of the K runs
> is used (the harness is sketched below). This gave very stable readings
> on my system.
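>
> To make that concrete, the timing harness is essentially the following.
> This is only a sketch with made-up names (run_once_ns etc.); the real
> code is in the repo above, where f lives in a separate shared library so
> the compiler cannot inline or fold the call away:
>
> #include <stdio.h>
> #include <time.h>
>
> #define J 1000000        /* inner iterations per measurement */
> #define K 10000          /* outer repetitions; the minimum is kept */
>
> double f(double x) { return x * x; }    /* the test function */
>
> static double run_once_ns(void)
> {
>     struct timespec t0, t1;
>     volatile double sink = 0.0;     /* keep the loop from being optimized away */
>     clock_gettime(CLOCK_MONOTONIC, &t0);
>     for (int j = 0; j < J; j++)
>         sink += f(2.0);             /* replace with the dispatch being measured */
>     clock_gettime(CLOCK_MONOTONIC, &t1);
>     (void)sink;
>     return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / J;
> }
>
> int main(void)
> {
>     double best = 1e300;
>     for (int k = 0; k < K; k++) {
>         double t = run_once_ns();
>         if (t < best)
>             best = t;
>     }
>     printf("%.2f ns per call\n", best);
>     return 0;
> }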
>
> Fixing loop iterations:
>
> In the initial results I just scanned the overload list until
> NULL-termination. It seemed to me that the code generated for this
> scanning was the most important factor.
>
> Therefore I fixed the number of overloads as a known compile-time macro
> N *in the caller*. This is somewhat optimistic; however, I didn't want to
> play with figuring out loop unrolling etc. at the same time, and
> hardcoding the length of the overload list sort of took that part out of
> the equation.
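>
> Concretely, an overload entry and the two scan variants look roughly like
> this (a sketch only; overload_entry, find_scan and find_fixed are made-up
> names, and the actual entry layout is whatever CEP1000 specifies):
>
> #include <stddef.h>
>
> /* One entry in the overload list attached to the callable. */
> typedef struct {
>     void *signature;   /* interned char* (or a size_t key) identifying the signature */
>     void *funcptr;     /* the native function implementing it */
> } overload_entry;
>
> /* Initial version: scan until the NULL terminator. */
> static void *find_scan(overload_entry *table, void *sig)
> {
>     for (overload_entry *e = table; e->signature != NULL; e++)
>         if (e->signature == sig)
>             return e->funcptr;
>     return NULL;
> }
>
> /* Benchmarked version: the caller knows the length as the compile-time macro N. */
> #define N 10
> static void *find_fixed(overload_entry *table, void *sig)
> {
>     for (int i = 0; i != N; i++)
>         if (table[i].signature == sig)
>             return table[i].funcptr;
>     return NULL;
> }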
>
>
> Table explanation:
>
> - N: Number of overloads in the list. For N=10, there are 9 non-matching
> overloads in the list before the matching 10th (but the caller doesn't
> know this). For N=1, the caller knows this and optimizes for a hit in the
> first entry.
>
> - MISMATCHES: If set, the caller tries 4 non-matching signatures before
> hitting the final one. If not set, only the correct signature is tried.
>
> - LIKELY: If set, a GCC likely() macro (see the sketch below) is used to
> expect that the signature matches.
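>
> (LIKELY here is the usual GCC branch-prediction hint; roughly the
> standard definition, i.e.:)
>
> #ifdef __GNUC__
> #  define likely(x)   __builtin_expect(!!(x), 1)
> #  define unlikely(x) __builtin_expect(!!(x), 0)
> #else
> #  define likely(x)   (x)
> #  define unlikely(x) (x)
> #endif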
>
>
> RESULTS:
>
> A direct call to (and execution of!) the function in the benchmark loop
> took 4.8 ns.
>
> An indirect dispatch through a function pointer of known type took 5.4 ns.
>
> Notation below is (intern, key), in ns:
>
> N=1:
> MISMATCHES=False:
> LIKELY=True: 6.44 6.44
> LIKELY=False: 7.52 8.06
> MISMATCHES=True: 8.59 8.59
> N=10:
> MISMATCHES=False: 17.19 19.20
> MISMATCHES=True: 36.52 37.59
>
> To be clear, "intern" is comparison with an interned "char*" (i.e. with a
> 64-bit global variable), while "key" is comparison of a size_t (a 64-bit
> immediate in the instruction stream).
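>
> In C, the two checks the caller does are essentially the following
> (a sketch; intern_entry, key_entry, match_intern and match_key are
> made-up names, and only interned_dd and the key constant come from the
> generated code shown below):
>
> #include <stddef.h>
>
> /* intern variant: the entry stores an interned char*, and the caller
>    compares it against a global holding the interned signature
>    (the interned_dd(%rip) load in the listing below). */
> typedef struct { char *signature; void *funcptr; } intern_entry;
> extern char *interned_dd;          /* interned "dd" signature, set up once elsewhere */
>
> static void *match_intern(intern_entry *e)
> {
>     return e->signature == interned_dd ? e->funcptr : NULL;
> }
>
> /* key variant: the entry stores a size_t key, so the constant ends up
>    as a 64-bit immediate (the movabsq in the listing below). */
> typedef struct { size_t key; void *funcptr; } key_entry;
>
> static void *match_key(key_entry *e)
> {
>     return e->key == (size_t)20017697242043UL ? e->funcptr : NULL;
> }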
>
> PRELIMINARY BENCHMARK CONCLUSION:
>
> Intern appears to be as fast as or faster than keys.
>
> I don't know why (is the pointer offset to the global variable stored in
> less than 64 bits in the x86-64 instruction stream? What gdb (or other)
> commands would I use to figure that out?)
>
> What happens in the assembly is:
>
> movq (%rdi,%rax), %rax
> movq interned_dd(%rip), %rdx
> cmpq %rdx, (%rax)
> jne .L3
>
> vs.
>
> movabsq $20017697242043, %rdx
> movq (%rdi,%rax), %rax
> cmpq %rdx, (%rax)
> jne .L6
>
>
> TODO:
>
> The caller tries, for each entry in the overload list, to match all of
> the candidate signatures. Changing the order of these two loops should
> also be tried.

One more data-point:

When comparing a 96-bit key directly, the fastest benchmark for keys 
(N=1, MISMATCHES=False, LIKELY=True) grows from 6.44 to 6.98 ns. (It 
should perform relatively better when N>1, unless the prefixes match.)

A 448-bit key is 8.59 ns.
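
To spell out what "comparing a 96-bit key directly" means, the idea is
roughly the following (a sketch with made-up names; the 448-bit case is
the same idea with more 64-bit words):

#include <stdint.h>

typedef struct { uint64_t lo; uint32_t hi; } key96;   /* 96-bit signature key */

/* Memberwise comparison: against a non-matching entry it usually fails
   already on the first 64-bit word, so wider keys cost relatively little
   extra unless the prefixes happen to match. */
static int key96_equal(key96 a, key96 b)
{
    return a.lo == b.lo && a.hi == b.hi;
}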

Dag

