[Cython] CEP1000: Native dispatch through callables

Nathaniel Smith njs at pobox.com
Tue Apr 17 16:20:14 CEST 2012


On Tue, Apr 17, 2012 at 1:24 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> OK, here's the benchmark code I've written:
>
> https://github.com/dagss/cep1000

This is great!

> Assumptions etc.:
>
>  - (Very) warm cache case is tested
>
>  - I compile and link libmycallable.so, libmycaller.so and ./bench
> with -fPIC, to emulate the Python environment
>
>  - I use mostly pure C, but use PyTypeObject in order to get the
> offsets to tp_flags etc. right (I emulate the checking that would
> happen on a PyObject* according to CEP1000).
>
>  - The test function is "double f(double x) { return x * x; }"
>
>  - The benchmark is run in a loop J=1000000 times (and time divided by
> J). This is repeated K=10000 times and the minimum walltime of the K run
> is used. This gave very stable readings on my system.
>
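
In other words, the standard take-the-minimum timing protocol. For
readers following along, it is roughly the sketch below (not the
actual benchmark code; walltime() and do_call() are placeholders for
the clock source and the dispatch under test):

    #define J 1000000   /* iterations per run */
    #define K 10000     /* number of runs */

    double best = HUGE_VAL;             /* from <math.h> */
    for (int k = 0; k < K; k++) {
        double t0 = walltime();         /* placeholder clock source */
        for (int j = 0; j < J; j++)
            do_call();                  /* the dispatch under test */
        double t = (walltime() - t0) / J;
        if (t < best)
            best = t;                   /* minimum over the K runs */
    }
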
> Fixing loop iterations:
>
> In the initial results I just scanned the overload list until
> NULL-termination. It seemed to me that the code generated for this
> scanning was the most important factor.
>
> Therefore I fixed the number of overloads as a known compile-time macro
> N *in the caller*. This is somewhat optimistic; however I didn't want to
> play with figuring out loop unrolling etc. at the same time, and
> hardcoding the length of the overload list sort of took that part out of
> the equation.
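
The scanning approach being timed here is, roughly, the sketch below
(the entry layout and names are illustrative, not the repo's actual
code):

    #include <stddef.h>

    typedef struct {
        size_t key;        /* signature key (or interned char*) */
        void *funcptr;     /* C entry point for that signature */
    } overload_entry;

    #define N 10           /* overload count fixed at compile time */

    static void *
    find_overload(const overload_entry *table, size_t wanted)
    {
        /* Fixed trip count: no NULL-termination test in the loop. */
        for (int i = 0; i < N; i++) {
            if (table[i].key == wanted)
                return table[i].funcptr;
        }
        return NULL;       /* miss: fall back to Python-level call */
    }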

Since you've set this up... I have a suggestion for something that may
be worth trying, though I've hesitated to propose it seriously. And
that is, an API where instead of scanning a table, the C-callable
exposes a pointer-to-function, something like
  int get_funcptr(PyObject *self, PyBytesObject *signature,
                  struct c_function_info *out)

The rationale is, if we want to support JITed functions where new
function pointers may be generated on the fly, the array approach has
a serious problem. You have to decide how many array slots to allocate
ahead of time, and if you run out, then... too bad. I guess you get to
throw away one of the existing pointers (i.e., leak memory) to make
room. Adding an indirection here for the actual lookup means that a
JIT could use a more complex lookup structure if justified, while a
simple non-JIT callable could hardcode the lookup as a signature
check + return, with no scan step at all. It also would
give us a lot of flexibility for future optimizations (e.g., is it
worth sorting the lookup table in LRU order?). And it would allow for
a JIT to generate a C function pointer on first use, rather than
requiring the first use to go via the Python-level __call__ fallback.
(Which is pretty important when the first use is to fetch the function
pointer before entering an inner loop!)
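
For concreteness, here is roughly what a non-JIT callable's
implementation might look like (a sketch: the c_function_info fields
and the return convention are placeholders, not part of CEP1000 as
written):

    #include <Python.h>

    struct c_function_info {
        void *funcptr;             /* raw C entry point */
        /* ...flags, calling-convention info, etc... */
    };

    static double f(double x) { return x * x; }

    /* The interned signature this callable supports (say "d->d"),
       created once at module init. */
    static PyObject *expected_sig;

    static int
    get_funcptr(PyObject *self, PyBytesObject *signature,
                struct c_function_info *out)
    {
        /* Non-JIT case: a signature check plus a struct fill; no
           table, no scan. A JIT would do a real lookup here (or
           compile a specialization) instead. */
        if ((PyObject *)signature == expected_sig) {
            out->funcptr = (void *)f;
            return 0;              /* found */
        }
        return -1;                 /* unknown: fall back to __call__ */
    }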

OTOH the extra indirection will obviously have some overhead, so it'd
be nice to know if it's actually a problem.

> Table explanation:
>
>  - N: Number of overloads in the list. For N=10, there are 9
> non-matching overloads in the list before the matching 10th (but the
> caller doesn't know this). For N=1, the caller knows this and
> optimizes for a hit in the first entry.
>
>  - MISMATCHES: If set, the caller tries 4 non-matching signatures before
> hitting the final one. If not set, only the correct signature is tried.
>
>  - LIKELY: If set, a GCC likely() macro is used to expect that the signature
> matches.
>
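
(Presumably likely() here is the usual GCC idiom, i.e. something like:

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

used in the scan as "if (likely(table[i].key == wanted)) ...".)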
>
> RESULTS:
>
> A direct call to (and execution of!) the function in the benchmark
> loop took 4.8 ns.
>
> An indirect dispatch through a function pointer of known type took
> 5.4 ns.
>
> Notation below is (intern, key), in ns:
>
> N=1:
>  MISMATCHES=False:
>    LIKELY=True:     6.44  6.44
>    LIKELY=False:    7.52  8.06
>  MISMATCHES=True:   8.59  8.59
> N=10:
>  MISMATCHES=False:  17.19  19.20
>  MISMATCHES=True:   36.52  37.59
>
> To be clear, "intern" is comparison of an interned "char*" (comparison
> with a 64-bit global variable), while "key" is comparison of a size_t
> (comparison with a 64-bit immediate in the instruction stream).
>
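
Concretely, the two comparison styles are something like this (the
names and the key value below are made up):

    #include <stdint.h>

    /* "intern": the expected signature is an interned char*; the
       comparison loads a pointer-sized global from memory. */
    extern const char *interned_sig;   /* set up once, process-wide */

    static int match_intern(const char *sig)
    {
        return sig == interned_sig;
    }

    /* "key": the expected signature is a 64-bit constant, encoded
       as an immediate in the caller's instruction stream. */
    #define SIG_KEY UINT64_C(0x643e2d64)   /* made-up key value */

    static int match_key(uint64_t key)
    {
        return key == SIG_KEY;
    }
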
> PRELIMINARY BENCHMARK CONCLUSION:
>
> Intern appears to be as fast as or faster than the key comparison.
>
> I don't know why (is the pointer offset to the global variable stored in
> less than 64 bits in the x86-64 instruction stream? What gdb (or other)
> commands would I use to figure that out?)

I don't know why. It's entirely possible that this is just an accident
of alignment or something. You're probably using, what, a 2 GHz CPU or
so? So we're talking about a difference on the order of 2-4 cycles.
(Actually, I'm surprised that LIKELY made any difference. The CPU
knows which branch you're going to take regardless; all the compiler
can do is try to improve memory locality for the "expected" path. But
your entire benchmark probably fits in L1, so why would memory
locality matter? Unless you got unlucky with cache associativity or
something...)

Generally, I think the conclusion I draw from these numbers is that in
the hot-cache case, the lookup overhead is negligible regardless of
how we do it. On my laptop (i7 L640 @ 2.13 GHz), a single L3 cache
miss costs 150 ns, and even an L2 miss is 30 ns.

-- Nathaniel

