[Cython] CEP1000: Native dispatch through callables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Tue Apr 17 16:34:48 CEST 2012


On 04/17/2012 04:20 PM, Nathaniel Smith wrote:
> On Tue, Apr 17, 2012 at 1:24 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> OK, here's the benchmark code I've written:
>>
>> https://github.com/dagss/cep1000
>
> This is great!
>
>> Assumptions etc.:
>>
>>   - (Very) warm cache case is tested
>>
>>   - I compile and link libmycallable.so, libmycaller.so and ./bench; with
>> -fPIC, to emulate the Python environment
>>
>>   - I use mostly pure C but use PyTypeObject in order to get the offsets
>> to tp_flags etc right (I emulate the checking that would happen on a
>> PyObject* according to CEP1000).
>>
>>   - The test function is "double f(double x) { return x * x; }"
>>
>>   - The benchmark is run in a loop J=1000000 times (and time divided by
>> J). This is repeated K=10000 times and the minimum walltime of the K runs
>> is used. This gave very stable readings on my system.
>>
>> Fixing loop iterations:
>>
>> In the initial results I just scanned the overload list until
>> NULL-termination. It seemed to me that the code generated for this
>> scanning was the most important factor.
>>
>> Therefore I fixed the number of overloads as a known compile-time macro
>> N *in the caller*. This is somewhat optimistic; however I didn't want to
>> play with figuring out loop unrolling etc. at the same time, and
>> hardcoding the length of the overload list sort of took that part out of
>> the equation.
>
> Since you've set this up... I have a suggestion for something that may
> be worth trying, though I've hesitated to propose it seriously. And
> that is, an API where instead of scanning a table, the C-callable
> exposes a pointer-to-function, something like
>    int get_funcptr(PyObject * self, PyBytesObject * signature, struct
> c_function_info * out)

Hmm. There are many ways to implement that function, though. It shifts the 
scanning logic from the caller to the callee; you would need to call it 
multiple times for different signatures...

But if the overhead can be shown to be minuscule, then it does perhaps 
make the API nicer, even if it feels like paying for nothing at the 
moment. But see below.
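
For concreteness: for a plain (non-JIT) callable with a single compiled 
specialization, the callee side could be about as trivial as the sketch 
below. The layout of struct c_function_info and the signature string are 
just my guesses here, nothing the CEP pins down:

#include <string.h>
#include <Python.h>

struct c_function_info {        /* hypothetical layout */
    void *funcptr;
    int flags;
};

static double square_impl(double x) { return x * x; }

static int get_funcptr(PyObject *self, PyBytesObject *signature,
                       struct c_function_info *out)
{
    /* One hardcoded signature check, no table scan at all. */
    if (strcmp(PyBytes_AS_STRING((PyObject *)signature), "d->d") != 0)
        return 0;               /* no native entry point for this signature */
    out->funcptr = (void *)square_impl;
    out->flags = 0;
    return 1;
}

A JIT, on the other hand, could hide whatever lookup structure it likes 
behind the same entry point.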

Will definitely not get around to this today; anyone else feel free...

>
> The rationale is, if we want to support JITed functions where new
> function pointers may be generated on the fly, the array approach has
> a serious problem. You have to decide how many array slots to allocate
> ahead of time, and if you run out, then... too bad. I guess you get to

Note that the table is reached through a pointer stored in the PyObject, 
i.e. the PyObject I've tested with is laid out as

[object data, &table, table]

So a JIT could keep the table in a separate location on the heap: it can 
allocate a new table, copy over the contents, and, when everything is 
ready, do an atomic pointer update (using assembly instructions / GCC 
intrinsics, not pthreads or locking).

The old table would need to linger for a bit, but could at the latest be 
deallocated when the PyObject is deallocated.
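
In code, the update itself is just a single store; a minimal sketch 
(the struct layout and names here are mine, only to illustrate the swap):

#include <Python.h>

typedef struct {
    size_t key;                 /* signature key; 0 terminates the table */
    void *funcptr;
} overload_entry;

typedef struct {
    PyObject_HEAD
    overload_entry *table;      /* what callers scan on dispatch */
    /* ... other object data ... */
} nativecallable_object;

/* The JIT has built new_table: old entries copied over, the freshly
   compiled specialization appended, 0-terminated. */
static void publish_table(nativecallable_object *obj,
                          overload_entry *new_table)
{
    /* Single atomic pointer store (GCC 4.7's __atomic builtins; older
       compilers would use the __sync primitives or inline assembly).
       A concurrent caller sees either the complete old table or the
       complete new one, never a half-written one. */
    __atomic_store_n(&obj->table, new_table, __ATOMIC_RELEASE);

    /* The old table can only be freed once no caller can still be
       scanning it -- at the latest, together with the PyObject. */
}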

> throw away one of the existing pointers (i.e., leak memory) to make
> room. Adding an indirection here for the actual lookup means that a
> JIT could use a more complex lookup structure if justified, while a
> simple C function pointer could just hardcode this to a
> signature-check + return, with no lookup step at all. It also would
> give us a lot of flexibility for future optimizations (e.g., is it
> worth sorting the lookup table in LRU order?). And it would allow for
> a JIT to generate a C function pointer on first use, rather than
> requiring the first use to go via the Python-level __call__ fallback.
> (Which is pretty important when the first use is to fetch the function
> pointer before entering an inner loop!)

What I was thinking of along these lines is adding another function 
pointer to the PyUnofficialTypeObject. A caller would then do:

if found in table:
    do dispatch
else if object supports get_funcptr:
    call get_funcptr
else:
    python dispatch
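
In C, that chain would look roughly like the following; the helper names 
and the signature-key plumbing are made up for illustration, and error 
handling is omitted:

#include <Python.h>

typedef void *(*get_funcptr_t)(PyObject *obj, size_t sig_key);

/* Stubs standing in for the parts under discussion: how the overload
   table and the optional get_funcptr slot are actually reached. */
void *scan_overload_table(PyObject *obj, size_t sig_key);  /* NULL on miss */
get_funcptr_t get_funcptr_slot(PyObject *obj);             /* NULL if absent */

static double call_d_d(PyObject *obj, size_t sig_key, double x)
{
    void *fp;
    get_funcptr_t get_funcptr;

    /* 1. Fast path: hit in the exposed overload table. */
    fp = scan_overload_table(obj, sig_key);
    if (fp != NULL)
        return ((double (*)(double))fp)(x);

    /* 2. Ask the object itself, e.g. a JIT that compiles the
          specialization on first request. */
    get_funcptr = get_funcptr_slot(obj);
    if (get_funcptr != NULL) {
        fp = get_funcptr(obj, sig_key);
        if (fp != NULL)
            return ((double (*)(double))fp)(x);
    }

    /* 3. Fall back to ordinary Python-level dispatch. */
    {
        PyObject *res = PyObject_CallFunction(obj, "d", x);
        double r = (res != NULL) ? PyFloat_AsDouble(res) : -1.0;
        Py_XDECREF(res);
        return r;
    }
}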

>
> OTOH the extra indirection will obviously have some overhead, so it'd
> be nice to know if it's actually a problem.
>
>> Table explanation:
>>
>>   - N: Number of overloads in the list. For N=10, there are 9 non-matching
>> overloads in the list before the matching 10th (but the caller doesn't know
>> this). For N=1, the caller knows this and optimizes for a hit in the
>> first entry.
>>
>>   - MISMATCHES: If set, the caller tries 4 non-matching signatures before
>> hitting the final one. If not set, only the correct signature is tried.
>>
>>   - LIKELY: If set, a GCC likely() macro is used to expect that the signature
>> matches.
>>
>>
>> RESULTS:
>>
>> A direct call to (and execution of!) the function in the benchmark loop took 4.8 ns.
>>
>> An indirect dispatch through a function pointer of known type took 5.4 ns.
>>
>> Notation below is (intern, key), in ns:
>>
>> N=1:
>>   MISMATCHES=False:
>>     LIKELY=True:     6.44  6.44
>>     LIKELY=False:    7.52  8.06
>>   MISMATCHES=True:   8.59  8.59
>> N=10:
>>   MISMATCHES=False:  17.19  19.20
>>   MISMATCHES=True:   36.52  37.59
>>
>> To be clear, "intern" means an interned "char*" (comparison against a 64-bit
>> global variable), while "key" means comparison of a size_t (comparison against
>> a 64-bit immediate in the instruction stream).
>>
>> PRELIMINARY BENCHMARK CONCLUSION:
>>
>> Intern appears to be as fast as, or faster than, key comparison.
>>
>> I don't know why (is the pointer offset to the global variable stored in
>> less than 64 bits in the x86-64 instruction stream? What gdb (or other)
>> commands would I use to figure that out?)
>
> I don't know why. It's entirely possible that this is just an accident
> of alignment or something. You're probably using, what, a 2 GHz CPU or
> so? So we're talking about a difference on the order of 2-4 cycles.
> (Actually, I'm surprised that LIKELY made any difference. The CPU
> knows which branch you're going to take regardless; all the compiler
> can do is try to improve memory locality for the "expected" path. But
> your entire benchmark probably fits in L1, so why would memory
> locality matter? Unless you got unlucky with cache associativity or
> something...)

Oh, sorry:

~ $ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 30
model name	: Intel(R) Core(TM) i7 CPU       Q 840  @ 1.87GHz
stepping	: 5
cpu MHz		: 1866.000
cache size	: 8192 KB
...

Dag

