[Cython] CEP1000: Native dispatch through callables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sat Apr 14 21:08:13 CEST 2012


On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
> Travis Oliphant recently raised the issue on the NumPy list of what
> mechanisms to use to box native functions produced by his Numba so that
> SciPy functions can call it, e.g. (I'm making the numba part up):
<snip>

This thread is turning into one of those big ones...

But I think it is really worth it in the end; I'm getting excited about 
the possibility down the road of importing functions using normal Python 
mechanisms and still getting fast calls.

Anyway, to organize discussion I've tried to revamp the CEP and describe 
both the intern-way and the strcmp-way.

The wiki appears to be down, so I'll post it below...

Dag

= CEP 1000: Convention for native dispatches through Python callables =

Many callable objects are simply wrappers around native code.  This
holds for Cython functions, f2py functions, manually
written CPython extensions, Numba-compiled functions, etc.

Obviously, when native code calls other native code, it would be
nice to skip the significant cost of boxing and unboxing all the arguments.
Early binding at compile-time is only possible
between different Cython modules, not between all the tools
listed above.

[[enhancements/nativecall|CEP 523]] deals with Cython-specific aspects
(and is out-of-date w.r.t. this CEP); this CEP is intended to be about
a cross-project convention only. If successful, this CEP may be
proposed as a PEP in a modified form.

Motivating example (looking a year or two into the future):

{{{
@numba
def f(x): return 2 * x

@cython.inline
def g(x : cython.double): return 3 * x

from fortranmod import h

print f(3)
print g(3)
print h(3)
print scipy.integrate.quad(f, 0.2, 3) # fast callback!
print scipy.integrate.quad(g, 0.2, 3) # fast callback!
print scipy.integrate.quad(h, 0.2, 3) # fast callback!

}}}

== The native-call slot ==

We need ''fast'' probing of whether a callable object supports this
CEP.  Other mechanisms, such as an attribute lookup in a dict, are too
slow for many purposes (quoting robertwb: "We're trying to get a 300ns
dispatch down to 10ns; you do not want a 50ns dict lookup"). (Obviously,
if you call a callable in a loop you can fetch the pointer outside
of the loop. But in particular if this becomes a language feature
in Cython it will be used in all sorts of places.)

So we hack another type slot into existing and future CPython
implementations in the following way: This CEP provides a C header
that for all Python versions defines a macro
{{{Py_TPFLAGS_UNOFFICIAL_EXTRAS}}} for a free bit in
{{{tp_flags}}} in the {{{PyTypeObject}}}.

If present, then we extend {{{PyTypeObject}}}
as follows:
{{{
typedef struct {
     PyTypeObject tp_main;
     size_t tp_unofficial_flags;
     size_t tp_nativecall_offset;
} PyUnofficialTypeObject;
}}}

{{{tp_unofficial_flags}}} is unused and should be all 0 for the time
being, but can be used later to indicate features beyond this CEP.

If {{{tp_nativecall_offset != 0}}}, this CEP is supported, and
the information for doing a native dispatch on a callable {{{obj}}}
is located at
{{{
(char*)obj + ((PyUnofficialTypeObject*)obj->ob_type)->tp_nativecall_offset;
}}}
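
For concreteness, here is a minimal sketch of the probing step a caller
could perform. It assumes the proposed header (defining the macro and
struct above) is available; the helper name itself is made up and not
part of the CEP:
{{{
#include <Python.h>

/* Hypothetical helper: return a pointer to the native-call descriptor
 * of obj, or NULL if obj's type does not support this CEP.  Assumes the
 * CEP's header defining Py_TPFLAGS_UNOFFICIAL_EXTRAS and
 * PyUnofficialTypeObject has been included. */
static void *
cep1000_get_descriptor(PyObject *obj)
{
    PyTypeObject *tp = Py_TYPE(obj);
    PyUnofficialTypeObject *utp;

    if (!(tp->tp_flags & Py_TPFLAGS_UNOFFICIAL_EXTRAS))
        return NULL;                      /* type not extended at all */
    utp = (PyUnofficialTypeObject *)tp;
    if (utp->tp_nativecall_offset == 0)
        return NULL;                      /* extended, but no native-call table */
    return (char *)obj + utp->tp_nativecall_offset;
}
}}}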

=== GIL-less access ===

It is OK to access the native-call table without holding the GIL. This
should of course only be used to call functions that state in their
signature that they don't need the GIL.

This is important for JITted callables that would like to rewrite their
table as more specializations get added; if the table needs to be
reallocated, the old table must linger long enough that all threads
currently accessing it are done with it.

== Native dispatch descriptor ==

The final format for the descriptor is not agreed upon yet; this sums
up the major alternatives.

The descriptor should be a list of specializations/overloads, each
described by a function pointer and a signature specification
string, such as "id)i" for {{{int f(int, double)}}}.

The way it is stored must cater to two cases. The first is when the
caller expects one or more hard-coded signatures:
{{{
if (obj has signature "id)i") {
     call;
} else if (obj has signature "if)i") {
     call with promoted second argument;
} else {
     box all arguments;
     PyObject_Call;
}
}}}
The second case is when a call stack is built dynamically while parsing
the signature string. Since this has higher overhead anyway, optimizing
for the first case makes sense; a sketch of such a hard-coded caller is
given below.
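
As an illustration of the first case, here is a hedged C sketch of a
caller that knows its signatures at compile time. {{{cep1000_find}}} is
a hypothetical lookup helper (its real shape depends on which of the
approaches below is chosen) that returns the function pointer for an
exact signature match, or NULL; the descriptor pointer comes from the
probing step above:
{{{
#include <Python.h>

/* Hypothetical lookup helper, not part of the CEP. */
void *cep1000_find(void *descr, const char *signature);

typedef int (*func_id_i)(int, double);
typedef int (*func_if_i)(int, float);

static int
call_fast_or_boxed(PyObject *obj, void *descr, int a, double b)
{
    void *fp;

    if ((fp = cep1000_find(descr, "id)i")) != NULL)
        return ((func_id_i)fp)(a, b);          /* exact match */
    if ((fp = cep1000_find(descr, "if)i")) != NULL)
        return ((func_if_i)fp)(a, (float)b);   /* second argument converted */

    /* Fall back to boxing all arguments. */
    {
        PyObject *result = PyObject_CallFunction(obj, "id", a, b);
        long r = result ? PyLong_AsLong(result) : -1;
        Py_XDECREF(result);
        return (int)r;
    }
}
}}}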

=== Approach 1: Interning/run-time allocated IDs ===


1A: Let each overload have a struct
{{{
struct {
     size_t signature_id;
     char *signature;
     void *func_ptr;
};
}}}
Within each process run, there is a 1:1 correspondence between
{{{signature}}} and {{{signature_id}}}. {{{signature_id}}} is allocated
by some central registry.

1B: Intern the string instead:
{{{
struct {
     char *signature; /* pointer must come from the central registry */
     void *func_ptr;
};
}}}
However, this is '''not''' trivial, since signature strings can
be allocated on the heap (e.g., a JIT would do this), so interned strings
must be memory-managed and reference-counted. This could be done by
having each object pass the signature string to the interning machinery
'''both''' when incref-ing and when decref-ing it.
Using Python {{{bytes}}} objects is another option.
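
To make the lookup concrete, here is a minimal sketch of scanning a
1A-style table. Two details are assumptions, not part of the CEP: the
descriptor is taken to be an array of the entries above terminated by
{{{func_ptr == NULL}}}, and the needle ID is assumed to have been
fetched once from the central registry (e.g. at module import). For 1B
the comparison would instead be pointer equality on the interned
{{{signature}}} pointer.
{{{
#include <stddef.h>

/* Hedged sketch of an approach-1A descriptor and its lookup. */
typedef struct {
    size_t signature_id;
    char *signature;
    void *func_ptr;
} cep1000_entry_1a;

static void *
find_by_id(const cep1000_entry_1a *entries, size_t needle_id)
{
    for (; entries->func_ptr != NULL; entries++) {
        if (entries->signature_id == needle_id)
            return entries->func_ptr;   /* one integer comparison per candidate */
    }
    return NULL;
}
}}}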

==== Discussion ====

'''The cost of comparing a signature''': Comparing a global variable
(the needle) to a value that is guaranteed to already be in cache (the
candidate match).

'''Pros:'''

  * Conceptually simple struct format.

'''Cons:'''

  * Requires a registry for interning strings. This must be
    "handshaked" between the implementors of this CEP (probably by
    letting the first to get at {{{sys.modules["_nativecall"]}}} stick
    it there), as we can't ship a common dependency library for this CEP.

=== Approach 2: Efficient strcmp of verbatim signatures ===

The idea is to store the full signatures and the function pointers together
in the same memory area, but still have some structure to allow for quick
scanning through the list.

Each entry has the structure {{{[signature_string, funcptr]}}}
where:

  * The signature string has variable length, but the length is
    a multiple of 8 bytes on all platforms. The {{{funcptr}}} is always
    8 bytes (it is padded on 32-bit systems).

  * The total size of the entry should be a multiple of 16 bytes (i.e.,
    the signature data should be 8 bytes, or 24 bytes, or...)

  * All but the first chunk of signature data should start with a
    continuation character "-", i.e. a really long signature string
    could be {{{"iiiidddd-iiidddd-iiidddd-)d"}}}. That is, a "-" is
    inserted on all positions in the string divisible by 8, except the
    first.

The point is that if you know a signature, you can quickly scan
through the binary blob for the signature in 128-bit increments,
without worrying about the variable-size nature of each entry.  The
rules above protect against spurious matches.
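
A hedged sketch of such a scan follows. It makes assumptions the CEP
does not fix: the blob is terminated by an all-zero 64-bit chunk, the
needle has already been padded and split into 8-byte chunks with the
"-" continuation rule (exactly as stored), and a 64-bit platform is
assumed for brevity. A caller with hard-coded signatures would
typically inline the first chunk as a 64-bit literal or switch on it,
as noted in the Pros below; here it is computed at run time only to
keep the sketch short.
{{{
#include <stdint.h>
#include <string.h>

static void *
find_by_strcmp(const char *blob, const char *needle, size_t needle_bytes)
{
    const char *p = blob;
    uint64_t first;

    memcpy(&first, needle, 8);             /* first 8-byte chunk of the needle */
    for (;;) {
        uint64_t chunk;
        memcpy(&chunk, p, 8);
        if (chunk == 0)
            return NULL;                   /* assumed end-of-table marker */
        if (chunk == first && memcmp(p, needle, needle_bytes) == 0) {
            void *fp;
            memcpy(&fp, p + needle_bytes, sizeof fp);  /* funcptr follows the signature */
            return fp;
        }
        /* Entry sizes are multiples of 16 bytes, so stepping by 16 never
         * lands on a function pointer; continuation chunks start with '-'
         * and therefore cannot match a signature's first chunk. */
        p += 16;
    }
}
}}}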

==== Optional: Encoding ====

The strcmp approach can be made efficient for larger signatures by
using a more compact encoding than ASCII. E.g., an encoding could
use 4 bits for the 12 most common symbols and 8 bits
for 64 symbols (for a total of 76 symbols), of which some could be
letter combinations ("Zd", "T{"). This should be reasonably simple
to encode and decode.
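
The CEP does not fix a concrete encoding; purely as an illustration of
the nibble idea above, one possible packing (with the symbol tables
left abstract) could look like this:
{{{
#include <stddef.h>

/* Illustration only, not part of the CEP.  Symbols with index 0-11
 * (the 12 most common) take one nibble; indices 12-75 take two
 * nibbles, the first being a prefix value in the range 12-15 and the
 * second selecting one of 16 symbols per prefix (4*16 = 64). */
static size_t
encode_symbol(int sym_index, unsigned char *nibbles_out)
{
    if (sym_index < 12) {
        nibbles_out[0] = (unsigned char)sym_index;            /* 4 bits */
        return 1;
    }
    sym_index -= 12;                                          /* 0..63 */
    nibbles_out[0] = (unsigned char)(12 + (sym_index >> 4));  /* prefix nibble */
    nibbles_out[1] = (unsigned char)(sym_index & 0xF);        /* payload nibble */
    return 2;                                                 /* 8 bits */
}
}}}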

The CEP should provide C routines in a header file to work with the
signatures. Callers that wish to parse the format string and build a
call stack on the fly should probably work with the encoded
representation.

==== Discussion ====

'''The cost of comparing a signature''': For the vast majority of
functions, the cost is comparing a 64-bit number stored in the CPU
instruction stream (needle) to a value that is guaranteed to already
be in cache (candidate match).

'''Pros:'''

  * Readability-wise, one can use the C switch statement to dispatch

  * "Stateless data", for compiled code it does not require any
    run-time initialization like interning does

  * One less pointer-dereference in the common case of a short
    signature

'''Cons:'''

  * Long signatures will require more than 8 bytes to store and could
    thus be more expensive than interned strings

  * Format looks uglier in the form of literals in C source code


== Signature strings ==

Example: The function
{{{
int f(double x, float y);
}}}
would have the signature string {{{"df)i"}}} (or, to save space, 
{{{"idf"}}}).

Fields would follow the PEP 3118 extensions of the struct-module format
string, but with some modifications:

  * The format should be canonical and fit for {{{strcmp}}}-like
    comparison: No whitespace, no field names (TBD: what else?)

  * TBD: Information about GIL requirements (nogil, with gil?), how
    exceptions are reported

  * TBD: Support for Cython-specific constructs like memoryview slices
    (so that arrays with strides and shape can be passed faster than
    passing an {{{"O"}}}).

