[Cython] CEP1000: Native dispatch through callables

Robert Bradshaw robertwb at gmail.com
Sun Apr 15 07:59:01 CEST 2012


On Sat, Apr 14, 2012 at 2:00 PM, mark florisson
<markflorisson88 at gmail.com> wrote:
> On 14 April 2012 20:08, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>>
>>> Travis Oliphant recently raised the issue on the NumPy list of what
>>> mechanisms to use to box native functions produced by his Numba so that
>>> SciPy functions can call them, e.g. (I'm making the numba part up):
>>
>> <snip>
>>
>> This thread is turning into one of those big ones...
>>
>> But I think it is really worth it in the end; I'm getting excited about the
>> possibility down the road of importing functions using normal Python
>> mechanisms and still have fast calls.
>>
>> Anyway, to organize discussion I've tried to revamp the CEP and describe
>> both the intern-way and the strcmp-way.
>>
>> The wiki appears to be down, so I'll post it below...
>>
>> Dag
>>
>> = CEP 1000: Convention for native dispatches through Python callables =
>>
>> Many callable objects are simply wrappers around native code.  This
>> holds for any Cython function, f2py functions, manually
>> written CPython extensions, Numba, etc.
>>
>> Obviously, when native code calls other native code, it would be
>> nice to skip the significant cost of boxing and unboxing all the arguments.
>> Early binding at compile-time is only possible
>> between different Cython modules, not between all the tools
>> listed above.
>>
>> [[enhancements/nativecall|CEP 523]] deals with Cython-specific aspects
>> (and is out-of-date w.r.t. this CEP); this CEP is intended to be about
>> a cross-project convention only. If successful, this CEP may be
>> proposed as a PEP in a modified form.
>>
>> Motivating example (looking a year or two into the future):
>>
>> {{{
>> @numba
>> def f(x): return 2 * x
>>
>> @cython.inline
>> def g(x : cython.double): return 3 * x
>>
>> from fortranmod import h
>>
>> print f(3)
>> print g(3)
>> print h(3)
>> print scipy.integrate.quad(f, 0.2, 3) # fast callback!
>> print scipy.integrate.quad(g, 0.2, 3) # fast callback!
>> print scipy.integrate.quad(h, 0.2, 3) # fast callback!
>>
>> }}}
>>
>> == The native-call slot ==
>>
>> We need ''fast'' access for probing whether a callable object supports
>> this CEP.  Other mechanisms, such as an attribute in a dict, are too
>> slow for many purposes (quoting robertwb: "We're trying to get a 300ns
>> dispatch down to 10ns; you do not want a 50ns dict lookup"). (Obviously,
>> if you call a callable in a loop you can fetch the pointer outside
>> of the loop. But in particular if this becomes a language feature
>> in Cython it will be used in all sorts of places.)
>>
>> So we hack another type slot into existing and future CPython
>> implementations in the following way: This CEP provides a C header
>> that, for all Python versions, defines a macro
>> {{{Py_TPFLAGS_UNOFFICIAL_EXTRAS}}} for a free bit in
>> {{{tp_flags}}} in the {{{PyTypeObject}}}.
>>
>> If present, then we extend {{{PyTypeObject}}}
>> as follows:
>> {{{
>> typedef struct {
>>    PyTypeObject tp_main;
>>    size_t tp_unofficial_flags;
>>    size_t tp_nativecall_offset;
>> } PyUnofficialTypeObject;
>> }}}
>>
>> {{{tp_unofficial_flags}}} is unused and should be all 0 for the time
>> being, but can be used later to indicate features beyond this CEP.
>>
>> If {{{tp_nativecall_offset != 0}}}, this CEP is supported, and
>> the information for doing a native dispatch on a callable {{{obj}}}
>> is located at
>> {{{
>> (char*)obj + ((PyUnofficialTypeObject*)obj->ob_type)->tp_nativecall_offset;
>> }}}
>>
>> === GIL-less access ===
>>
>> It is OK to access the native-call table without holding the GIL. This
>> should of course only be used to call functions that state in their
>> signature that they don't need the GIL.
>>
>> This is important for JITted callables that may want to rewrite their
>> table as more specializations get added; if one needs to reallocate
>> the table, the old table must linger long enough that all
>> threads that are currently accessing it are done with it.
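The retire-rather-than-free rule above can be sketched in Python (all names here are hypothetical; in C the retired table would be freed after a grace period, whereas here the GC keeps it alive for any lingering reader):

```python
import threading

class NativeCallTable:
    """Toy stand-in for a JIT's native-call table (hypothetical names).

    Readers may scan the table without holding the GIL, so a writer that
    grows the table must publish a fresh table and keep the old one alive
    until no reader can still hold a pointer into it.
    """

    def __init__(self):
        self._table = ()          # current published table (immutable tuple)
        self._retired = []        # old tables kept alive for lingering readers
        self._lock = threading.Lock()  # writers only; readers never lock

    def lookup(self, signature):
        # Readers take one snapshot reference; the snapshot stays valid
        # even if a writer publishes a new table concurrently.
        table = self._table
        for sig, funcptr in table:
            if sig == signature:
                return funcptr
        return None

    def add_specialization(self, signature, funcptr):
        with self._lock:
            old = self._table
            # Publish a new table; never mutate the one readers may hold.
            self._table = old + ((signature, funcptr),)
            # Retire rather than free: in C this is where a grace period
            # or reference scheme would be needed before reclaiming `old`.
            self._retired.append(old)
```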
>>
>> == Native dispatch descriptor ==
>>
>> The final format for the descriptor is not agreed upon yet; this sums
>> up the major alternatives.
>>
>> The descriptor should be a list of specializations/overloads, each
>> described by a function pointer and a signature specification
>> string, such as "id)i" for {{{int f(int, double)}}}.
>>
>> The way it is stored must cater for two cases; first, when the caller
>> expects one or more hard-coded signatures:
>> {{{
>> if (obj has signature "id)i") {
>>    call;
>> } else if (obj has signature "if)i") {
>>    call with promoted second argument;
>> } else {
>>    box all arguments;
>>    PyObject_Call;
>> }
>> }}}
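The hard-coded-signature case above can be sketched in Python (the helper name and the dict standing in for the native-call table are made up for illustration; the real probe would compare signature strings in C):

```python
def call_fast(obj, table, x, y):
    """Caller-side dispatch sketch: `obj` is the Python callable, `table`
    plays the role of its native-call table, mapping signature strings to
    unboxed functions.  The caller hard-codes the signatures it knows."""
    f = table.get("if)i")        # exact match: (int, float) -> int
    if f is not None:
        return f(x, y)
    f = table.get("id)i")        # else promote the float argument to double
    if f is not None:
        return f(x, float(y))
    return obj(x, y)             # else box everything: the PyObject_Call path
```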
>
> There may be a lot of promotion/demotion combinations (you likely only
> want the former), especially for multiple arguments, so perhaps it
> makes sense to limit ourselves a bit. For instance for numeric scalar
> argument types we could limit to long (and the unsigned counterparts),
> double and double complex.
>
> So char, short and int scalars will be
> promoted to long, float to double and float complex to double complex.
> Anything bigger, like long long etc., will be matched specifically.
> Promotions, and associated demotions if necessary in the callee, should
> be fairly cheap compared to checking all combinations or going through
> the python layer.
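The proposed convention amounts to a small rewrite table over signature characters. A minimal sketch, assuming the struct-module format letters used elsewhere in the CEP ('b'/'h'/'i' for char/short/int, 'f' for float, 'Z' prefixing complex types):

```python
# Promotion table: small scalar types collapse onto long / unsigned long /
# double / double complex; wider types like 'q' (long long) pass through.
PROMOTIONS = {
    'b': 'l', 'h': 'l', 'i': 'l',   # char, short, int -> long
    'B': 'L', 'H': 'L', 'I': 'L',   # unsigned counterparts -> unsigned long
    'f': 'd',                        # float -> double
    'Zf': 'Zd',                      # float complex -> double complex
}

def promote_signature(sig):
    """Rewrite a signature string so callers only ever see the promoted
    scalar types, shrinking the number of combinations to check."""
    out = []
    i = 0
    while i < len(sig):
        # 'Z' prefixes a complex type and consumes the next character too.
        tok = sig[i:i+2] if sig[i] == 'Z' else sig[i]
        out.append(PROMOTIONS.get(tok, tok))
        i += len(tok)
    return ''.join(out)
```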

True, though this could be a convention rather than a requirement of
the spec. Long vs. < long seems natural, but are there any systems
where (scalar) float still has an advantage over double?

Of course pointers like float* vs double* can't be promoted, so we
would still need this kind of type declaration.

>> The second is when a call stack is built dynamically while parsing the
>> string. Since this has higher overhead anyway, optimizing for the first
>> case makes sense.
>>
>> === Approach 1: Interning/run-time allocated IDs ===
>>
>>
>> 1A: Let each overload have a struct
>> {{{
>> struct {
>>    size_t signature_id;
>>    char *signature;
>>    void *func_ptr;
>> };
>> }}}
>> Within each process run, there is a 1:1 correspondence between {{{signature}}} and
>> {{{signature_id}}}. {{{signature_id}}} is allocated by some central
>> registry.
>>
>> 1B: Intern the string instead:
>> {{{
>> struct {
>>    char *signature; /* pointer must come from the central registry */
>>    void *func_ptr;
>> };
>> }}}
>> However, this is '''not''' trivial, since signature strings can
>> be allocated on the heap (e.g., a JIT would do this), so interned strings
>> must be memory managed and reference counted. This could be done by
>> having each object pass the signature string to the interning machinery
>> '''both''' when incref-ing and when decref-ing it.
>> Using Python {{{bytes}}} objects is another option.
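The incref/decref discipline for approach 1B can be sketched as follows (class and method names are hypothetical, not part of the CEP; in C the canonical pointer would be what callers compare for equality):

```python
class SignatureRegistry:
    """Sketch of the central interning registry for approach 1B.
    Heap-allocated signature strings (e.g. from a JIT) are reference
    counted so the canonical copy can be freed once every user has
    released it."""

    def __init__(self):
        self._interned = {}   # signature text -> [canonical str, refcount]

    def incref(self, signature):
        # First user of a signature establishes the canonical copy.
        entry = self._interned.setdefault(signature, [signature, 0])
        entry[1] += 1
        return entry[0]       # canonical pointer: compare by identity

    def decref(self, signature):
        entry = self._interned[signature]
        entry[1] -= 1
        if entry[1] == 0:     # last user gone: the canonical copy can go
            del self._interned[signature]
```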
>>
>> ==== Discussion ====
>>
>> '''The cost of comparing a signature''': Comparing a global variable
>> (needle) to a value that is guaranteed to already be in cache
>> (candidate match).
>>
>> '''Pros:'''
>>
>>  * Conceptually simple struct format.
>>
>> '''Cons:'''
>>
>>  * Requires a registry for interning strings. This must be
>>   "handshaked" between the implementors of this CEP (probably by
>>   "first to get at {{{sys.modules["_nativecall"]}}} sticks it there"),
>>   as we can't ship a common dependency library for this CEP.
>>
>> === Approach 2: Efficient strcmp of verbatim signatures ===
>>
>> The idea is to store the full signatures and the function pointers together
>> in the same memory area, but still have some structure to allow for quick
>> scanning through the list.
>>
>> Each entry has the structure {{{[signature_string, funcptr]}}}
>> where:
>>
>>  * The signature string has variable length, but the length is
>>   divisible by 8 bytes on all platforms. The {{{funcptr}}} is always
>>   8 bytes (it is padded on 32-bit systems).
>>
>>  * The total size of the entry should be divisible by 16 bytes (= the
>>   signature data should be 8 bytes, or 24 bytes, or...)
>>
>>  * All but the first chunk of signature data should start with a
>>   continuation character "-", i.e. a really long signature string
>>   could be {{{"iiiidddd-iiidddd-iiidddd-)d"}}}. That is, a "-" is
>>   inserted on all positions in the string divisible by 8, except the
>>   first.
>>
>> The point is that if you know a signature, you can quickly scan
>> through the binary blob for the signature in 128 bit increments,
>> without worrying about the variable size nature of each entry.  The
>> rules above protect against spurious matches.
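The chunking rules can be sketched in Python (the padding byte is an assumption; the CEP does not pin it down, and the alignment-to-16-bytes top-up is omitted for brevity):

```python
def pack_signature(sig, pad='\x00'):
    """Pack a signature into 8-byte chunks per the scheme above: the first
    chunk holds up to 8 signature characters, and every later chunk starts
    with the continuation marker '-' followed by up to 7 more."""
    chunks = [sig[:8].ljust(8, pad)]
    rest = sig[8:]
    while rest:
        chunks.append(('-' + rest[:7]).ljust(8, pad))
        rest = rest[7:]
    return ''.join(chunks)

def find_entry(entries, needle_packed):
    """Scan [(packed_sig, funcptr), ...] comparing whole packed blobs;
    the leading '-' on continuation chunks is what prevents a short
    needle from spuriously matching the middle of a longer signature."""
    for packed, funcptr in entries:
        if packed == needle_packed:
            return funcptr
    return None
```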

Note that these two approaches need not be mutually exclusive; a
cutoff could be established giving the best (and worst) of both.

>> ==== Optional: Encoding ====
>>
>> The strcmp approach can be made efficient for larger signatures by
>> using a more efficient encoding than ASCII. E.g., an encoding could
>> use 4 bits for the 12 most common symbols and 8 bits
>> for 64 symbols (for a total of 76 symbols), of which some could be
>> letter combinations ("Zd", "T{"). This should be reasonably simple
>> to encode and decode.
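One way to realize this is to let 12 of the 16 nibble values stand for common symbols directly and reserve the remaining 4 as escape prefixes, each selecting a "page" of 16 two-nibble symbols (4 x 16 = 64, for 76 symbols total under this sketch). All code assignments below are made up for illustration; the CEP does not fix them:

```python
COMMON = list("idlf)O-sc?bh")          # 12 one-nibble symbols (assumed)
EXTENDED = ["Zd", "Zf", "T{", "}"] + [None] * 60   # 64 two-nibble symbols

def encode(symbols):
    """Encode a list of symbols into a nibble stream (list of ints 0-15)."""
    nibbles = []
    for sym in symbols:
        if sym in COMMON:
            nibbles.append(COMMON.index(sym))
        else:
            idx = EXTENDED.index(sym)
            nibbles.append(12 + idx // 16)   # escape nibble selects a page
            nibbles.append(idx % 16)         # second nibble selects the symbol
    return nibbles

def decode(nibbles):
    """Inverse of encode(): escape nibbles (>= 12) consume one extra nibble."""
    symbols, it = [], iter(nibbles)
    for n in it:
        if n < 12:
            symbols.append(COMMON[n])
        else:
            symbols.append(EXTENDED[(n - 12) * 16 + next(it)])
    return symbols
```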
>>
>> The CEP should provide C routines in a header file to work with the
>> signatures. Callers that wish to parse the format string and build a
>> call stack on the fly should probably work with the encoded
>> representation.
>>
>> ==== Discussion ====
>>
>> '''The cost of comparing a signature''': For the vast majority of
>> functions, the cost is comparing a 64-bit number stored in the CPU
>> instruction stream (needle) to a value that is guaranteed to already
>> be in cache (candidate match).
>>
>> '''Pros:'''
>>
>>  * Readability-wise, one can use the C switch statement to dispatch
>>
>>  * "Stateless data", for compiled code it does not require any
>>   run-time initialization like interning does
>>
>>  * One less pointer-dereference in the common case of a short
>>   signature
>>
>> '''Cons:'''
>>
>>  * Long signatures will require more than 8 bytes to store and could
>>   thus be more expensive than interned strings
>>
>>  * Format looks uglier in the form of literals in C source code
>>
>>
>> == Signature strings ==
>>
>> Example: The function
>> {{{
>> int f(double x, float y);
>> }}}
>> would have the signature string {{{"df)i"}}} (or, to save space,
>> {{{"idf"}}}).
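Building such a string from a C prototype is mechanical; a hypothetical helper (the type table here is a small illustrative subset, and the `")return"` placement follows the {{{"df)i"}}} example above):

```python
# Hypothetical mapping from C scalar types to the struct-module format
# characters the CEP builds on; not part of the CEP itself.
C_TO_CHAR = {"int": "i", "double": "d", "float": "f",
             "long": "l", "void*": "P", "PyObject*": "O"}

def signature_string(ret, args):
    """Build a signature like "df)i" for int f(double, float)."""
    return ''.join(C_TO_CHAR[a] for a in args) + ')' + C_TO_CHAR[ret]
```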
>>
>> Fields would follow the PEP3118 extensions of the struct-module format
>> string, but with some modifications:
>>
>>  * The format should be canonical and fit for {{{strcmp}}}-like
>>   comparison: No whitespace, no field names (TBD: what else?)
>
> I think alignment is also a troublemaker. Maybe we should allow '@'
> (which cannot appear in the character string but will be the default,
> that is native size, alignment and byteorder) and '^', unaligned
> native size and byteorder (to be used for packed structs).
>
>>  * TBD: Information about GIL requirements (nogil, with gil?), how
>>   exceptions are reported
>
> Maybe that could be a separate list, to be consulted mostly for
> explicit casts (I think PyErr_Occurred() would be the default for
> non-object return types).
>
>>  * TBD: Support for Cython-specific constructs like memoryview slices
>>   (so that arrays with strides and shape can be passed faster than
>>   passing an {{{"O"}}}).
>
> Definitely, maybe something simple like M{1f}, for a 1D memoryview
> slice of floats.

It would certainly be useful to have special syntax for memory views
(after nailing down a well-defined ABI for them) and builtin types.
Being able to declare something as taking a
"sage.rings.integer.Integer" could also prove useful, but could result
in long (and prefix-sharing) signatures, favoring the
runtime-allocated ids.


More information about the cython-devel mailing list