[Cython] CEP1000: Native dispatch through callables

Stefan Behnel stefan_ml at behnel.de
Sat Apr 14 23:02:05 CEST 2012


Hi,

thanks for writing this up. Comments inline as I read through it.

Dag Sverre Seljebotn, 14.04.2012 21:08:
> === GIL-less accesss ===
> 
> It is OK to access the native-call table without holding the GIL. This
> should of course only be used to call functions that state in their
> signature that they don't need the GIL.
> 
> This is important for JITted callables who would like to rewrite their
> table as more specializations gets added; if one needs to reallocate
> the table, the old table must linger along long enough that all
> threads that are currently accessing it are done with it.

The problem here is that changing the table in the face of threaded access
is very likely to introduce race conditions, and the average library out
there won't know when all threads are done with it. I don't think later
modification is a good idea.


> == Native dispatch descriptor ==
> 
> The final format for the descriptor is not agreed upon yet; this sums
> up the major alternatives.
> 
> The descriptor should be a list of specializations/overload

While overloaded signatures are great for the callee, they make things much
more complicated for the caller. It's no longer just one signature that
either matches or not. Especially when we allow more than one expected
signature, then each of them has to be compared against all exported
signatures.

We'll have to see what the runtime impact and the impact on the code
complexity is, I guess.


> each described by a function pointer and a signature specification
> string, such as "id)i" for {{{int f(int, double)}}}.

How do we deal with object argument types? Do we care on the caller side?
Functions might have alternative signatures that differ in the type of
their object parameters. Or should we handle this inside of the caller and
expect that it's something like a fused function with internal dispatch in
that case?

Personally, I think there is not enough to gain from object parameters that
we should handle it on the caller side. The callee can dispatch those if
necessary.

What about signatures that require an object when we have a C typed value?

What about signatures that require a C typed argument when we have an
arbitrary object value in our call parameters?

We should also strip the "self" argument from the parameter list of
methods. That's handled by the attribute lookup before even getting at the
callable.


> === Approach 1: Interning/run-time allocated IDs ===
> 
> 
> 1A: Let each overload have a struct
> {{{
> struct {
>     size_t signature_id;
>     char *signature;
>     void *func_ptr;
> };
> }}}
> Within each process run, there is a 1:1

mapping/relation

> between {{{signature}}} and
> {{{signature_id}}}. {{{signature_id}}} is allocated by some central
> registry.
> 
> 1B: Intern the string instead:
> {{{
> struct {
>     char *signature; /* pointer must come from the central registry */
>     void *func_ptr;
> };
> }}}
> However this is '''not'' trivial, since signature strings can
> be allocated on the heap (e.g., a JIT would do this), so interned strings
> must be memory managed and reference counted.

Not necessarily, they are really short strings that could just live
forever, stored efficiently by the registry in a series of larger memory
blocks. It would take a while to fill up enough memory with those to become
problematic. Finding an efficiently lookup scheme for them might become
interesting at some point, but that would also take a while.

I don't expect real-world systems to have to deal with thousands of
different runtime(!) discovered signatures during one interpreter lifetime.


> ==== Discussion ====
> 
> '''The cost of comparing a signature''': Comparing a global variable (needle)
> to a value that is guaranteed to already be in cache (candidate match)
> 
> '''Pros:'''
> 
>  * Conceptually simple struct format.
> 
> '''Cons:'''
> 
>  * Requires a registry for interning strings. This must be
>    "handshaked" between the implementors of this CEP (probably by
>    "first to get at {{{sys.modules["_nativecall"}}} sticks it there),
>    as we can't ship a common dependency library for this CEP.

... which would eventually end up in the stdlib, but could equally well
come from PyPI for now. I don't see a problem with that.

Using sys.modules (or another global store) instead of an explicit import
allows for dependency injection, that's good.


> === Approach 2: Efficient strcmp of verbatim signatures ===
> 
> The idea is to store the full signatures and the function pointers together
> in the same memory area, but still have some structure to allow for quick
> scanning through the list.
> 
> Each entry has the structure {{{[signature_string, funcptr]}}}
> where:
> 
>  * The signature string has variable length, but the length is
>    divisible by 8 bytes on all platforms. The {{{funcptr}}} is always
>    8 bytes (it is padded on 32-bit systems).
> 
>  * The total size of the entry should be divisible by 16 bytes (= the
>    signature data should be 8 bytes, or 24 bytes, or...)
> 
>  * All but the first chunk of signature data should start with a
>    continuation character "-", i.e. a really long signature string
>    could be {{{"iiiidddd-iiidddd-iiidddd-)d"}}}. That is, a "-" is
>    inserted on all positions in the string divisible by 8, except the
>    first.
> 
> The point is that if you know a signature, you can quickly scan
> through the binary blob for the signature in 128 bit increments,
> without worrying about the variable size nature of each entry.  The
> rules above protects against spurious matches.

Sounds pretty fast to me. Absolutely worth trying. And if we store the
signature we compare against in the same format, we won't have to parse the
signature string as such, we can really just compare the numeric values.
Assuming that's really fast, that would allow the callee to optimistically
export additional signatures, e.g. with compatible subtypes or easily
coercible types, ordered by the expected overhead of processing the
arguments (and the expected probability of being called), so that the
caller would automatically hit the fastest call path first when traversing
the list from start to end. The number of possible signatures would
obviously explode at some point...

Note that JITs could still be smart enough to avoid the traversal after a
few loop iterations.

One problem: if any of the call parameters is a plain object type, identity
matches may not work anymore because we won't know what signature to expect.


> ==== Optional: Encoding ====
> 
> The strcmp approach can be made efficient for larger signatures by
> using a more efficient encoding than ASCII. E.g., an encoding could
> use 4 bits for the 12 most common symbols and 8 bits
> for 64 symbols (for a total of 78 symbols), of which some could be
> letter combinations ("Zd", "T{"). This should be reasonably simple
> to encode and decode.
> 
> The CEP should provide C routines in a header file to work with the
> signatures. Callers that wish to parse the format string and build a
> call stack on the fly should probably work with the encoded
> representation.

Huffman codes can be processed bitwise from start to end, that would work.

However, this would quickly die when we start adding arbitrary object
types. That would require a global registry for user types again. A reason
not to care about object types at the caller.

Also, how do we encode struct/union argument types?


> ==== Discussion ====
> 
> '''The cost of comparing a signature''': For the vast majority of
> functions, the cost is comparing a 64-bit number stored in the CPU
> instruction stream (needle) to a value that is guaranteed to already
> be in cache (candidate match).
> 
> '''Pros:'''
> 
>  * Readability-wise, one can use the C switch statement to dispatch
> 
>  * "Stateless data", for compiled code it does not require any
>    run-time initialization like interning does
> 
>  * One less pointer-dereference in the common case of a short
>    signature
> 
> '''Cons:'''
> 
>  * Long signatures will require more than 8 bytes to store and could
>    thus be more expensive than interned strings

We could also ignore trailing arguments and only dispatch based on a fixed
number of first arguments. Callees with more arguments would then simply
not export native signatures.


>  * Format looks uglier in the form of literals in C source code

They are not meant for reading, and we can always generate a comment with a
spelled-out readable signature next to it.


> == Signature strings ==
> 
> Example: The function
> {{{
> int f(double x, float y);
> }}}
> would have the signature string {{{"df)i"}}} (or, to save space, {{{"idf"}}}).
> 
> Fields would follow the PEP3118 extensions of the struct-module format
> string, but with some modifications:
> 
>  * The format should be canonical and fit for {{{strcmp}}}-like
>    comparison: No whitespace, no field names (TBD: what else?)
> 
>  * TBD: Information about GIL requirements (nogil, with gil?), how
>    exceptions are reported

What about C++, including C++ exceptions?


>  * TBD: Support for Cython-specific constructs like memoryview slices
>    (so that arrays with strides and shape can be passed faster than
>    passing an {{{"O"}}}).

Is this really Cython specific or would a generic Py_buffer struct work?

Stefan


More information about the cython-devel mailing list