[Cython] CEP1000: Native dispatch through callables

mark florisson markflorisson88 at gmail.com
Sat May 5 19:13:04 CEST 2012


On 5 May 2012 17:27, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
> On 05/05/2012 01:08 PM, mark florisson wrote:
>>
>> On 3 May 2012 13:24, Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no>
>>  wrote:
>>>
>>> I'm afraid I'm going to try to kick this thread alive again. I want us to
>>> have something that Travis can implement in numba and "his" portion of
>>> SciPy, and also that could be used by NumPy devs.
>>>
>>> Since the decisions are rather arbitrary, perhaps we can try to quickly
>>> get
>>> to the "+1" stage (or, depending on how things turn out, a tournament
>>> starting with at most one proposal per person).
>>>
>>>
>>> On 04/20/2012 09:30 AM, Robert Bradshaw wrote:
>>>>
>>>>
>>>> On Thu, Apr 19, 2012 at 6:18 AM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no>    wrote:
>>>>>
>>>>>
>>>>> On 04/19/2012 01:20 PM, Nathaniel Smith wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 19, 2012 at 11:56 AM, Dag Sverre Seljebotn
>>>>>> <d.s.seljebotn at astro.uio.no>      wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I thought of some drawbacks of getfuncptr:
>>>>>>>
>>>>>>>  - Important: Doesn't allow you to actually inspect the supported
>>>>>>> signatures, which is needed (or at least convenient) if you want to
>>>>>>> use an FFI library or do some JIT-ing. So an iteration mechanism is
>>>>>>> still needed in addition, meaning the number of things for the object
>>>>>>> to implement grows a bit large. Default implementations help -- OTOH
>>>>>>> there really wasn't a major drawback with the table approach as long
>>>>>>> as JITs can just replace it?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> But this is orthogonal to the table vs. getfuncptr discussion. We're
>>>>>> assuming that the table might be extended at runtime, which means you
>>>>>> can't use it to determine which signatures are supported. So we need
>>>>>> some sort of extra interface for the caller and callee to negotiate a
>>>>>> type anyway. (I'm intentionally agnostic about whether it makes more
>>>>>> sense for the caller or the callee to be doing the iterating... in
>>>>>> general type negotiation could be quite complicated, and I don't think
>>>>>> we know enough to get that interface right yet.)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Hmm. Right. Let's define an explicit goal for the CEP then.
>>>>>
>>>>> What I care about is getting the spec right enough such that, e.g.,
>>>>> NumPy and SciPy, and other (mostly manually written) C extensions with
>>>>> slow development pace, can be forward-compatible with whatever crazy
>>>>> things Cython or Numba does.
>>>>>
>>>>> There's 4 cases:
>>>>>
>>>>>  1) JIT calls JIT (ruled out straight away)
>>>>>
>>>>>  2) JIT calls static: Say that Numba wants to optimize calls to np.sin
>>>>> etc. without special-casing; this seems to require reading a table of
>>>>> static signatures.
>>>>>
>>>>>  3) Static calls JIT: This is the case when scipy.integrate routines
>>>>> call a Numba callback and Numba generates a specialization for the
>>>>> dtype they explicitly need. This calls for getfuncptr (but perhaps in a
>>>>> form which we can't quite determine yet?).
>>>>>
>>>>>  4) Static calls static: Either table or getfuncptr works.
>>>>>
>>>>> My gut feeling is to go for 2) and 4) in this round => table.
>>>>
>>>>
>>>>
>>>> getfuncptr is really simple and flexible, but I'm with you on both of
>>>> these two points, and the overhead was not trivial.
>>>
>>>
>>>
>>> It's interesting to hear you say the overhead was not trivial (that was
>>> my hunch too but I sort of yielded to peer pressure). I think SAGE has
>>> some history with this -- isn't one of the reasons for the "cpdef" vs.
>>> "cdef" split that "cpdef" has the cost of a single lookup for the
>>> presence of a __dict__ on the object, which was an unacceptable penalty
>>> for parts of Sage? That can't have been much more than a 1ns penalty per
>>> instance.
>>>
>>>
>>>> Of course we could offer both, i.e. look at the table first, if it's
>>>> not there call getfuncptr if it's non-null, then fall back to "slow"
>>>> call or error. These are all opt-in depending on how hard you want to
>>>> try to optimize things.
>>>
>>>
>>>
>>> That's actually exactly what I was envisioning -- in time (with JITs on
>>> both ends) the table could act sort of as a cache for commonly used
>>> overloads, and getfuncptr would access the others more slowly.
>>>
>>>
>>>> As far as keys vs. interning, I'm also tempted to try to have my cake
>>>> and eat it too. Define a space-friendly encoding for signatures and
>>>> require interning for anything that doesn't fit into a single
>>>> sizeof(void*). The fact that this cutoff would vary for 32 vs 64-bit
>>>> would require some care, but could be done with macros in C. If the
>>>> signatures produce non-aligned "pointer" values there won't be any
>>>> collisions, and this way libraries only have to share in the global
>>>> (Python-level?) interning scheme iff they want to expose/use "large"
>>>> signatures.
>>>
>>>
>>>
>>> That was the approach I described to Nathaniel as having the "worst
>>> features of both" -- lack of readable gdb dumps of the keys, and having
>>> to define an interning mechanism for use by the 5% of cases that don't
>>> fit.
>>>
>>> To sum up what's been said earlier: The only thing that would blow the
>>> key size above 64 bits, other than very many arguments, would be things
>>> like classes/interfaces/vtables. But in that case, reasonable-sized keys
>>> for the vtables can be computed (whether by interning, cryptographic
>>> hashing, or a GUID like Microsoft COM).
>>>
>>> So I'm still +1 on my proposal; but I would be happy with an intern-based
>>> proposal if somebody bothers to flesh it out a bit (I don't quite know
>>> how I'd do it and would get lost in PyObject* vs. char* and
>>> cross-language state sharing...).
>>>
>>> My proposal in summary:
>>>
>>>  - Table with variable-sized entries (not getfuncptr, not interning) that
>>> can be scanned by the caller in 128-bit increments.
>>
>>
>> Hm, so the caller knows what kind of key it needs to compare to, so if
>> it has a 64-bit key then it won't need to compare 128 bits (padded
>> with zeroes?). But if it doesn't compare 128 bits, then it means 128-bit
>> keys cannot have 64-bit keys as a prefix. Will that be a problem, or
>
>
> Did you read the CEP? I also clarified this in a post in response to
> Nathaniel. The idea is that the scanner doesn't need to branch on the key
> length anywhere. This requires a) making each key n*64 bits long where n
> is odd => function pointers are always at (m*128 + 64) bits from the start
> for some non-negative integer m, and b) inserting some protective prefix
> for every 128 bits in the key.
>
>

Oh sorry, I didn't read the updated CEP; you want arbitrarily sized
keys. I assumed you would hash any key larger than 64 bits to a 128-bit
key (e.g. md5). For instance, if you have a large(r) number of
signatures, some of which are complex (greater than 64 bits, so hashed)
and some of which are simple, then if you know the signature you need
in advance, you can either follow the pointer to the 128-bit keys, or
skip the pointer entirely and continue with the 64-bit keys. I suppose
the common case is a few signatures, in which case a linear scan is
likely faster even in the (uncommon) 128-bit case.
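
Just to check I follow the layout now: a caller whose signature fits in
a single 64-bit key could scan in fixed 128-bit steps, something along
these lines. This is only a rough sketch -- the names are made up, I'm
ignoring multi-word keys, and I'm assuming a zero key word terminates
the table:

    #include <stdint.h>
    #include <stddef.h>

    /* The table as a flat sequence of 64-bit words, terminated (assumption)
       by a zero key word.  Because every key takes an odd number of words,
       the function pointer slot always sits at an odd word index, so the
       caller never branches on key length. */
    static uint64_t
    cep_lookup64(const uint64_t *table, uint64_t wanted)
    {
        size_t i;
        for (i = 0; table[i] != 0; i += 2) {
            /* Words of longer keys carry the protective prefix, so they
               never compare equal to a valid one-word key. */
            if (table[i] == wanted)
                return table[i + 1];   /* the 64-bit funcptr slot */
        }
        return 0;                      /* not found -> fall back to slow call */
    }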

>> would it make sense to make the first entry a pointer pointing to
>> 128-bit keys, and the rest are all 64-bit keys (or even 32-bit keys and
>> two pointers)? e.g. a contiguous list of [128-bit key/pointer
>> list-pointer, 64-bit keys & func pointers, 128-bit keys & func
>> pointers, NULL]
>
>
> I don't really understand this description, but in general I'm sceptical
> about the pipelining abilities of pointer-chasing code. It may be OK, but it
> would require a benchmark, and if there's not a reason to have it...
>
>
>> Even with a naive encoding scheme you could encode 3 scalar arguments
>> and a return value in 32 bits (e.g. 'dddd'). That might be better on
>> x86?
>
>
> Robert and I have been assuming some non-ASCII encoding that would allow
> many more arguments in 64 bits.
>

Sure. The point was that even the worst encoding scheme can already
serve the common case.
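
Roughly like this, say -- just a sketch, not the encoding the CEP would
actually specify, and encode_naive is a made-up name:

    #include <stdint.h>
    #include <string.h>

    /* Pack up to 8 ASCII type codes, e.g. "dddd" for a double-returning
       function of three doubles, into a single 64-bit key.  A real encoding
       would use fewer bits per code and fit more arguments, but this already
       covers the common short signatures. */
    static uint64_t
    encode_naive(const char *sig)
    {
        uint64_t key = 0;
        size_t n = strlen(sig);
        if (n > 8)
            return 0;          /* too long -- needs interning/hashing instead */
        memcpy(&key, sig, n);  /* note: resulting value depends on endianness */
        return key;
    }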

>>
>>>  - Only use 64-bit pointers, in order to keep the table format the same
>>> on 32-bit and 64-bit platforms.
>>
>>
>> Pointer to the function? I think that would only be harder to use than
>> native pointers?
>
>
> That was to make the multiple-of-128-bit-entry idea work without having to
> require that keys are different between 32-bit and 64-bit platforms.

Right, I got that now (reading the CEP is kind of mandatory :). Thanks.
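
So on a 32-bit platform an entry would look roughly like the struct
below, with the native pointer zero-extended into the fixed 64-bit slot
and narrowed back before the call. Again only a sketch with made-up
names:

    #include <stdint.h>

    /* Fixed-width entry: identical layout and offsets on 32- and 64-bit
       platforms, because the funcptr slot is always 64 bits wide. */
    typedef struct {
        uint64_t key;       /* encoded signature (single-word case) */
        uint64_t funcptr;   /* store (uint64_t)(uintptr_t)func */
    } cep_entry_t;

    typedef double (*dd_func)(double);

    static dd_func
    as_dd_func(const cep_entry_t *entry)
    {
        /* Narrow back to a native function pointer before calling
           (not strictly portable ISO C, but universally supported). */
        return (dd_func)(uintptr_t)entry->funcptr;
    }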

> Dag
>
>
>>>  - Encode the signature strings. Utility functions to work with this
>>> (both to scan tables and encode/decode a format string) will be provided
>>> as C code by the CEP that can be bundled.
>>>
>>> Pros:
>>>
>>>  - The table format is not specific to the Python world (it makes as
>>> much sense to use, e.g., internally in Julia)
>>>
>>>  - No state needs to be shared between packages at runtime (they can use
>>> the bundled C code in isolation if they wish)
>>>
>>>  - No need for an interning machinery
>>>
>>>  - More easily compatible with multiple interpreter states (?)
>>>
>>>  - Minor performance benefit of table over getfuncptr (intern vs. key
>>> didn't matter). [Cue comment that this doesn't matter.]
>>>
>>> Cons:
>>>
>>>  - Lack of the instant low-level debuggability you get in the interned
>>> case (a human needs to run a function on the key constant to see what it
>>> corresponds to)
>>>
>>>  - Not as extendable as getfuncptr (though currently we don't quite know
>>> how we would extend it, and it's easy to add getfuncptr in the future)
>>>
>>> Notes:
>>>
>>>  - When extended to handle vtable argument types, there still needs to be
>>> some interning or crypto-hashing. But that is likely to come up anyway as
>>> part of a COM-like queryInterface protocol, and at that point we will be
>>> better at making those decisions and designing a good interning mechanism.
>>>
>>> Dag
>>>