[Cython] CEP1000: Native dispatch through callables

Robert Bradshaw robertwb at gmail.com
Sat Apr 14 01:46:39 CEST 2012


On Fri, Apr 13, 2012 at 3:06 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
>
>
> Robert Bradshaw <robertwb at gmail.com> wrote:
>
>>On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn
>><d.s.seljebotn at astro.uio.no> wrote:
>>> Ah, I didn't think about 6-bit codes or Huffman coding. Certainly helps.
>>
>>Yeah, we don't want to complicate the ABI too much, but I think
>>something like 8 4-bit common chars and 32 6-bit other chars (or 128
>>8-bit other chars) wouldn't be outrageous. The fact that we only have
>>to encode into a single word makes the algorithm very simple (though
>>the majority of the time we'd spit out pre-encoded literals). We have
>>a version number to play with this as well.
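
(To make the encoding concrete, a sketch of the kind of routine I have
in mind follows. The table of common chars, the 4-bit/9-bit split, and
all names here are placeholders, not anything decided:)

    #include <stdint.h>
    #include <string.h>

    /* Sketch only: 8 common signature chars get a 0 tag bit plus a
       3-bit index (4 bits total); any other char is escaped with a
       1 tag bit plus its raw 8-bit value (9 bits total).  Returns 0
       if the signature does not fit in a single 64-bit word. */
    static uint64_t encode_sig(const char *sig)
    {
        static const char common[8] = {'i','d','f','l','(',')','O','v'};
        uint64_t key = 0;
        unsigned bits = 0;
        for (; *sig; sig++) {
            uint64_t item;
            unsigned width;
            const char *p = memchr(common, *sig, 8);
            if (p) {
                item = (uint64_t)(p - common) << 1;              /* tag 0 */
                width = 4;
            } else {
                item = ((uint64_t)(unsigned char)*sig << 1) | 1; /* tag 1 */
                width = 9;
            }
            if (bits + width > 64)
                return 0;   /* too long; fall back to the char* form */
            key |= item << bits;
            bits += width;
        }
        return key;
    }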
>>
>>> I'm almost +1 on your proposal now, but a couple more ideas:
>>>
>>> 1) Let the key (the size_t) spill over to the next specialization
>>> entry if it is too large; and prepend that key with a continuation
>>> code (two size_ts could together say "iii)-d\0\0" on 32-bit systems
>>> with 8-bit encoding, using - as the continuation mark). The
>>> key-based caller will expect a continuation if it knows about the
>>> specialization, and the prepended char will prevent spurious
>>> matches against the overspilled slot.
>>>
>>> We could even use the pointers for part of the continuation...
>>>
>>> 2) Separate the char* format strings from the keys, i.e. this
>>> memory layout:
>>>
>>> Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr...
>>>
>>> Where nslots is larger than nspecs if there are continuations.
>>>
>>> OK, this is getting close to my original proposal, but the
>>> difference is the continuation char, so that if you expect a short
>>> signature, you can safely scan every slot with no branching and no
>>> null-checking necessary.
>>
>>I don't think we need nslots (though it might be interesting). My
>>thought is that once you start futzing with variable-length keys, you
>>might as well just compare char*s.
>
> This is where we disagree. If you are the caller, you know at compile time how much you want to match; I think comparing two or three size_ts with no looping is a lot better (a fully unrolled, 64-bits-per-instruction strcmp with one of the operands known to the compiler...).

Doesn't the compiler unroll strcmp much like this for a known operand?
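
(For concreteness, the kind of call-site code I'd expect with
fixed-width keys; spec2_t, lookup, and the two-word key width are
illustrative placeholders, not part of the CEP:)

    #include <stdint.h>

    typedef struct {
        uint64_t key[2];         /* signature encoded into two words */
        void (*funcptr)(void);
    } spec2_t;                   /* hypothetical layout for this sketch */

    /* key0/key1 would be compile-time constants emitted by Cython, so
       each slot costs two integer compares, with no strcmp loop. */
    static void *lookup(const spec2_t *specs, int nspecs,
                        uint64_t key0, uint64_t key1)
    {
        int i;
        for (i = 0; i < nspecs; i++)
            if (specs[i].key[0] == key0 && specs[i].key[1] == key1)
                return (void *)specs[i].funcptr;
        return 0;
    }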

>>If one is concerned about memory, one could force the sigcharptr to be
>>aligned, and then the "keys" could be either sigcharptr or key
>>depending on whether the least significant bit was set. One could
>>easily scan for/switch on a key and scanning for a char* would be
>>almost as easy (just don't dereference if the lsb is set).
>>
>>I don't see us being memory constrained, so
>>
>>(version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata*
>>
>>seems fine to me even if only one of key/sigcharptr is ever used per
>>spec. Null-terminating the specs would work fine as well (one less
>>thing to keep track of during iteration).
>
> Well, can't one always use more L1 cache, or is that not a concern? If you have 5-6 different routines calling each other using this mechanism, each with multiple specializations, those unused slots translate to many cache lines wasted.
>
> I don't think it is that important; I just think that how pretty the C struct declaration ends up looking should not be a concern at all, when the whole point of this is speed anyway. You can always just use a throwaway struct declaration and a cast to get whatever layout you need. If the 'padding' leads to less branching then fine, but I don't see that it helps in any way.

I was more concerned about guaranteeing each char* was aligned.
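
(A sketch of the tagging convention I mean, assuming every sigcharptr
is at least 2-byte aligned so its low bit is always clear; shifting
the key left by one is just one way to keep that bit free:)

    #include <stdint.h>

    /* A slot is either an encoded key (lsb set) or an aligned char*
       signature (lsb clear).  Names are illustrative only. */
    static int slot_is_key(uintptr_t slot)      { return (int)(slot & 1); }
    static uint64_t slot_key(uintptr_t slot)    { return (uint64_t)(slot >> 1); }
    static const char *slot_sig(uintptr_t slot) { return (const char *)slot; }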

> To refine my proposal a bit, we have a list of variable size entries,
>
> (keydata, keydata, ..., funcptr)
>
> where each keydata and the ptr is 64 bits on all platforms (see below); each entry must have a total length that is a multiple of 128 bits (so that one can safely scan for a signature in 128-bit increments *without* parsing or branching; you'll never hit a pointer), and each key word but the first starts with a 'dash'.

Ah, OK, similar to UTF-8. Yes, I like this idea.
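
(Spelling out what the caller gains: entries are padded to a multiple
of two 64-bit words with the funcptr last, so function pointers only
ever sit at odd word offsets and the even-offset scan never needs to
parse or NULL-check. A sketch for a caller whose signature fits one
word; all names are mine:)

    #include <stdint.h>
    #include <stddef.h>

    static void *scan_one_word_key(const uint64_t *table, size_t nwords,
                                   uint64_t key)
    {
        size_t i;
        for (i = 0; i < nwords; i += 2) {
            /* A one-word key embeds its terminator, while the first
               word of a longer key does not, and continuation words
               start with '-', so spurious matches are impossible. */
            if (table[i] == key)
                return (void *)(uintptr_t)table[i + 1];
        }
        return 0;   /* fall back to parsing the signature strings */
    }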

> Signature strings are either kept separate, or even parsed/decoded from the keys. We really only care about speed when you have compiled or JITed code for the case; decoding should be fine otherwise.

True.

> BTW, won't the Cython-generated C code be a horrible mess if we use size_t rather than insist on int64_t? (OK, those need some ifdefs for various compilers, but that still seems cleaner than operating with both 32-bit and 64-bit keys, and stdint.h is winning ground.)

Sure, we could require 64-bit keys (and pointer slots).
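
(Roughly, with the fallback branch being a guess at what the ifdefs
would need; MSVC only gained stdint.h with VS 2010:)

    #if defined(_MSC_VER) && _MSC_VER < 1600
        typedef unsigned __int64 cep1000_key_t;   /* pre-2010 MSVC */
    #else
        #include <stdint.h>
        typedef uint64_t cep1000_key_t;
    #endif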



On Fri, Apr 13, 2012 at 3:22 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
>>> I am really lost here. Why is any of this complicated encoding
>>> stuff better than interning? Interning takes one line of code, is
>>> incredibly cheap (one dict lookup per call site and function
>>> definition), and it lets you check any possible signature (even
>>> complicated ones involving memoryviews) by doing a single-word
>>> comparison. And best of all, you don't have to think hard to make
>>> sure you got the encoding right. ;-)
>>>
>>> On a 32-bit system, pointers are smaller than a 64-bit key, but
>>> more expressive! You can still do binary search if you want, etc.
>>> Is the problem just that interning requires a runtime calculation?
>>> Because I feel like C users (like NumPy) will want to compute these
>>> compressed codes at module init anyway, and those of us with a
>>> fancy compiler capable of computing them ahead of time (like
>>> Cython) can instruct that fancy compiler to compute them at
>>> module-init time just as easily?
>>
>>Good question.
>>
>>The primary disadvantage of interning that I see is memory locality. I
>>suppose if all the C-level caches of interned values were co-located,
>>this may not be as big of an issue. Not being able to compare against
>>compile-time constants may thwart some optimization opportunities, but
>>that's less clear.
>>
>>It also requires coordination through a common repository, but I
>>suppose one would just stick a set in some standard module (or
>>leverage Python's interning).
>
> More problems:
>
> 1) It doesn't work well with multiple interpreter states. OK, nothing works with that at the moment, but it is on the roadmap for Python and we should not make it worse.
>
> You basically *need* a thread-safe store separate from any Python interpreter; though pythread.h does not rely on the interpreter state, which helps.

I didn't know about the push for multiple interpreter states, but
yeah, that makes things much more painful.
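
(For reference, the interning variant being discussed is roughly the
sketch below; leaning on CPython's interned strings is just one
possible store, and as you note it ties the keys to an interpreter
state. PyUnicode_InternFromString is the Python 3 spelling; Python 2
has PyString_InternFromString:)

    #include <Python.h>

    /* Run once at module init: map each signature string to its
       canonical object.  Afterwards a call site compares one pointer
       word, e.g. `if (spec->sig_obj == expected_sig_obj) ...`. */
    static PyObject *intern_signature(const char *sig)
    {
        /* The same C string interned anywhere in the process yields
           the same pointer for the interpreter's lifetime. */
        return PyUnicode_InternFromString(sig);   /* new reference */
    }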

> 2) You end up with the known comparison values in read-write memory segments rather than read-only segments, which is probably worse on multicore systems?

Yeah, this is the kind of stuff I was vaguely worried about when I
wrote "Not being able to compare against compile-time constants may
thwart some optimization opportunities." I don't know what the impact
is, but it's worth trying to measure and take into account.

> I really think that anything we can do to make this near C speed should be done; none of the proposals are *that* complicated.
>
> Using keys, NumPy can choose in its C code to be slower but more readable; using interned strings forces Cython to be slower, with no way of choosing to go faster. (To the degree that any of this has an effect; none of these claims has been checked.)

Yep, agreed.

- Robert

