[Cython] CEP1000: Native dispatch through callables

Nathaniel Smith njs at pobox.com
Sat Apr 14 02:19:41 CEST 2012


On Fri, Apr 13, 2012 at 11:22 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
>
>
> Robert Bradshaw <robertwb at gmail.com> wrote:
>
>>On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn
>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>> Ah, I didn't think about 6-bit or huffman. Certainly helps.
>>>>
>>>> I'm almost +1 on your proposal now, but a couple more ideas:
>>>>
>>>> 1) Let the key (the size_t) spill over to the next specialization
>>>> entry if it is too large; and prepend that key with a continuation
>>>> code (two size_t's could together say "iii)-d\0\0" on 32-bit
>>>> systems with 8-bit encoding, using - as continuation). The
>>>> key-based caller will expect a continuation if it knows about the
>>>> specialization, and the prepended char will prevent spurious
>>>> matches against the overspilled slot.
>>>>
>>>> We could even use the pointers for part of the continuation...
>>>
>>> I am really lost here. Why is any of this complicated encoding
>>> stuff better than interning? Interning takes one line of code, is
>>> incredibly cheap (one dict lookup per call site and function
>>> definition), and it lets you check any possible signature (even
>>> complicated ones involving memoryviews) by doing a single-word
>>> comparison. And best of all, you don't have to think hard to make
>>> sure you got the encoding right. ;-)
>>>
>>> On a 32-bit system, pointers are the same size as a size_t, but
>>> more expressive! You can still do binary search if you want, etc.
>>> Is the problem just that interning requires a runtime calculation?
>>> Because I feel like C users (like numpy) will want to compute these
>>> compressed codes at module init anyway, and those of us with a
>>> fancy compiler capable of computing them ahead of time (like
>>> Cython) can instruct that fancy compiler to compute them at
>>> module-init time just as easily?
>>
>>Good question.
>>
>>The primary disadvantage of interning that I see is memory locality. I
>>suppose if all the C-level caches of interned values were co-located,
>>this may not be as big of an issue. Not being able to compare against
>>compile-time constants may thwart some optimization opportunities, but
>>that's less clear.

I would like to see some demonstration of this. E.g., you can run this:

  echo -e '#include <string.h>\nint main(int argc, char ** argv) {
  return strcmp(argv[0], "a"); }' | gcc -S -x c - -o - -O2 | less

Looks to me like for a short, known-at-compile-time string, with
optimization on, gcc implements it by basically sticking the string in
a global variable and then using a pointer... (If I do argv[0] ==
(char *)0x1234, then it places the constant value directly into the
instruction stream. Strangely enough, it does *not* inline the
constant value even if I do memcmp(&argv[0], "\1\2\3\4", 4), which
should be exactly equivalent...!)

I think gcc is just as likely to stick a bunch of
  static void * interned_dd_to_d;
  static void * interned_ll_to_l;
next to each other in the memory image as it is to stick a bunch of
equivalent manifest constants. If you're worried, make it
  static void * interned_signatures[NUM_SIGNATURES];
-- then they'll definitely be next to each other.

>>It also requires coordinating a common repository, but I suppose one
>>would just stick a set in some standard module (or leverage Python's
>>interning).
>
> More problems:
>
> 1) It doesn't work well with multiple interpreter states. Ok, nothing works with that at the moment, but it is on the roadmap for Python and we should not make it worse.

This isn't a criticism, but I'd like to see a reference to the work in
this direction! My impression was that it's been on the roadmap for
maybe a decade, in a really desultory fashion:
  http://docs.python.org/faq/library.html#can-t-we-get-rid-of-the-global-interpreter-lock
So if it's actually happening that's quite interesting.

> You basically *need* a thread-safe store separate from any Python interpreter; though pythread.h does not rely on the interpreter state, which helps.

Anyway, yes, if you can't rely on the interpreter then you'd need some
place to store the intern table, but I'm not sure why this would be a
problem (in Python 3.6 or whenever it becomes relevant).
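
For concreteness, such a store could be as simple as the toy table
from the sketch above wrapped in a pythread.h lock, which doesn't
touch any interpreter state. A sketch (it assumes whoever loads the
library first calls the init function before any threads can race):

  #include "pythread.h"

  static PyThread_type_lock intern_lock;

  static void
  intern_store_init(void)    /* call once, before concurrent use */
  {
      intern_lock = PyThread_allocate_lock();
  }

  static void *
  intern_signature_threadsafe(const char *sig)
  {
      void *result;
      PyThread_acquire_lock(intern_lock, WAIT_LOCK);
      result = intern_signature(sig);    /* the toy table above */
      PyThread_release_lock(intern_lock);
      return result;
  }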

> 2) You end up with the known comparison values in read-write memory segments rather than read-only segments, which is probably worse on multicore systems?

Is it? Can you elaborate? Cache ping-ponging is certainly bad, but
that happens when multiple cores are writing to the same cache line;
I can't see how the TLB flags would matter.

I guess the problem would be if you also had some other data in the
global variable space that you write to constantly, and it turned out
to be placed next to these read-only comparison values in the same
cache line?
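
If that were a real worry, it's cheap to rule out; e.g. a GCC-specific
sketch (64 bytes is an assumption about the cache line size):

  /* Start the table of interned pointers on a cache-line boundary
     (and pad it to a multiple of 64 to be thorough), so that hot
     read-write globals don't land in the same line. */
  static void *interned_signatures[NUM_SIGNATURES]
      __attribute__ ((aligned (64)));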

> I really think that anything that we can do to make this near-c-speed should be done; none of the proposals are *that* complicated.

I agree, but I object to codifying the waving of dead chickens. :-)

> Using keys, NumPy can choose in its C code to be slower but more readable; but using interned strings forces Cython to be slower, and Cython gets no way of choosing to go faster. (To the degree that it has an effect; none of these claims were checked.)

I think the only slowdown we know of is a few dict lookups at module load time.
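
(In Python 3 C-API terms that's one call per signature at import time;
a sketch, with an illustrative signature string:)

  #include <Python.h>

  static PyObject *sig_dd_to_d;    /* canonical object; compared by
                                      pointer at call time */

  static int
  intern_at_module_init(void)
  {
      sig_dd_to_d = PyUnicode_InternFromString("dd->d");
      return (sig_dd_to_d == NULL) ? -1 : 0;
  }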

- N

