[Cython] CEP1000: Native dispatch through callables

Sat Apr 14 10:36:55 CEST 2012

Nathaniel Smith <njs at pobox.com> wrote:

>On Fri, Apr 13, 2012 at 11:22 PM, Dag Sverre Seljebotn
><d.s.seljebotn at astro.uio.no> wrote:
>>
>>
>> Robert Bradshaw <robertwb at gmail.com> wrote:
>>
>>>On Fri, Apr 13, 2012 at 2:24 PM, Nathaniel Smith <njs at pobox.com>
>wrote:
>>>> On Fri, Apr 13, 2012 at 9:27 PM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>> Ah, I didn't think about 6-bit or huffman. Certainly helps.
>>>>>
>>>>> I'm almost +1 on your proposal now, but a couple of more ideas:
>>>>>
>>>>> 1) Let the key (the size_t) spill over to the next specialization
>>>entry if
>>>>> it is too large; and prepend that key with a continuation code
>(two
>>>size-ts
>>>>> could together say "iii)-d\0\0" on 32 bit systems with 8bit
>>>encoding, using
>>>>> - as continuation). The key-based caller will expect a
>continuation
>>>if it
>>>>> knows about the specialization, and the prepended char will
>prevent
>>>spurios
>>>>> matches against the overspilled slot.
>>>>>
>>>>> We could even use the pointers for part of the continuation...
>>>>
>>>> I am really lost here. Why is any of this complicated encoding
>stuff
>>>> better than interning? Interning takes one line of code, is
>>>incredibly
>>>> cheap (one dict lookup per call site and function definition), and
>it
>>>> lets you check any possible signature (even complicated ones
>>>involving
>>>> memoryviews) by doing a single-word comparison. And best of all,
>you
>>>> don't have to think hard to make sure you got the encoding right.
>;-)
>>>>
>>>> On a 32-bit system, pointers are smaller than a size_t, but more
>>>> expressive! You can still do binary search if you want, etc. Is the
>>>> problem just that interning requires a runtime calculation? Because
>I
>>>> feel like C users (like numpy) will want to compute these
>compressed
>>>> codes at module-init anyway, and those of us with a fancy compiler
>>>> capable of computing them ahead of time (like Cython) can instruct
>>>> that fancy compiler to compute them at module-init time just as
>>>> easily?
>>>
>>>Good question.
>>>
>>>The primary disadvantage of interning that I see is memory locality.
>I
>>>suppose if all the C-level caches of interned values were co-located,
>>>this may not be as big of an issue. Not being able to compare against
>>>compile-time constants may thwart some optimization opportunities,
>but
>>>that's less clear.
>
>I would like to see some demonstration of this. E.g., you can run this:
>
>echo -e '#include <string.h>\nint main(int argc, char ** argv) {
>return strcmp(argv[0], "a"); }' | gcc -S -x c - -o - -O2 | less
>
>Looks to me like for a short, known-at-compile-time string, with
>optimization on, gcc implements it by basically sticking the string in
>a global variable and then using a pointer... (If I do argv[0] ==
>(char *)0x1234, then it places the constant value directly into the
>instruction stream. Strangely enough, it does *not* inline the
>constant value even if I do memcmp(&argv[0], "\1\2\3\4", 4), which
>should be exactly equivalent...!)

Right. So:

 - With keys you have the *option* of hardcoding them, and then they will be in the instruction stream (rather than the instruction stream containing, essentially, a pointer to the key).

 - With interned, you always have a pointer you must dereference in the instruction stream.

>
>I think gcc is just as likely to stick a bunch of
>  static void * interned_dd_to_d;
>  static void * interned_ll_to_l;
>next to each other in the memory image as it is to stick a bunch of
>equivalent manifest constants. If you're worried, make it static void
>* interned_signatures[NUM_SIGNATURES] -- then they'll definitely be
>next to each other.
>
>>>It also requires coordination common repository, but I suppose one
>>>would just stick a set in some standard module (or leverage Python's
>>>interning).
>>
>> More problems:
>>
>> 1) It doesn't work well with multiple interpreter states. Ok, nothing
>works with that at the moment, but it is on the roadmap for Python and
>we should not make it worse.
>
>This isn't a criticism, but I'd like to see a reference to the work in
>this direction! My impression was that it's been on the roadmap for
>maybe a decade, in a really desultory fashion:
>http://docs.python.org/faq/library.html#can-t-we-get-rid-of-the-global-interpreter-lock
>So if it's actually happening that's quite interesting.

I wasn't referring to the GIL, but multiple interpreters (where objects from one cannot be used in another). PEP3121 mentions it as one of the things it prepares for. Perhaps that didn't go anywhere, I don't really know.

>
>> You basically *need* a thread safe store separate from any python
>interpreter; though pythread.h does not rely on the interpreter state;
>which helps.
>
>Anyway, yes, if you can't rely on the interpreter than you'd need some
>place to store the intern table, but I'm not sure why this would be a
>problem (in Python 3.6 or whenever it becomes relevant).
>
>> 2) you end up with the known comparison values in read-write memory
>segments rather than readonly segments, which is probably worse on
>multicore systems?
>
>Is it? Can you elaborate? Cache ping-ponging is certainly bad, but
>that's when multiple cores are writing to the same cache line, I can't
>see how the TLB flags would matter.
>
>I guess the problem would be if you also have some other data in the
>global variable space that you write to constantly, and then it turned
>out they were placed next to these read-only comparison values in the
>same cache line?

You may be right, my understanding of this is actually too vague. Anyway, if the constant ends up in the instruction stream it is at least one less register load from data cache with the key approach?

Dag

>
>> I really think that anything that we can do to make this near-c-speed
>should be done; none of the proposals are *that* complicated.
>
>I agree, but I object to codifying the waving on dead chickens. :-)
>
>> Using keys, NumPy can in the C code choose to be slower but more
>readable; but using interned string forces cython to be slower, cython
>gets no way of choosing to go faster. (to the degree that it has an
>effect; none of these claims were checked)
>
>I think the only slowdown we know of is a few dict lookups at module
>load time.
>
>- N
>_______________________________________________
>cython-devel mailing list
>cython-devel at python.org
>http://mail.python.org/mailman/listinfo/cython-devel

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.