Exploiting Dual Cores with Py_NewInterpreter's separated GIL?

robert no-spam at no-spam-no-spam.invalid
Thu Nov 9 15:37:53 EST 2006


Paul Boddie wrote:
> robert wrote:
>> Shane Hathaway wrote:
>>> of multiple cores.  I think Python only needs a nice way to share a
>>> relatively small set of objects using shared memory.  POSH goes in that
>>> direction, but I don't think it's simple enough yet.
>>>
>>> http://poshmodule.sourceforge.net/
>> Interesting - a solution possibly a little faster than pickling, but maybe only in selected
>> situations. I have already made experiments with pickling through shared memory.
> 
> What did you discover in your experiments? I'd certainly suspect that
> pickling is going to add an unacceptable degree of overhead in the kind
> of application you're attempting to write (using a shared data
> structure with random access properties), and I'll admit that I haven't
> really had to deal with that kind of situation when using my own
> pickle-based parallelisation solutions (which involve communicating
> processes).
> 
>> With "x = posh.share(x)" an object tree will be (deep-)copied to shared mem (as far as
>> objects fulfil some conditions http://poshmodule.sourceforge.net/posh/html/node6.html: is
>> this true for numpy arrays?)
> 
> My impression is that POSH isn't maintained any more and that work was
> needed to make it portable, as you have observed. Some discussions did
> occur on one of the Python development mailing lists about the
> possibility of using shared memory together with serialisation
> representations faster than pickles (which also don't need to be
> managed as live objects by Python), and I think that this could be an
> acceptable alternative.
> 
>> Every object to be inserted into the hot tunnel object tree has to be copied in that same style.
>> Thus it's ~pickling, but somewhat easier to use.
> 
> If my understanding of "hot tunnel object tree" is correct, you're
> really wanting fast access involving mutual exclusion to some shared
> data structure. At this point it becomes worth considering some kind of
> distributed object technology (or even a database technology) where you
> have to initialise the data structure and then communicate with an
> object (or database) to perform operations on it, all for reasons I'll
> explain shortly.
> 
> In your ideal situation, you say that you'd have the data structure in
> the same address space as a number of threads, and each thread would be
> able to perform some algorithm on the data structure, but the pattern
> of accessing the structure isn't an insignificant concern even where
> you can assume that the overheads are generally low: reading and
> writing things might be fast in the same address space, but if the
> nature of access involves lots of locking, you'll incur quite a penalty
> anyway. In other words, if each read or write to the structure involves
> acquiring a lock for that operation in isolation, this could
> significantly diminish performance, whereas if you can guarantee that
> the granularity of locking is more coarse than having to occur upon
> each read or write access - that there exist some high-level operations
> that require consistency within the data structure - then reasonable
> performance might be maintained.
> 
> However, if the pattern of concurrent access is restricted to coarse
> operations, where some entity has exclusive access to a potentially
> large dataset, and where the overhead of the communication of inputs
> and outputs to and from that entity is low in comparison to the cost of
> performing such coarse operations, and where such operations are
> themselves performed infrequently, then such a pattern coincides with
> classic database or distributed object scenario. In other words, you
> implement the operations acting on the data structure in a distributed
> object (or as a database query or operation) and then invoke such
> operations from separate processes.
> 
> I hope this makes some sense. Generally, demands for high concurrent
> performance using threads often ignore other important properties such
> as reliability and scalability. Certainly, threads have an important
> place - classically, this involved maintaining responsiveness in
> graphical user interfaces - but even then, the background threads were
> often detached and not competing for the same resources as the
> foreground threads servicing the event loop. The only kinds of
> situation I can think of right now which might benefit from uninhibited
> random access to shared data structures might be things like
> simulations or large-scale multi-user games, where an appropriate data
> architecture cannot be decided in advance, but even then there are
> approaches which attempt to mitigate the seemingly unpredictable nature
> of access to shared data.

Some of those thoughts make sense when one wants to trick the single GIL a little.
But the length of that text tells us again that we should at least make refcounting thread-safe (within access restrictions) - or get rid of refcounting - or go the full path towards a GIL-free Python.

Besides some programming costs, the speed cost would be that of a few LOCK INC instructions. Rumor says it's not much - and almost nothing when not doing SMP threading. I'd like to see experimental data regarding the speed, but I have found nothing on the web so far.
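
To see where that cost would be paid, here is a minimal hedged sketch of what a thread-safe refcount could look like, using Win32 interlocked intrinsics (which compile to LOCK-prefixed instructions on x86). The names obj_t/obj_incref/obj_decref are made up for illustration - this is not CPython's actual Py_INCREF/Py_DECREF:

--------------------
/* Hypothetical sketch: an atomically refcounted object. The interlocked
   calls are where the LOCK INC / LOCK DEC cost discussed above is paid. */
#include <windows.h>

typedef struct obj {
    volatile LONG ob_refcnt;            /* refcount, updated atomically */
    void (*ob_dealloc)(struct obj *);   /* destructor for this object */
    /* ... payload ... */
} obj_t;

static void obj_incref(obj_t *o) {
    InterlockedIncrement(&o->ob_refcnt);            /* LOCK INC */
}

static void obj_decref(obj_t *o) {
    /* InterlockedDecrement returns the new value, so the zero test
       and the decrement happen as one atomic step. */
    if (InterlockedDecrement(&o->ob_refcnt) == 0)
        o->ob_dealloc(o);
}
--------------------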

And, optimistically, all the lock-free basic data structures (obmalloc's freelists, dicts, queues, ...) could be built up for further optimizations:
http://64.233.183.104/search?q=cache:KUXW1Ya2duwJ:softwareforums.intel.com/ids/board/message%3Fboard.id%3D42%26message.id%3D68+SMP+%22LOCK+INC%22+speed&hl=de&gl=de&ct=clnk&cd=2
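
For illustration, a hedged sketch of such a lock-free freelist: a LIFO built on a compare-and-swap loop, in the spirit of the Intel thread above. All names here are hypothetical, and a production version would additionally need ABA protection (version counters, as in Win32's own SList API):

--------------------
/* Hypothetical lock-free freelist sketch (Treiber-style stack). */
#include <windows.h>

typedef struct node { struct node *next; } node_t;
static node_t * volatile freelist = NULL;

static void fl_push(node_t *n) {
    node_t *head;
    do {
        head = freelist;
        n->next = head;
        /* retry if another thread changed the head in the meantime */
    } while (InterlockedCompareExchangePointer(
                 (PVOID volatile *)&freelist, n, head) != head);
}

static node_t *fl_pop(void) {
    node_t *head;
    do {
        head = freelist;
        if (head == NULL)
            return NULL;   /* freelist empty - caller falls back to malloc */
        /* NOTE: reading head->next here is what makes ABA protection
           necessary in a real implementation. */
    } while (InterlockedCompareExchangePointer(
                 (PVOID volatile *)&freelist, head->next, head) != head);
    return head;
}
--------------------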


Little test:

--------------------
#include <stdio.h>
#include <time.h>

volatile int x = 0;   /* the shared counter being incremented */
clock_t c0, c1;
time_t t0, t1;

int main(int argc, char **argv) {
  int sum = 0, i, count = 1000000000;
  t0 = time(NULL);
  c0 = clock();
  for (i = 0; i < count; ++i) {
      __asm lock inc x;   /* MSVC inline asm: atomic (SMP-safe) increment */
      // __asm inc x;     /* plain increment, for comparison */
      sum += x;           /* keeps the loop from being optimized away */
  }
  c1 = clock();
  t1 = time(NULL);
  printf("clocks: %ld\n", (long)(c1 - c0));
  printf("secs:   %ld\n", (long)(t1 - t0));
  return sum;
}


--------------
0040101F   mov         eax,3B9ACA00h
13:     for (i = 0; i < count; ++i) {
14:         __asm lock inc x;
00401024   lock inc    dword ptr [_x (00408a00)]
15:         sum += x;
0040102B   mov         edx,dword ptr [_x (00408a00)]
00401031   add         esi,edx
00401033   dec         eax
00401034   jne         main+24h (00401024)
16:     }
---------------

Results on a uniprocessor (UP) PIII:

INC version:
clocks: 7520
secs:   7

LOCK INC version:
clocks: 36632
secs:   36
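
(Assuming CLOCKS_PER_SEC is 1000 here: 36.6 s / 10^9 iterations is roughly 37 ns per LOCK INC, versus roughly 7.5 ns per plain INC - about a 5x penalty per increment, even on this UP machine.)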


It's probably not much...

-robert


