2.6, 3.0, and truly independent interpreters

Rhamphoryncus rhamph at gmail.com
Fri Oct 24 17:16:50 EDT 2008


On Oct 24, 3:02 pm, Glenn Linderman <v+pyt... at g.nevcal.com> wrote:
> On approximately 10/23/2008 2:24 PM, came the following characters from the
> keyboard of Rhamphoryncus:
>>
>> On Oct 23, 11:30 am, Glenn Linderman <v+pyt... at g.nevcal.com> wrote:
>>
>>>
>>> On approximately 10/23/2008 12:24 AM, came the following characters from
>>> the keyboard of Christian Heimes
>>>>
>>>> Andy wrote:
>>>> I'm very - not absolutely, but very - sure that Guido and the initial
>>>> designers of Python would have added the GIL anyway. The GIL makes
>>>> Python faster on single core machines and more stable on multi core
>>>> machines.
>
> Actually, the GIL doesn't make Python faster; it is a design decision that
> reduces the overhead of lock acquisition, while still allowing use of global
> variables.
>
> Using finer-grained locks has higher run-time cost; eliminating the use of
> global variables has a higher programmer-time cost, but would actually run
> faster and more concurrently than using a GIL. Especially on a
> multi-core/multi-CPU machine.

Those "globals" include classes, modules, and functions.  You can't
have *any* objects shared.  Your interpreters are entirely isolated,
much like processes (and we all start wondering why you don't use
processes in the first place.)
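
For illustration, here's a minimal sketch of that "just use processes"
alternative, using the multiprocessing module that landed in 2.6 (the
worker function and queue names are mine, not anything from this
thread):

    from multiprocessing import Process, Queue

    def worker(inbox, outbox):
        # Runs in a child process: a fully isolated interpreter.
        # Nothing is shared with the parent except the two queues.
        item = inbox.get()
        outbox.put(item * 2)

    if __name__ == '__main__':
        inbox, outbox = Queue(), Queue()
        p = Process(target=worker, args=(inbox, outbox))
        p.start()
        inbox.put(21)
        print(outbox.get())  # 42
        p.join()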

Or use safethread.  It imposes safe semantics on shared objects, so
you can keep your global classes, modules, and functions.  Still need
garbage collection though, and on CPython that means refcounting and
the GIL.
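
To make the refcounting point concrete, here's a sketch of the same
read-modify-write hazard using an ordinary Python counter in place of
a refcount field (the names are mine, and how often the race actually
fires depends on the interpreter and platform):

    import threading

    counter = 0

    def bump(n):
        global counter
        for _ in range(n):
            counter += 1  # LOAD / ADD / STORE: three steps, not atomic

    threads = [threading.Thread(target=bump, args=(100000,))
               for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    # Without a lock the total can come out below 400000 -- the same
    # lost-update problem a refcount would have if two threads ran
    # Py_INCREF/Py_DECREF concurrently without the GIL.
    print(counter)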


>> Another peeve I have is his characterization of the observer pattern.
>> The generalized form of the problem exists both in single-threaded
>> sequential programs, in the form of unexpected reentrancy, and in
>> message passing, in the form of unbounded CPU usage or an unbounded
>> number of pending messages.
>>
>
> So how do you get reentrancy in a single-threaded sequential program? I
> think only via recursion? Which isn't a serious issue for the observer
> pattern. If you add interrupts, then your program is no longer sequential.

Sorry, I meant recursion.  Why isn't it a serious issue for
single-threaded programs?  Just the fact that it's much easier to
handle when it does happen?
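
A minimal single-threaded sketch of what I mean (class and method
names are illustrative): an observer that mutates the subject from
inside the notification re-enters set() with no threads anywhere in
sight.

    class Subject(object):
        def __init__(self):
            self.observers = []
            self.value = 0

        def set(self, value):
            self.value = value
            # Iterate over a copy; observers may (un)subscribe mid-notify.
            for obs in list(self.observers):
                obs(self, value)

    class Damper(object):
        def __call__(self, subject, value):
            if value > 10:
                subject.set(10)  # re-enters Subject.set() recursively

    s = Subject()
    s.observers.append(Damper())
    s.set(15)       # set(15) -> Damper -> set(10), before set(15) returns
    print(s.value)  # 10; code following the inner set() sees surprising state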


>> Try looking at it on another level: when our CPU wants to read from a
>> bit of memory controlled by another CPU, it sends them a message
>> requesting they fetch it for us.  They send back a message containing
>> that memory.  They also note that we have it, in case they want to
>> modify it later.  We also note where we got it, in case we want to
>> modify it (and not wait for them to do modifications for us).
>>
>
> I understand that level... one of my degrees is in EE, and I started college
> wanting to design computers (at about the time the first microprocessor chip
> came along, and they, of course, have now taken over). But I was side-lined
> by the malleability of software, and have mostly practiced software during
> my career.
>
> Anyway, that is the level that Herb Sutter was describing in the Dr. Dobb's
> articles I mentioned. And the overhead of doing that at the level of a cache
> line is high, if there is lots of contention for particular memory locations
> between threads running on different cores/CPUs. So to achieve concurrency,
> you must not only limit explicit software locks, but must also avoid memory
> layouts where data needed by different cores/CPUs are in the same cache
> line.

I suspect they'll end up redesigning the caching to use a size and
alignment of 64 bits (or smaller).  Same cache line size, but with
masking.

You still need to minimize contention of course, but that should at
least be more predictable.  Having two unrelated mallocs contend could
suck.
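
As a rough sketch of the layout trick at the Python level (the 64-byte
line size is an assumption about typical x86 hardware, and the struct
is illustrative): pad each worker's counter out to its own cache line,
so unrelated updates can't contend for the same line.

    import ctypes

    CACHE_LINE = 64  # assumed cache line size in bytes

    class PaddedCounter(ctypes.Structure):
        # One c_long plus padding, so sizeof(PaddedCounter) == CACHE_LINE.
        _fields_ = [("value", ctypes.c_long),
                    ("pad", ctypes.c_char *
                            (CACHE_LINE - ctypes.sizeof(ctypes.c_long)))]

    counters = (PaddedCounter * 4)()  # one line per worker: no false sharing
    counters[0].value += 1
    print(ctypes.sizeof(PaddedCounter))  # 64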


>> Message passing vs shared memory isn't really a yes/no question.  It's
>> about ratios, usage patterns, and tradeoffs.  *All* programs will
>> share data, but in what way?  If it's just the code itself you can
>> move the cache validation into software and simplify the CPU, making
>> it faster.  If the shared data is a lot more than that, and you use it
>> to coordinate accesses, then it'll be faster to have it in hardware.
>>
>
> I agree there are tradeoffs... unfortunately, the hardware architectures
> vary, and the languages don't generally understand the hardware. So then it
> becomes an OS API, which adds the overhead of an OS API call to the cost of
> the synchronization... It could instead be (and in clever applications is) a
> non-portable assembly-level function that wraps an OS locking or waiting
> API.

In practice I highly doubt we'll see anything that doesn't extend
traditional threading (posix threads, whatever MS has, etc).
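
That API-call overhead is easy to put a rough number on from Python,
where every lock operation already goes through the interpreter and,
underneath it, the platform's thread library.  A machine-dependent
microbenchmark:

    import timeit

    # Time a million uncontended acquire/release pairs.
    print(timeit.timeit("lock.acquire(); lock.release()",
                        setup="import threading; lock = threading.Lock()",
                        number=1000000))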


> Nonetheless, while putting the shared data accesses in hardware might be
> more efficient per unit operation, there are still tradeoffs: A software
> solution can group multiple accesses under a single lock acquisition; the
> hardware probably doesn't have enough smarts to do that. So it may well
> require many more hardware unit operations for the same overall concurrently
> executed function, and the resulting performance may not be any better.

Speculative ll/sc? ;)
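
The grouping point is worth a sketch, though (all names here are
illustrative): one acquisition amortized over a batch of accesses,
versus paying the round-trip on every access, which is the best the
hardware can do on its own.

    import threading

    lock = threading.Lock()
    table = {}

    def update_each(items):
        # One lock round-trip per access: more overhead, more contention.
        for key, value in items:
            with lock:
                table[key] = value

    def update_batch(items):
        # A single acquisition covering the whole group of accesses.
        with lock:
            for key, value in items:
                table[key] = value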


> Sidestepping the whole issue, by minimizing shared data in the application
> design, avoiding not only software lock calls but also hardware cache
> contention, is going to provide the best performance... it isn't the things
> you do efficiently that make software fast; it is the things you don't do
> at all.

Minimizing contention, certainly.  Minimizing the shared data itself
is iffier though.


