Exploiting Dual Core's with Py_NewInterpreter's separated GIL ?

Tue Nov 7 03:19:49 EST 2006

"Martin v. Löwis" <martin at v.loewis.de> writes:
> Ah, but in the case where the lock# signal is used, it's known that
> the data is not in the cache of the CPU performing the lock operation;
> I believe it is also known that the data is not in the cache of any
> other CPU. So the CPU performing the LOCK INC sequence just has
> to perform two memory cycles. No cache coherency protocol runs
> in that case.

How can any CPU know in advance that the data is not in the cache of
some other CPU?

> But even when caching is involved, I don't see why there should be
> more than three memory cycles. The MESI "protocol" really consists
> only of two pins: HIT# and HITM#; the actual implementation is just
> in keeping the state for each cache line, and in snooping. The
> CPUs don't really "communicate". Instead, if one processor tries
> to fill a cache line, the others snoop the read, and assert either
> HIT# (when they have not modified their cache lines) or HITM#
> (when they have modified their cache lines). Assertion of
> these signals is also immediate.

OK, this is logical, but it already implies a cache miss, which costs
many dozens of cycles (100?).  But this case may be uncommon, since one
hopes that cache misses are relatively rare.
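
To put rough numbers on that, here is a back-of-the-envelope sketch.
The 100-cycle penalty is only my guess from above, and the 1-cycle hit
cost and the miss rates are made-up illustrative values, not
measurements:

```python
# Expected cycles per memory access = hit cost + miss rate * miss penalty.
# All constants here are illustrative assumptions, not measured values.
HIT_CYCLES = 1        # assumed cost of a cache hit
MISS_PENALTY = 100    # assumed extra cost of a cache miss (the "100?" guess)


def expected_cycles(miss_rate, hit_cycles=HIT_CYCLES, miss_penalty=MISS_PENALTY):
    """Average cycles per access for a given miss rate (0.0 to 1.0)."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty


if __name__ == "__main__":
    for rate in (0.01, 0.10, 0.50):
        print(f"miss rate {rate:.0%}: ~{expected_cycles(rate):.0f} cycles per access")
```

Under these assumptions the average stays close to the hit cost as long
as misses really are rare, and only gets bad when the miss rate is high.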

> The worst case would be that one processor performs a LOCK INC,
> and another processor has the modified value in its cache line.
> So it needs to first flush the cache line, before the other
> processor can modify the memory. If the memory is not cached
> yet in another processor, the MESI protocol does not involve
> additional penalties.

I think for Python refcounts this case must occur quite frequently
since there are many Python objects (small integers, None, etc.)
whose refcounts get modified extremely often.  
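
Here is a toy model of what I mean. It is my own simplification, not
real hardware: a single cache line, hypothetical names (CacheLine,
locked_inc, simulate), and it only counts events rather than cycles.
Each locked increment needs the line in Modified state, and if another
CPU holds it Modified, that copy has to be written back first:

```python
# Toy MESI model for ONE cache line shared by several CPUs.
# States: "M" (modified), "E" (exclusive), "S" (shared), "I" (invalid).
# This is a sketch for counting coherency events, not a cycle-accurate
# simulation; all names here are hypothetical.


class CacheLine:
    def __init__(self, ncpus):
        self.state = ["I"] * ncpus  # per-CPU state of this line
        self.misses = 0             # line fills (the CPU did not have the line)
        self.writebacks = 0         # a dirty copy elsewhere had to be flushed

    def locked_inc(self, cpu):
        """Model a read-modify-write (such as LOCK INC) by `cpu`."""
        if self.state[cpu] in ("M", "E"):
            # Already owned exclusively: no bus traffic is needed.
            self.state[cpu] = "M"
            return
        if self.state[cpu] == "I":
            self.misses += 1
        # The other caches snoop the request and give up their copies.
        for other in range(len(self.state)):
            if other == cpu:
                continue
            if self.state[other] == "M":
                # The HITM#-style case: flush the dirty line first.
                self.writebacks += 1
            self.state[other] = "I"
        self.state[cpu] = "M"


def simulate(cpu_sequence, ncpus=2):
    """Run locked increments in the given CPU order; return (misses, writebacks)."""
    line = CacheLine(ncpus)
    for cpu in cpu_sequence:
        line.locked_inc(cpu)
    return line.misses, line.writebacks


if __name__ == "__main__":
    print("one CPU, 2000 incs:       ", simulate([0] * 2000))
    print("two CPUs alternating, 2000:", simulate([0, 1] * 1000))
```

In this model one CPU doing 2000 increments takes 1 miss and no
write-backs, while two CPUs alternating on the same line take 2000
misses and 1999 write-backs, i.e. practically every increment hits the
worst case. Since the model contains only writes, the line is only ever
in M or I; the E and S branches are there for completeness.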

IIRC, the SPJ paper that I linked claims that lock-free protocols
outperform traditional lock-based ones even with just two processors.
But maybe things are better with a dual core processor (shared cache)
than with two separate packages.