[pypy-dev] Would the following shared memory model be possible?

Paolo Giarrusso p.giarrusso at gmail.com
Thu Jul 29 22:53:22 CEST 2010


On Thu, Jul 29, 2010 at 15:15, William Leslie
<william.leslie.ttg at gmail.com> wrote:
> On 29 July 2010 18:55, Maciej Fijalkowski <fijall at gmail.com> wrote:
>> On Thu, Jul 29, 2010 at 10:50 AM, William Leslie
>> <william.leslie.ttg at gmail.com> wrote:
>>> If task X expects that task Y will mutate some object it has, it needs
>>> to go back to the source for every read. This means that if you do use
>>> mutation of some shared object for communication, it needs to be
>>> synchronised before every access. What this means for us is that every
>>> read from a possibly mutable object requires an acquire, and every
>>> write requires a release. It's as if every reference in the program is
>>> implemented with a volatile pointer. Even if the object is never
>>> mutated, there can be a lot of unnecessary bus chatter waiting for
>>> MESI to tell us so.
>>>
>>
>> I do agree there is an overhead. Can you provide some data how much
>> this overhead is? Python is not a very simple language and a lot of
>> things are complex and time consuming, so I wonder how it compares to
>> locking per object.

Below I try to prove that locking is still too expensive, even for an
interpreter.
Also, for many things, the clever optimizations you already do allow
keeping those costs small, at least for the average case / fast path. I
have been taught to consider clever optimizations as required. With JIT
compilation, specialization and shadow classes, are method calls much
more expensive than a guard and (if no inlining is done, as might
happen in PyPy in the worst case for big functions) an assembler
'call' opcode, and possibly some stack shuffling? How many cycles is
that? How much more expensive is that than optimized JavaScript (which
is not far from C, the only difference being the guard)? You can assume
the case of plain calls without keyword arguments and so on (and with
inlining, keyword arguments should pay no runtime cost).
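
To make the comparison concrete, here is a rough sketch in C of what
such a specialized call site can boil down to. This is not real PyPy
(or V8) output; all the names (struct map, meth_fast, call_site) are
invented for illustration. The point is just: one shadow-class guard,
then a direct call.

#include <stdio.h>

struct map { int id; };                        /* shadow class / hidden class */
struct pyobj { struct map *map; long field; };

/* the specialized target the JIT would call directly */
static long meth_fast(struct pyobj *self, long x) { return self->field + x; }

/* what 'obj.meth(x)' can compile to once the guard is in place */
static long call_site(struct pyobj *obj, long x, struct map *expected_map) {
    if (obj->map != expected_map)   /* the guard: one compare and branch */
        return -1;                  /* a JIT would deoptimize / fall back here */
    return meth_fast(obj, x);       /* plain call, no attribute lookup */
}

int main(void) {
    struct map m = { 1 };
    struct pyobj o = { &m, 41 };
    printf("%ld\n", call_site(&o, 1, &m));     /* prints 42 */
    return 0;
}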

Also, the free threading patches which tried to remove the GIL slowed
CPython down unacceptably (IIRC 2x) back in the old CPython 1.5 days.
And I don't think they tried to lock every object, just what actually
needed locking (which included refcounts).

> It *is* locking per object, but you also spend time looking for the
> data if someone else has invalidated your cache line.

That overhead is already there in locking per object, I think (locking
can be much more expensive than a cache miss, see below).
However, locking per object does not prevent race conditions unless
you make atomic regions as big as actually needed (locking per
statement does not work); it just prevents data races (a conflict
between a write and a memory operation that are not synchronized with
each other). And you can't extend atomic regions indefinitely, as that
implies starvation. Even software transactional memory requires the
programmer to declare which regions have to be atomic.
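
To make the distinction concrete, here is a minimal sketch in C with
pthreads (nothing interpreter-specific, all names invented): every
single access to the shared field is protected by the object's lock, so
there are no data races, yet concurrent increments can still be lost
because the whole read-modify-write is not one atomic region.

#include <pthread.h>
#include <stdio.h>

struct obj { pthread_mutex_t lock; long value; };

static long locked_read(struct obj *o) {
    pthread_mutex_lock(&o->lock);
    long v = o->value;
    pthread_mutex_unlock(&o->lock);
    return v;
}

static void locked_write(struct obj *o, long v) {
    pthread_mutex_lock(&o->lock);
    o->value = v;
    pthread_mutex_unlock(&o->lock);
}

static struct obj shared = { PTHREAD_MUTEX_INITIALIZER, 0 };

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        locked_write(&shared, locked_read(&shared) + 1);  /* racy as a whole */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld (200000 expected, usually less)\n", shared.value);
    return 0;
}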

Given the additional cost (discussed elsewhere in this mail), and
given that there is not much benefit, I think locking-per-object is
not worth it (but I'd still love to know more about why the effort on
python-safethread was halted).

> Come to think of it, that isn't as bad as it first seemed to me. If
> the sender never mutates the object, it will Just Work on any machine
> with a fairly flat cache architecture.

You first wrote: "The alternative, implicitly writing updates back to
memory as soon as possible and reading them out of memory every time,
can be hundreds or more times slower."
This is not "locking per object", it is just semantically close to it,
and becomes equivalent if only one thread has a reference at any time.

Performance-wise, though, they are very different, and each of them is
better for some usages. In the Linux kernel (which I consider quite
authoritative here, on what you can do in C) both are used for valid
performance reasons, and a JIT compiler could choose between them.
Below, I first describe the two alternatives mentioned; finally, I get
to the combination needed for the "unshared" case.

- What you first described (memory barriers or uncached R/Ws) can be
faster for small updates, depending on the access pattern. An uncached
memory area does not disturb other memory traffic, unlike memory
barriers, which are global; but I don't think an unprivileged process
is allowed to obtain such an area (by modifying MSRs or PATs on x86).

Cost: each memory op goes to main memory and is thus as slow as a
cache miss (hundreds of clock cycles). Naively reading a Python field
may take several such reads, but a JIT compiler can bring it down to
the equivalent of a C access with shadow classes and specialization,
and that would pay off even more here (V8 does it for JavaScript and I
think PyPy already does most or all of it).
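
As a sketch of the access pattern being costed here (C11 atomics,
purely illustrative, not interpreter code): every read is an acquire
load and every write a release store, so each access to a shared field
pays for ordering and, on a miss, for the round trip through the
memory hierarchy.

#include <stdatomic.h>

struct shared_field { _Atomic long value; };

static long read_field(struct shared_field *f) {
    /* every read is an acquire load */
    return atomic_load_explicit(&f->value, memory_order_acquire);
}

static void write_field(struct shared_field *f, long v) {
    /* every write is a release store, published right away */
    atomic_store_explicit(&f->value, v, memory_order_release);
}

int main(void) {
    struct shared_field f = { 0 };
    write_field(&f, 1);
    return (int)read_field(&f) - 1;   /* exits 0 */
}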

- Locking per object (monitors): slow upfront, but you can then do each
r/w out of your cache, so if the object is kept locked for some time,
this is more efficient.
How slow? A system call to perform locking can cost tens of thousands
of cycles. But Java locks, and nowadays even Linux futexes (and
Windows locks), perform everything in userspace in as many cases as
possible (the slow path is taken only when there is actual contention
on the lock, which is uncommon with locking-per-object). I won't sum up
the literature on this here.
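
Just to illustrate the fast path, a hedged sketch along the lines of
Drepper's "Futexes Are Tricky" (Linux-only; my_lock/my_unlock are
invented names, not any real API): the uncontended case is a single
compare-and-swap in userspace, and the futex syscall is reached only
under actual contention.

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters */
static void my_lock(atomic_int *l) {
    int expected = 0;
    if (atomic_compare_exchange_strong(l, &expected, 1))
        return;                          /* fast path: one CAS, no syscall */
    while (atomic_exchange(l, 2) != 0)   /* slow path: actual contention */
        syscall(SYS_futex, l, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void my_unlock(atomic_int *l) {
    if (atomic_exchange(l, 0) != 1)      /* someone may be sleeping */
        syscall(SYS_futex, l, FUTEX_WAKE, 1, NULL, NULL, 0);
}

int main(void) {
    atomic_int lock = 0;
    my_lock(&lock);      /* uncontended: stays entirely in userspace */
    my_unlock(&lock);
    return 0;
}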

- Since no contention is expected here, a simple pair of memory
barriers is enough on send/receive (a write barrier for the send, a
read barrier for the receive, IIRC). Allowing read-only access from
another thread already brings us back to a mixture of the above two
solutions. However, in the first solution, using memory barriers, you'd
need a write barrier for every write, but you could save on read
barriers.
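
A sketch of that send/receive pair in C11 terms (illustrative only,
not PyPy code): the sender fills the payload with plain stores and
publishes it with one release store; the receiver does one acquire
load and then reads the payload with plain loads, so no barrier is
paid per payload access.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static long payload;                 /* written with plain stores   */
static _Atomic int ready;            /* the only synchronizing word */

static void *sender(void *arg) {
    (void)arg;
    payload = 42;                                             /* plain write */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* "send" */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                      /* "receive" */
    printf("%ld\n", payload);                                  /* plain read */
    pthread_join(t, NULL);
    return 0;
}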
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/


