Parallelization on multi-CPU hardware?

Bengt Richter bokr at oz.net
Tue Oct 12 01:14:25 EDT 2004


On Tue, 12 Oct 2004 03:46:15 GMT, Bryan Olson <fakeaddress at nowhere.org> wrote:
>Piet van Oostrum wrote:
>
> >>>>>>Bryan Olson <fakeaddress at nowhere.org> (BO) wrote:
> > BO> I'm not really up-to-date on modern multi-processor support.
> > BO> Back in grad school I read some papers on cache coherence, and I
> > BO> don't know how well the problems have been solved.  The issue
> > BO> was that a single processor can support a one-instruction lock
> > BO> (in the usual no-contention case) simply by supplying an
> > BO> uninterruptible read-and-update instruction, but on a multi-
> > BO> processor, all the processors have to respect the lock.
> >
> > For multiprocessor systems, uninterruptible isn't enough (depending on
> > your definition of uninterruptible). Memory operations from other processors
> > also shouldn't interleave with the instruction. Modern processors usually
> > have a 'test and set' or 'compare and swap' instruction for this purpose,
> > which lock memory for the duration of the operation.
>
>I'm with you that far.  The reason I mentioned cache coherence
>is that locking memory isn't enough on MP systems.  Another
>processor may have the value in local cache memory.  For all the
>processors to respect the lock, they have to communicate the
>locking before any processor can update the address.
>
>If every thread treats a certain memory address as a lock, each can
>access it only by instructions that go to main memory. But then
>locking every object can have a devastating effect on speed.
I don't know if smarter instructions are available now, but if a lock
is already in the locked state, there is no point in writing the locked
value through to memory, which is what a naive atomic swap does; the
returned value would just tell you to try again. If n-1 out of n
processors were doing that to a queue lock, waiting for something to
become dequeuable, it would be a huge waste of bus bandwidth, maybe
interfering with DMA I/O. I believe the strategy is (or was, if better
instructions are available now) to read the lock value passively in a
short spin, so that each spinning processor is just looking at its own
cached copy. When the lock is released, the release is done by a
write-through; everyone's cached copy is invalidated and reloaded, and
potentially many processors see the lock as free. At that point they
all try their atomic swap instructions, writing locked status through
to memory, and only one gets the unlocked value back from its swap. As
soon as the rest see locked again, they go back to passive spinning.
The spins don't go on forever, since multiple threads on one CPU may be
waiting for different things; it's a balance between context-switch
overhead (within a CPU) and a short spin-lock wait. Probably
lock-dependent, and maybe automatically tunable.
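
In pseudo-C, that scheme (often called test-and-test-and-set) might
look like the sketch below. Again the C11 atomics stand in for the
hardware instructions, and SPIN_LIMIT with the sched_yield() fallback
is just my illustration of the spin-vs-context-switch balance, not a
tuned value:

#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT 1000  /* illustrative, not tuned */

typedef struct { atomic_int locked; } ttas_lock;

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* Passive phase: plain atomic loads hit the local cache
         * line, so a waiting CPU generates no bus write traffic
         * while the lock is held elsewhere. */
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (atomic_load(&l->locked) == 0) {
                /* Looks free: now race everyone else who saw the
                 * release.  Only one swap gets the 0 back. */
                if (atomic_exchange(&l->locked, 1) == 0)
                    return;
            }
        }
        sched_yield();  /* spun too long: give up the CPU instead */
    }
}

void ttas_release(ttas_lock *l) {
    /* The releasing write-through invalidates the other CPUs'
     * cached copies, which is what wakes the passive spinners. */
    atomic_store(&l->locked, 0);
}
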

The big MP cache hammer comes if multiple CPUs naively dequeue thread
work first-come-first-served, so that they keep reloading working sets
that were already warm in another CPU's cache. I suspect all of the
above is long-ago history by now. It's been a while ;-)
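
The usual fix, as I understand it, is some affinity: give each CPU its
own run queue and have it prefer locally queued work, stealing from
another queue only when idle. A bare sketch (my own types and names;
per-queue locking elided to keep it short):

#include <stddef.h>

typedef struct task { struct task *next; } task;

typedef struct { task *head; } cpu_queue;  /* one per CPU */

/* Prefer the local queue, so a thread tends to rerun where its
 * working set is still cached; pay the cache-reload cost only
 * when idle and forced to steal from a victim queue. */
task *next_task(cpu_queue *local, cpu_queue *victim) {
    if (local->head) {
        task *t = local->head;
        local->head = t->next;
        return t;
    }
    if (victim->head) {
        task *t = victim->head;
        victim->head = t->next;
        return t;
    }
    return NULL;  /* nothing runnable anywhere */
}
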

Regards,
Bengt Richter


