[pypy-dev] 1.4.1 threading

Tue Dec 28 02:47:14 CET 2010

Thanks for the RTFP pointer, the Triplett ref, hash tab via rt is
really interesting!

I think I'll have to read it again and again to get all out of it...

This brings me to a new question: how much is pypy python memory model
allowed to diverge from that of cpython?

For example, dict lookups, etc, could be relativistic in the absense
of explicit synchronization, that is, dict updates are not quaranteed
to propagate [fast] until [some] therad acquires a lock [or
something]. Of course the order of dict mutation has to be preserved.
Would this be a worthwhile departure from GIL-based monotonic world?

That is thread-safe code, that uses locks or some other mechanism
[known to pypy] would see operations timely and in order, while code
that doesn't synchronize its operations may see somewhat outdated
shared state, as if threads executed in slightly different global
order.

About profiling, how do I distinguish between attribute vs globals vs
builtins lookups in the profile?

d.

On 27 December 2010 12:50, Paolo Giarrusso <p.giarrusso at gmail.com> wrote:
> On Mon, Dec 27, 2010 at 19:54, Dima Tisnek <dimaqq at gmail.com> wrote:
>> what do you think of liburcu? lttng.org/urcu
>
> I've studied RCU in the Linux kernel (from which URCU derives) and
> thought for a long time of using it for this problem. In short, once
> you have GC, RCU (as in the Linux kernel) becomes (almost) trivial,
> because RCU is almost entirely about how to delay freeing objects.
> The RCU-GC connection is also mentioned by Documentation/RCU/RTFP.txt
> in the Linux sources.
>
> So, you need to just read about persistent data structures, and
> remember that persistence is not the point - data structures which are
> just almost persistent can also be OK, if you get them right. If you
> replace a value in a dictionary, for instance, you don't necessarily
> need to keep the old version around, as long as you the value fits in
> a memory word so that it can be atomically replaced.
> http://hackerboss.com/copy-on-write-101-part-2-putting-it-to-use/
>
> Speaking of libraries, you might want to look at this for the atomic
> primitives you'll need:
> www.hpl.hp.com/research/linux/atomic_ops/
> I'm more used to the primitives of the Linux kernel, I remember I
> didn't fully like that library, but IIRC it was a matter of taste -
> the library is from Hans Boehm, which makes it trustworthy.
>
>> p.s. if/when I get around to profiling pypy, which is something I want
>> to do, I'll be sure to share the results here.
>
> Great, thanks!
>
>> d.
>>
>> On 27 December 2010 10:57, Paolo Giarrusso <p.giarrusso at gmail.com> wrote:
>>> On Mon, Dec 27, 2010 at 07:00, Dima Tisnek <dimaqq at gmail.com> wrote:
>>>> (dan) random binary trees: O(log2(n)) is 7.2 for builtins, though the
>>>> constant factor for locks, etc, might make it worthwhile
>>>
>>> "7.2"? What is that? Do you mean 7.2=log2(no. of builtins)?
>>>
>>>> (paolo) non blocking hash maps: memory barriers can be quite costly too
>>>
>>> Well, you just _cannot_ avoid them, whatever is your fancy data structure.
>>> You can only move most of the cost (or all of it) to write operations
>>> - which is helpful if reads are most common, a case which you discuss
>>> in your mail.
>>>
>>> == Why memory barriers are required (in more detail) ==
>>> If you use persistent data structures, also called copy-on-write or
>>> purely functional (and that's the most obvious way to use a tree in a
>>> parallel setting), the whole synchronization problem is reduced to
>>> exchanging a pointer (to the root of the tree) between a writer and a
>>> reader thread.
>>> Let's consider this in the Java memory model (the state of the art).
>>> In this model the writer thread has to use an expensive StoreLoad
>>> barrier, i.e. mfence on x86 [1], costing more or less as much as a
>>> cache miss, while the reader needs just a cheap Load* barrier (for
>>> free on x86, cheap on other processor).
>>> If you don't use a volatile field, there is very little guarantee
>>> about what your program means, and it is almost always buggy; the
>>> equivalent in the new C++ memory model (which is otherwise very
>>> similar) gives completely undefined semantics (and even crashes are
>>> possible in practice, for instance if the reader thread sees a
>>> not-fully-initialized object because of the race and invokes a virtual
>>> method on it).
>>>
>>> [1] http://g.oswego.edu/dl/jmm/cookbook.html
>>>
>>> == Other stuff ==
>>>> different options will have to be implemented and tested when we get there,
>>>> speaking on which, is there a test dict load?
>>>> does anyone have some profiling data, what portion of runtime is spent
>>>> reading and writing attributes and globals anyways?
>>>
>>> I'd be interested as well in such data.
>>>
>>>> Anyhow, back to parallel interpretation, Is there an issue or page
>>>> that tracks what's needed for parallel interpretation?
>>>> so far  mentioned: gc, dict, c api
>>>
>>>> Btw, I think that jit is more important at the moment, but time comes
>>>> when jit juice has been mostly squeezed out ;-)
>>>
>>> --
>>> Paolo Giarrusso - Ph.D. Student
>>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>>>
>>
>
>
>
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
>