[Python-ideas] easy thread-safety [was: fork]

Sven R. Kunze srkunze at mail.de
Wed Aug 19 23:10:58 CEST 2015


On 19.08.2015 04:09, Andrew Barnert wrote:
> On Aug 18, 2015, at 13:32, Sven R. Kunze <srkunze at mail.de> wrote:
>> Indeed. I think that is sensible approach here. Speaking of an implementation though, I don't know where I would start when looking at CPython.
>>
>> Thinking more about id(). Consider a complex object like an instance of a class. Is it really necessary to deep copy it? It seems to me that we actually just need to hide the atomic/immutable values (e.g. strings, integers etc.) of that object.
> Why wouldn't hiding the mutable members be just as necessary? In your example, if I can replace Y.x, isn't that even worse than replacing Y.x.a?
That was a question to the experts. I don't know.

Not sure what you mean by 'worse'. Y.x is just a pointer to some value. 
So, if I replace it with something else, that is not 
different/worse/better than replacing Y.x.a, right?
>
>> The object itself can remain the same.
> What does it mean for an object to be "the same" if it potentially holds different values in different threads.
I was talking about id(...) and deep copying.
>> # first thread
>> class X:
>>     a = 0
>> class Y:
>>     x = X
>>
>> #thread spawned by first thread
>> Y.x.a = 3  # should leave id(X) and id(Y) alone  (*)
> OK, but does the second thread see 0 or 3? If the former, then these aren't shared objects at all. If the latter, then that's how things already work.

before (*)
     first thread sees: Y.x.a == 0
     thread spawned by first thread sees: Y.x.a == 0

after (*)
     first thread sees: Y.x.a == 0
     thread spawned by first thread sees: Y.x.a == 3

If that wasn't clear, we were talking about the preferred 'process-like' 
semantics.
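
For illustration only, here is a minimal sketch of that visibility using 
today's threading.local; the names X, _local and spawned are mine and 
not part of any proposal:

import threading

_local = threading.local()

class X:
    a = 0                # the value the first thread keeps seeing

def spawned():
    # The spawned thread writes into its own thread-local slot instead of
    # mutating X, so the first thread still sees a == 0.
    _local.a = 3
    print('spawned thread sees:', getattr(_local, 'a', X.a))   # -> 3

t = threading.Thread(target=spawned)
t.start()
t.join()
print('first thread sees:', getattr(_local, 'a', X.a))         # -> 0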

Just assume for a moment that, by default, Python would wrap up 
variables (as soon as they are shared across two or more threads) like 
this (self stands in for the wrapped variable):

import threading

class ProxyObject:

    def __init__(self, variable):
        self.__original__ = variable            # the one shared object
        self.__threaded__ = threading.local()   # per-thread overrides

    def __proxy_get__(self):
        # fall back to the shared value until this thread writes its own
        return getattr(self.__threaded__, 'value', self.__original__)

    def __proxy_set__(self, value):
        self.__threaded__.value = value

I think you get the idea; it should work like descriptors. Basically, 
descriptors for general access to a variable, not only for class 
attributes => proxy objects. Is there something like that in Python? 
That would vastly simplify the implementation of xfork, btw.
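
As far as I know there is no such general-purpose proxy in the standard 
library; for class attributes, though, the effect can be sketched with 
an ordinary descriptor. Everything below, including the name 
ThreadLocalAttribute, is just an illustration of the idea, not an 
existing API:

import threading

class ThreadLocalAttribute:
    """Descriptor sketch: reads fall back to a shared default,
    writes are only visible to the thread that made them."""

    def __init__(self, default):
        self._default = default
        self._local = threading.local()

    def __get__(self, obj, objtype=None):
        return getattr(self._local, 'value', self._default)

    def __set__(self, obj, value):
        self._local.value = value


class C:
    a = ThreadLocalAttribute(0)    # per-thread view of one class attribute

c = C()

def worker():
    c.a = 3                               # only this thread sees the 3
    print('spawned thread sees:', c.a)    # -> 3

t = threading.Thread(target=worker)
t.start()
t.join()
print('first thread sees:', c.a)          # -> 0

Here the value is stored per class attribute (shared by all instances), 
but it shows the access pattern a general proxy object would need.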

So, to give you an example (still assuming the behavior described 
above), I abuse our venerable thumbnails once more. Let's calculate the 
total sum of the thumbnail bytes created:

  1: images = ['0.jpg', '1.jpg', '2.jpg', '3.jpg', '4.jpg']
  2: sizes = []
  3: for image in images:
  4:     fork create_thumbnails(image, sizes)
  5: wait # for all forks to come back
  6: shared sizes
  7: print('sum:', sum(sizes))
  8:
  9: @io_bound
10: def create_thumbnails(image, sizes):
11:     with open(image) as image_file:
12:         # and so forth
13:     shared sizes
14:     sizes.append(100)


Here, you can see what I meant by explicitly stating that we enter 
dangerous space: the keyword "shared" in lines 6 and 13.

It basically removes the wrapper described above and reveals the 
dangerous/shared state of the object (much like 'global'). So, both 
functions need to agree to lift the veil in order to be able to 
read/modify the shared state.

shared x

translates to:

x = x.__original__
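
Purely as an illustration of that translation (none of this exists 
today; ProxyObject is the sketch from above and unshare plays the role 
of the proposed 'shared' keyword), the thumbnail example could be 
approximated with the existing threading module like this:

import threading

class ProxyObject:
    def __init__(self, variable):
        self.__original__ = variable
        self.__threaded__ = threading.local()

def unshare(proxy):
    # what 'shared x' would translate to: x = x.__original__
    return proxy.__original__

images = ['0.jpg', '1.jpg', '2.jpg', '3.jpg', '4.jpg']
sizes = ProxyObject([])                     # wrapped as soon as it is shared

def create_thumbnails(image, sizes):
    # ... open the image and write the thumbnail ...
    real_sizes = unshare(sizes)             # both sides opt in explicitly
    real_sizes.append(100)                  # the 'dangerous' shared mutation

threads = [threading.Thread(target=create_thumbnails, args=(image, sizes))
           for image in images]
for t in threads:
    t.start()
for t in threads:
    t.join()                                # 'wait' for all forks to come back

print('sum:', sum(unshare(sizes)))          # -> 500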


> Have you looked into the subinterpreters project, the PyParallel project, or the PyPy-STM project, all of which, as I mentioned earlier, are possible ways of getting some of the advantages of process semantics without all of the performance costs? (Although none of them are exactly that, of course.)

Yes, I did. STM is nice as a proof of concept, waiting for HTM. However, 
as I mentioned earlier, I am not sure whether I would really want that 
within the semantics of multiple threads.

Trent Nelson (PyParallel) seems to agree on this. It feels kind of weird 
and would have to be followed by all sorts of workarounds in case of a 
transaction failure.

The general intention of PyParallel seems interesting. It is also all 
about "built-in thread-safety", which is very nice. Trent also agrees 
on 'never share state'.
>
>> To me, a process/thread or any other concurrency solution, is basically a function that I can call but runs in the background. Later, when I am ready, I can collect its result. In the meantime, the main thread continues. (Again) to me, that is the only sensible way to approach concurrency. When I recall the details of locks, semaphores etc. and compare it to what real-world applications really need... You can create huge tables of all the possible cases that might happen just in order to find out that you missed an important one.
> Yes, that is the problem that makes multithreading hard in the first place (except in pure functional languages). If the same value is visible in two threads, and can be changed by either of those threads, you have to start thinking either about lock discipline, or about ordering atomic operations; either way, things get very complicated very fast.
>
> A compromise solution is to allow local mutable objects, but not allow them to be shared between threads; instead, you provide a way to (deep-)copy them between threads, and/or to (destructively) move them between threads. You can do that syntactically, as with the channel operators used by Erlang and the languages it's inspired, or you can do it purely at a semantic level, as with Python's multiprocessing library; the effect is the same: process semantics, or message-passing semantics, or whatever you want to call it gives you the advantages of immutable threading in a language with mutability.
>
>> Even worse, as soon as you change something about your program, you are doomed to redo the complete case analysis, find a dead/live-lock-free solution and so forth. It's a time sink; costly and dangerous from a company's point of view.
> This is an argument for companies to share as little mutable state as possible across threads. If you don't have any shared state at all, you don't need locks or other synchronization mechanisms at all. If you only have very limited and specific shared state, you have very limited and hopefully simple locking, which is a lot easier to keep track of.

I am glad we agree on this. However, just saying it's hard and keeping 
the status quo does not help, I suppose.