[Python-ideas] solving multi-core Python

Thu Jun 25 00:10:48 CEST 2015

I'm going to break mail client threading and also answer some of your
other emails here.

On Tue, Jun 23, 2015 at 10:26 PM, Eric Snow <ericsnowcurrently at gmail.com> wrote:
> It sounded like you were suggesting that we factor out a common code
> base that could be used by multiprocessing and the other machinery and
> that only multiprocessing would keep the pickle-related code.

Yes, I like that idea a lot.

>> Compare
>> with forking, where the initialization is all done and then you fork,
>> and you are immediately ready to serve, using the data structures
>> shared with all the other workers, which is only copied when it is
>> written to. So forking starts up faster and uses less memory (due to
>> shared memory.)
>
> But we are aiming for a share-nothing model with an efficient
> object-passing mechanism.  Furthermore, subinterpreters do not have to
> be single-use.  My proposal includes running tasks in an existing
> subinterpreter (e.g. executor pool), so that start-up cost is
> mitigated in cases where it matters.
>
> Note that ultimately my goal is to make it obvious and undeniable that
> Python (3.6+) has a good multi-core story.  In my proposal,
> subinterpreters are a means to an end.  If there's a better solution
> then great!  As long as the real goal is met I'll be satisfied. :)
> For now I'm still confident that the subinterpreter approach is the
> best option for meeting the goal.

Ahead of time: the following is my opinion. My opinions are my own,
and bizarre, unlike the opinions of my employer and coworkers. (Who
are also reading this maybe.)

So there are two reasons I can think of to use threads for CPU parallelism:

- My thing does a lot of parallel work, and so I want to save on
memory by sharing an address space

This only becomes an especially pressing concern if you start running
tens of thousands of workers or more. Fork also allows this kind of sharing.

- My thing does a lot of communication, and so I want fast
communication through a shared address space

This can become a pressing concern immediately, and so is a more
visible issue. However, it's also a non-problem for many kinds of
tasks which just take requests in and put output back out, without
talking with other members of the pool (e.g. writing an RPC server or
HTTP server.)
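
To make the fork point concrete, here is a minimal sketch of the
pre-fork pattern, assuming a Unix system. The names LOOKUP and
run_forked are made up for illustration. The parent does the expensive
setup once, and each child reads the inherited data without any
up-front copy.

```python
import os

# Hypothetical "expensive" initialization, done once in the parent.
LOOKUP = {n: n * n for n in range(100_000)}


def run_forked(keys):
    """Fork one child per key; each child answers from the inherited LOOKUP."""
    children = []
    for key in keys:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: LOOKUP is already here, shared copy-on-write.
            os.close(read_fd)
            os.write(write_fd, str(LOOKUP[key]).encode())
            os._exit(0)
        # Parent: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as pipe:
            results.append(int(pipe.read()))
        os.waitpid(pid, 0)
    return results
```

Note that in CPython even a read like LOOKUP[key] writes to the
object's refcount, so the touched pages get copied in the child anyway.
That is the problem the refcount patches discussed further down are
aimed at.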

I would also speculate that once you're on many machines, unless
you're very specific with your design, RPC costs dominate IPC costs to
the point where optimizing IPC doesn't do a lot for you.

On Unix, IPC can be free or cheap due to shared memory.

Threads really aren't all that important, and if we need them, we have
them. When people tell me in #python that multicore in Python is bad
because of the GIL, I point them at fork and at C extensions, but also
at PyPy-STM and Jython. Everything has problems, but then so does this
proposal, right?

> And this is faster than passing objects around within the same
> process?  Does it play well with Python's memory model?

As for whether it plays well with the memory model,
multiprocessing.Value() just works, today. To make it even lower
overhead (not construct an int PyObject* on the fly), you need to
change things, e.g. the way refcounts work. I think it's possibly
feasible. If not, at least the overhead would be negligible.

Same applies to strings and other non-compound datatypes. Compound
datatypes are hard even for the subinterpreter case, just because the
objects you're referring to are not likely to exist on the other end,
so you need a real copy. I'm sure you've thought about this.
multiprocessing.Array has a solution for this, which is to unbox the
contained values. It won't work with tuples.
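
A small sketch of what that looks like today, assuming a platform
where the "fork" start method is available. The worker and demo
functions are hypothetical; Value, Array, Process and get_context are
the real multiprocessing API.

```python
import multiprocessing as mp

# Assumption: a Unix-like platform where the "fork" start method exists.
ctx = mp.get_context("fork")


def worker(counter, samples):
    # Value and Array live in shared memory; each carries its own lock.
    with counter.get_lock():
        counter.value += 1
    with samples.get_lock():
        for i in range(len(samples)):
            samples[i] *= 2.0


def demo(n_workers=2):
    counter = ctx.Value("i", 0)                    # one unboxed C int
    samples = ctx.Array("d", [1.0, 2.0, 3.0])      # unboxed C doubles
    procs = [ctx.Process(target=worker, args=(counter, samples))
             for _ in range(n_workers)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return counter.value, samples[:]
```

Every read of counter.value or samples[i] still builds a fresh Python
object on the fly from the raw C value, which is the overhead mentioned
above.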

> I'd be interested in more info on both the refcount freezing and the
> separate refcounts pages.

I can describe the patches:

- separate refcounts replaces refcount with a pointer to refcount, and
changes incref/decref.
- refcount freezing lets you walk all objects and set the reference
count to a magic value. incref/decref check if the refcount is frozen
before working.

With freezing, unlike this approach to separate refcounts, anyone that
touches the refcount manually will just dirty the page and unfreeze
the refcount, rather than crashing the process.
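
The freezing idea can be modelled in a few lines of pure Python. This
is only a toy model, not the actual patch: the real change would live
in the C incref/decref macros, and the FROZEN value, ToyObject and
freeze_all below are all invented for illustration.

```python
# Hypothetical magic value; the value used by the real patch is not stated here.
FROZEN = 1 << 62


class ToyObject:
    """Stand-in for a PyObject header with just a reference count."""

    def __init__(self):
        self.refcount = 1


def incref(obj):
    # A frozen object is never written to, so its memory page stays shared.
    if obj.refcount != FROZEN:
        obj.refcount += 1


def decref(obj):
    if obj.refcount != FROZEN:
        obj.refcount -= 1


def freeze_all(objects):
    """Walk all objects and pin their counts, e.g. right before forking."""
    for obj in objects:
        obj.refcount = FROZEN
```

Code that bypasses incref and bumps the field directly turns FROZEN
into FROZEN + 1. In this model the object simply becomes unfrozen (and
its page dirty), with a count so large it is never freed, which matches
the "no crash" behaviour described above.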

Both of them will decrease performance for non-forking Python code,
but for forking code it can be made up for e.g. by increased worker
lifetime and decreased rate of page copying, plus the whole CPU vs
memory tradeoff.

I legitimately don't remember the difference in performance, which is
good because I'm probably not allowed to say what it was, as it was
tested on our actual app and not microbenchmarks. ;)

>> And remember that we *do* have many examples of people using
>> parallelized Python code in production. Are you sure you're satisfying
>> their concerns, or whose concerns are you trying to satisfy?
>
> Another good point.  What would you suggest is the best way to find out?

I don't necessarily mean that. I mean that this thread feels like you
posed an answer and I'm not sure what the question is. Is it about
solving a real technical problem? What is that, and who does it
affect? A new question I didn't ask before: is the problem with Python
as a whole, or just CPython?

-- Devin