[Numpy-discussion] numpy.random and multiprocessing

Gael Varoquaux gael.varoquaux at normalesup.org
Thu Dec 11 11:36:47 EST 2008


On Fri, Dec 12, 2008 at 12:57:26AM +0900, David Cournapeau wrote:
> > [array([ 0.35773964,  0.63945684,  0.50855196,  0.08631373]), array([
> > 0.35773964,  0.63945684,  0.50855196,  0.08631373]), array([ 0.35773964,
> > 0.63945684,  0.50855196,  0.08631373]), array([ 0.65357725,  0.35649382,
> > 0.02203999,  0.7591353 ])]

> > In other words, the 4 processes give me the same exact results.

> Why do you say the results are the same? They don't look the same to
> me - only the first three are the same.

Correct, and I wonder why. When I try on my box right now, I almost
always get the same four arrays, but not all the time. More on that
below.

> > Now I understand why this is the case: the different instances of the
> > random number generator were created by forking from the same process,
> > so they are exactly the very same object. This is however a fairly bad
> > trap. I guess other people will fall into it.

> I am not sure I am following: the objects in python are not the same
> if you fork a process, or I don't understand what you mean by same.
> They may be initialized the same way, though.

Yes, they are initialized with the same seed value. I call them the same
because right after the fork they are identical; they can evolve
separately afterwards. However, our PRNG is completely determined by its
seed, AFAIK.
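
For concreteness, here is my toy example boiled down to a minimal sketch
(assuming the multiprocessing module from the 2.6 standard library; on
Unix the workers are created by fork):

    import numpy as np
    from multiprocessing import Pool

    def draw(_):
        # Each worker inherits the parent's global RandomState via
        # fork, so all workers start from the same PRNG state.
        return np.random.random(4)

    if __name__ == '__main__':
        pool = Pool(4)
        # Typically prints four identical arrays.
        print(pool.map(draw, range(4)))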

> Isn't the problem simply due to seeding from the same value? For such
> a tiny problem (4 tasks whose processing time is negligible), the
> seed will be the same since the intervals between the samplings will
> be small.

Right, but I found the problem in real code that was not tiny at all.

> Taking a look at the mtrand code in numpy, if the seed is not given,
> it is taken from /dev/random if available, or the time clock if not; I
> don't know what the semantics are for concurrent access to /dev/random
> (is it guaranteed that two processes will get different values from
> it?).

> To confirm this, you could try to use your toy example with 500 jobs
> instead of 4: in that case, it is unlikely that they use the same
> underlying value as a starting point, even if there is no guarantee on
> concurrent access to /dev/random.

I found the problem in much bigger code. I have only 8 CPUs, so I run
8 jobs, and each job loops over its tasks. I noticed that the variance
of my results was much smaller than expected. The jobs take 10 minutes
each, so you can't call them tiny or fast. The problem really does
appear in production code.

The way I interpret this is that the seed is created only at
module-import time (this is how I read the code in mtrand.pyx). For all
my processes, the seed was created when numpy was imported in the parent
process. After the fork, the seed is the same in each process. As a
result, the entropy of the whole system is clearly not the entropy of 4
independent systems. As you point out, the fourth value in my toy
example differs from the others, so my picture is not entirely exact.
But the fact remains that the entropy is way too low in my production
code.

I don't understand why, once in a while, there is a value that is
different. That could be because numpy is reimported in the child
processes. If I insert a 'time.sleep' in the for loop that spawns the
processes, I get significantly higher entropy only if the sleep is
around 1 second. Looking at the seeding code (rk_randomseed in
randomkit.c), it seems that /dev/urandom is not used, contrary to what
the random.seed docstring claims; what is really used is _ftime under
Windows and gettimeofday under Unix. It does seem, though, that the
milliseconds are used.
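
To convince myself of what millisecond granularity implies, I played
with a toy stand-in for time-based seeding (an approximation of the
idea, NOT the actual rk_randomseed arithmetic):

    import time

    def toy_clock_seed():
        # Seed from the wall clock at millisecond resolution; a toy
        # approximation, not the real rk_randomseed computation.
        return int(time.time() * 1000)

    # Two processes seeded within the same millisecond would collide:
    print(toy_clock_seed() == toy_clock_seed())  # usually True

That collisions persist even with sleep(0.01) between spawns suggests
the effective granularity in my runs is coarser still.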

I must admit I don't fully understand why this happens. I thought that:

    a) Modules were not reimported with multiprocessing, thanks to the
       fork. If this were true, then reading mtrand.pyx, all
       subprocesses should have the same seed.

    b) /dev/urandom was used to seed. This seems wrong: reading the
       code shows no /dev/urandom in the seeding parts.

    c) Milliseconds were used, so we should be rather safe from these
       race conditions. The code does seem to hint toward that, but if
       I add a sleep(0.01) to my loop, I don't get enough entropy. I
       did check that sleep(0.01) sleeps at least 0.01 seconds.
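
In the meantime, the workaround I am using is to reseed explicitly in
each child before drawing anything. A sketch (the pid/time mixing
recipe below is ad hoc, just something that gives distinct seeds, not
anything numpy does):

    import os
    import time
    import numpy as np
    from multiprocessing import Pool

    def reseed():
        # Ad hoc recipe: fold the pid into a time-based value so that
        # children forked within the same clock tick still diverge.
        np.random.seed((int(time.time() * 1000) ^ os.getpid())
                       % 4294967296)

    def draw(_):
        return np.random.random(4)

    if __name__ == '__main__':
        # The initializer runs once in each worker process.
        pool = Pool(4, initializer=reseed)
        print(pool.map(draw, range(4)))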


> > I wonder if we can find a way to make this more user-friendly. Would
> > it be easy, in the C code, to check if the PID has changed and, if
> > so, reseed the random number generator? I can open up a ticket for
> > this if people think it is desirable (I think so).

> This sounds like too much magic for a very particular use: there may
> be cases where you want the same seed in multiple processes (what if
> your processes are not created by multiprocessing, and you want to
> make sure you have the same seed?).

Well, yes, for code that wants to explicitly control the seed, reseeding
automatically would be a problem, and we would need to figure out a way
to keep things deterministic (e.g. for testing purposes). However, this
is a small use case, and when testing, people need to be aware of
seeding problems anyway (although they might not understand fork
semantics). More and more people are going to be using multiprocessing:
it comes with the standard library, and standard boxes nowadays have
many cores, and will soon have many more. Resampling and brute-force
Monte Carlo techniques are embarrassingly parallel, so people will want
to run them in parallel. I fear many others are going to fall into this
trap.
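
For the deterministic/testing use case, the pattern that seems safe to
me is to derive each job's seed from a single master seed and the job
id, and to use a private RandomState rather than the global one (the
derivation below is only an illustration):

    import numpy as np
    from multiprocessing import Pool

    MASTER_SEED = 42  # fix this and the whole run is reproducible

    def worker(job_id):
        # One independent, reproducible stream per job; the offset
        # scheme is arbitrary but stable across runs.
        rng = np.random.RandomState(MASTER_SEED + job_id)
        return rng.random_sample(4)

    if __name__ == '__main__':
        print(Pool(4).map(worker, range(4)))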

Gaël



