[Numpy-discussion] numpy.random and multiprocessing

Bruce Southey bsouthey at gmail.com
Thu Dec 11 11:20:48 EST 2008


Gael Varoquaux wrote:
> Hi there,
>
> I have been using the multiprocessing module a lot to do statistical tests
> such as Monte Carlo or resampling, and I have just discovered something
> that makes me wonder if I haven't been accumulating false results. Given
> two files:
>
> === test.py ===
> from test_helper import task
> from multiprocessing import Pool
>
> p = Pool(4)
>
> jobs = list()
> for i in range(4):
>     jobs.append(p.apply_async(task, (4, )))
>
> print [j.get() for j in jobs]
>
> p.close()
> p.join()
>
> === test_helper.py ===
> import numpy as np
>
> def task(x):
>     return np.random.random(x)
>
> =======
>
> If I run test.py, I get:
>
> [array([ 0.35773964,  0.63945684,  0.50855196,  0.08631373]), array([
> 0.35773964,  0.63945684,  0.50855196,  0.08631373]), array([ 0.35773964,
> 0.63945684,  0.50855196,  0.08631373]), array([ 0.65357725,  0.35649382,
> 0.02203999,  0.7591353 ])]
>
> In other words, the 4 processes give me exactly the same results.
>
> Now I understand why this is the case: the different instances of the
> random number generator were created by forking from the same process,
> so they are exactly the same object. This is, however, a fairly bad
> trap, and I expect other people will fall into it as well.
>
> The take-home message is:
> **call 'numpy.random.seed()' when you are using multiprocessing**
>
> I wonder if we can find a way to make this more user friendly? Would it
> be easy, in the C code, to check whether the PID has changed and, if so,
> reseed the random number generator? I can open a ticket for this if
> people think it is desirable (I think so).
>
> On a side note, there are a score of functions in numpy.random with
> __module__ set to None. That makes them inconvenient to use with
> multiprocessing (for instance, it forced the creation of the
> 'test_helper' file here).
>
> Gaël
Part of this is one of the gotchas of simulation that is not specific 
to multiprocessing or Python. It is just highly likely to occur in your 
case with multiprocessing, but it happens in single processing as well. 
As David indicated, many applications use a single source (often the 
computer clock) to initialize the pseudo-random generator if an actual 
seed is not supplied. Depending on the resolution (most require an 
integer), small changes in that source may not be enough to change the 
seed, so the same seed gets reused if the source has not sufficiently 
'advanced' before the next initialization.
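
One way to avoid the fork-inherited state (a sketch, not from the 
original code; the 'reseed' helper is made up for illustration) is to 
re-seed the global generator once per worker with a Pool initializer; 
numpy.random.seed() with no argument pulls fresh entropy from the OS:

=== test_reseed.py (sketch) ===
import numpy as np
from multiprocessing import Pool

from test_helper import task   # same helper module as above

def reseed():
    # With no argument, numpy.random.seed() reads from /dev/urandom
    # (or falls back to the clock), so each forked worker gets its
    # own generator state.
    np.random.seed()

p = Pool(4, initializer=reseed)

jobs = list()
for i in range(4):
    jobs.append(p.apply_async(task, (4, )))

print [j.get() for j in jobs]

p.close()
p.join()
=======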

If you really care about reproducing the streams, you should specify 
the seeds explicitly anyhow.
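
For example (a sketch, assuming the seed is passed in as an extra 
argument; the seed values below are arbitrary), each task can build its 
own RandomState, so the streams are both independent and reproducible:

=== test_helper.py (sketch) ===
import numpy as np

def task(x, seed):
    # A local generator built from an explicit seed: nothing is shared
    # between workers, and the same seeds reproduce the same streams.
    rng = np.random.RandomState(seed)
    return rng.random_sample(x)
=======

and in the driver script:

    seeds = [1234, 5678, 9012, 3456]   # arbitrary, chosen up front
    jobs = [p.apply_async(task, (4, s)) for s in seeds]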

Bruce



