[Numpy-discussion] numpy.random and multiprocessing
Bruce Southey
bsouthey at gmail.com
Thu Dec 11 11:20:48 EST 2008
Gael Varoquaux wrote:
> Hi there,
>
> I have been using the multiprocessing module a lot to do statistical tests
> such as Monte Carlo or resampling, and I have just discovered something
> that makes me wonder if I haven't been accumulating false results. Given
> two files:
>
> === test.py ===
> from test_helper import task
> from multiprocessing import Pool
>
> p = Pool(4)
>
> jobs = list()
> for i in range(4):
>     jobs.append(p.apply_async(task, (4, )))
>
> print [j.get() for j in jobs]
>
> p.close()
> p.join()
>
> === test_helper.py ===
> import numpy as np
>
> def task(x):
>     return np.random.random(x)
>
> =======
>
> If I run test.py, I get:
>
> [array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]), array([
> 0.35773964, 0.63945684, 0.50855196, 0.08631373]), array([ 0.35773964,
> 0.63945684, 0.50855196, 0.08631373]), array([ 0.65357725, 0.35649382,
> 0.02203999, 0.7591353 ])]
>
> In other words, the 4 processes give me the same exact results.
>
> Now I understand why this is the case: the different instances of the
> random number generator were created by forking from the same process,
> so they are exactly the very same object. This is however a fairly bad
> trap. I guess other people will fall into it.
>
> The take home message is:
> **call 'numpy.random.seed()' when you are using multiprocessing**
>
> I wonder if we can find a way to make this more user friendly? Would it
> be easy, in the C code, to check if the PID has changed, and if so reseed
> the random number generator? I can open up a ticket for this if people
> think this is desirable (I think so).
>
> On a side note, there are a score of functions in numpy.random with
> __module__ set to None. This makes it inconvenient to use them with
> multiprocessing (for instance it forced the creation of the 'test_helper'
> file here).
>
> Gaël
>
Part of this is one of the gotchas of simulation, and it is not specific
to multiprocessing or Python. It is just highly likely to occur in your
case with multiprocessing, but it can also occur in a single process. As
David indicated, many applications use a single source (often the
computer's clock) to initialize the pseudo-random generator if an actual
seed is not supplied. Most generators require an integer seed, so,
depending on the resolution of the source, minor changes may not be
sufficient to change the seed. The same seed will then get used if the
source has not sufficiently 'advanced' before the next initialization.
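As a rough illustration of that last point, here is a minimal sketch of what
happens when two generators are seeded from a clock whose reading is truncated
to an integer. The helper seed_from_clock and the two timestamps are
hypothetical, made up for this example. The standard library random module is
used only to keep the sketch dependency-free; it is not a description of how
numpy picks its default seed.

```python
import random


def seed_from_clock(timestamp):
    # Hypothetical helper: derive an integer seed by truncating a clock
    # reading (in seconds) to whole seconds.
    return int(timestamp)


# Two arbitrary, illustrative clock readings taken a quarter of a second apart.
t0 = 1229012448.10
t1 = 1229012448.35

seed_a = seed_from_clock(t0)
seed_b = seed_from_clock(t1)

# Each generator is initialized separately, but from the same integer seed.
rng_a = random.Random(seed_a)
rng_b = random.Random(seed_b)

stream_a = [rng_a.random() for _ in range(4)]
stream_b = [rng_b.random() for _ in range(4)]

# The clock has not 'advanced' enough to change the seed, so the two
# streams are identical.
streams_match = (stream_a == stream_b)
```

With a finer-grained source, or a gap of more than one second between the two
readings, the seeds would differ and so would the streams.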
If you really care about reproducing the streams, you should specify the
seed anyhow.
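One way to do that, sketched here under my own assumptions, is to pass an
explicit seed to each task and build a private numpy.random.RandomState from
it. The two-argument task signature and the seed values are illustrative and
are not taken from the test_helper.py quoted above.

```python
import numpy as np


def task(seed, n):
    # Build a generator private to this task from an explicit seed,
    # instead of relying on the global numpy.random state.
    rng = np.random.RandomState(seed)
    return rng.random_sample(n)


# Arbitrary illustrative seeds, one per job.
seeds = [1000, 1001, 1002, 1003]

first_run = [task(s, 4) for s in seeds]
second_run = [task(s, 4) for s in seeds]

# The same seed reproduces the same stream on every run.
reproducible = all(np.array_equal(a, b)
                   for a, b in zip(first_run, second_run))

# Different seeds give different streams for different jobs.
distinct = not np.array_equal(first_run[0], first_run[1])
```

With a multiprocessing Pool the same function could be submitted as
p.apply_async(task, (seed, 4)), so that each job carries its own seed and does
not depend on whatever state the worker process inherited.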
Bruce