[Numpy-discussion] Proposal: numpy.random.random_seed

Tue May 17 08:34:26 EDT 2016

On Tue, May 17, 2016 at 4:49 AM, Robert Kern <robert.kern at gmail.com> wrote:

> On Tue, May 17, 2016 at 9:09 AM, Stephan Hoyer <shoyer at gmail.com> wrote:
> >
> > On Tue, May 17, 2016 at 12:18 AM, Robert Kern <robert.kern at gmail.com>
> wrote:
> >>
> >> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer <shoyer at gmail.com>
> wrote:
> >> > 1. When writing a library of stochastic functions that take a seed as
> an input argument, and some of these functions call multiple other such
> stochastic functions. Dask is one such example [1].
> >>
> >> Can you clarify the use case here? I don't really know what you are
> doing here, but I'm pretty sure this is not the right approach.
> >
> > Here's a contrived example. Suppose I've written a simulator for cars
> that consists of a number of loosely connected components (e.g., an engine,
> brakes, etc.). The behavior of each component of our simulator is
> stochastic, but we want everything to be fully reproducible, so we need to
> use seeds or RandomState objects.
> >
> > We might write our simulate_car function like the following:
> >
> > def simulate_car(engine_config, brakes_config, seed=None):
> >     rs = np.random.RandomState(seed)
> >     engine = simulate_engine(engine_config, seed=rs.random_seed())
> >     brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
> >     ...
> >
> > The problem with passing the same RandomState object (either explicitly
> or dropping the seed argument entirely and using the global state) to both
> simulate_engine and simulate_brakes is that it breaks encapsulation -- if I
> change what I do inside simulate_engine, it also affects the brakes.
>
> That's a little too contrived, IMO. In most such simulations, the
> different components interact with each other in the normal course of the
> simulation; that's why they are both joined together in the same simulation
> instead of being two separate runs. Unless the components are being run
> across a process or thread boundary (a la dask below) where true
> nondeterminism comes into play, I don't think you want these
> semi-independent streams. This seems to be the advice du jour from the
> agent-based modeling community.
>

Here is a similar use case where I had to switch to using several
RandomState objects.

In a Monte Carlo experiment with increasing sample size, I want two random
variables, x and y, to have the same draws in the common initial
observations.

If I draw x and y sequentially from a common RandomState and then increase
the number of observations for the simulation, the draws for the second
variable change completely.

With separate random states, increasing from 1000 to 1200 observations
leaves the first 1000 draws unchanged.
(This reduces the Monte Carlo noise, for example when calculating the power
of a hypothesis test as a function of the sample size.)
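
A minimal sketch of the difference, assuming standard normal draws. The
function names and the seed values 0 and 1 are made up for illustration;
small integer seeds are only a placeholder here and say nothing about how
well separated the two streams are.

```python
import numpy as np


def draw_xy_shared(nobs, seed=0):
    # One shared stream: x is drawn first, then y continues from
    # wherever x stopped, so y depends on how many x values were drawn.
    rs = np.random.RandomState(seed)
    x = rs.standard_normal(nobs)
    y = rs.standard_normal(nobs)
    return x, y


def draw_xy_separate(nobs, seed_x=0, seed_y=1):
    # One stream per variable: a larger nobs only extends each sequence.
    x = np.random.RandomState(seed_x).standard_normal(nobs)
    y = np.random.RandomState(seed_y).standard_normal(nobs)
    return x, y


# Shared stream: x keeps its first 1000 draws, y does not.
x1, y1 = draw_xy_shared(1000)
x2, y2 = draw_xy_shared(1200)
shared_x_same = np.array_equal(x1, x2[:1000])
shared_y_same = np.array_equal(y1, y2[:1000])

# Separate streams: both variables keep their first 1000 draws.
x3, y3 = draw_xy_separate(1000)
x4, y4 = draw_xy_separate(1200)
sep_x_same = np.array_equal(x3, x4[:1000])
sep_y_same = np.array_equal(y3, y4[:1000])

print(shared_x_same, shared_y_same, sep_x_same, sep_y_same)
```

With the shared stream only x is stable across sample sizes; with one
RandomState per variable both x and y are.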

Josef

>
>
> > The dask use case is actually pretty different -- the intent is to
> create many random numbers in parallel using multiple threads or processes
> (possibly in a distributed fashion). I know that skipping ahead is the
> standard way to get independent number streams for parallel sampling, but
> that isn't exposed in numpy.random, and setting distinct seeds seems like a
> reasonable alternative for scientific computing use cases.
>
> Forget about integer seeds. Those are for human convenience. If you're not
> jotting them down in your lab notebook in pen, you don't want an integer
> seed.
>
> What you want is a function that returns many RandomState objects that are
> hopefully spread around the MT19937 space enough that they are essentially
> independent (in the absence of true jumpahead). The better implementation
> of such a function would look something like this:
>
> import numpy as np
>
> def spread_out_prngs(n, root_prng=None):
>     if root_prng is None:
>         root_prng = np.random
>     elif not isinstance(root_prng, np.random.RandomState):
>         root_prng = np.random.RandomState(root_prng)
>     sprouted_prngs = []
>     for i in range(n):
>         # add dtype=np.uint32 under 1.11
>         seed_array = root_prng.randint(1<<32, size=624)
>         sprouted_prngs.append(np.random.RandomState(seed_array))
>     return sprouted_prngs
>
> Internally, this generates seed arrays of about the size of the MT19937
> state to make sure that you can access more of the state space. That will
> at least make the chance of collision tiny. And it can be easily rewritten
> to take advantage of one of the newer PRNGs that have true independent
> streams:
>
>   https://github.com/bashtage/ng-numpy-randomstate
>
> --
> Robert Kern