[Numpy-discussion] NEP: Random Number Generator Policy

Mon Jun 11 02:43:33 EDT 2018

On Sun, Jun 10, 2018 at 10:36 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >
> > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern <robert.kern at gmail.com>
> wrote:
> >>
> >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >> >
> >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern <robert.kern at gmail.com>
> wrote:
> >> >>
> >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >> >>>
> >> >>> It may be worth having a look at test suites for scipy,
> statsmodels, scikit-learn, etc. and estimate how much work this NEP causes
> those projects. If the devs of those packages are forced to do large scale
> migrations from RandomState to StableState, then why not instead keep
> RandomState and just add a new API next to it?
> >> >>
> >> >> The problem is that we can't really have an ecosystem with two
> different general purpose systems.
> >> >
> >> > Can't = prefer not to.
> >>
> >> I meant what I wrote. :-)
> >>
> >> > But yes, that's true. That's not what I was saying though. We want
> one generic one, and one meant for unit testing only. You can achieve that
> in two ways:
> >> > 1. Change the current np.random API to new generic, and add a new
> RandomStable for unit tests.
> >> > 2. Add a new generic API, and document the current np.random API as
> being meant for unit tests only, for other usage <new API> should be
> preferred.
> >> >
> >> > (2) has a couple of pros:
> >> > - you're not forcing almost every library and end user out there to
> migrate their unit tests.
> >>
> >> But it has the cons that I talked about. RandomState *is* a fully
> functional general purpose PRNG system. After all, that's its current use.
> Documenting it as intended to be something else will not change that fact.
> Documentation alone provides no real impetus to move to the new system
> outside of the unit tests. And the community does need to move together to
> the new system in their library code, or else we won't be able to combine
> libraries together; these PRNG objects need to thread all the way through
> between code from different authors if we are to write programs with a
> controlled seed. The failure mode when people don't pay attention to the
> documentation is that I can no longer write programs that compose these
> libraries together. That's why I wrote "can't". It's not a mere preference
> for not having two systems to maintain. It has binary Go/No Go implications
> for building reproducible programs.
> >
> > I strongly suspect you are right, but only because you're asserting
> "can't" so heavily. I have trouble formulating what would go wrong in case
> there's two PRNGs used in a single program. It's not described in the NEP,
> nor in the numpy.random docs (those don't even have any recommendations for
> best practices listed as far as I can tell - that needs fixing). All you
> explain in the NEP is that reproducible research isn't helped by the
> current stream-compat guarantee. So a bit of (probably incorrect) devil's
> advocate reasoning:
> > - If there's no stream-compat guarantee, all a user can rely on is the
> properties of drawing from a seeded PRNG.
> > - Any use of a PRNG in library code can also only rely on properties
> > - So now whether in a user's program libraries draw from one or two
> seeded PRNGs doesn't matter for reproducibility, because those properties
> don't change.
>
> Correctly making a stochastic program reproducible while retaining good
> statistical properties is difficult. People don't do it well in the best of
> circumstances. The best way that we've found to manage that difficulty is
> to instantiate a single stream and use it all throughout your code. Every
> new stream requires the management of more seeds (unless if we use the
> fancy new algorithms that have settable stream IDs, but by stipulation, we
> don't have these in this case). And now I have to thread both of these
> objects through my code, and pass the right object to each third-party
> library. These third-party libraries don't know anything about this weird
> 2-stream workaround that you are doing, so we now have libraries that can't
> build on each other unless if they are using the same compatible API, even
> if I can make workarounds to build a program that combines two libraries
> side-to-side.
>
> So yeah, people "can" do this. "It's just a matter of code" as my boss
> likes to say. But it's making an already-difficult task more difficult.
>

Okay, that makes more sense to me now. It would be really useful to
document such best practices and rationales.

Note that scipy.stats distributions allow passing in either a RandomState
instance or an integer as seed (which will be used for seeding a new
instance, not for np.random.seed) [1]. That seems like a fine design
pattern as well, and passing on a seed that way is fairly easy and as good
for reproducibility as passing in a single PRNG.

[1]
https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L612

> > Also, if there is to be a multi-year transitioning to the new API, would
> there be two PRNG systems anyway during those years?
>
> Sure, but with a deadline and not-just-documentation to motivate
> transitioning.
>
> But if we follow my alternative proposal, there'll be no need for
> deprecation! You've convinced me to not deprecate RandomState.
>

That's not how I had read it, but great to hear that!

I just want to change some of its internal implementation details, add a
> less-stable set of distributions on the side, and a framework of core
> uniform PRNGs that can be shared by both.
>
> >> > - more design freedom for the new generic API. The current one is
> clearly sub-optimal; in a new one you wouldn't have to expose all the
> global state/functions that np.random exposes now. You could even restrict
> it to a single class and put that in the main numpy namespace.
> >>
> >> I'm not sure why you are talking about the global state and np.random.*
> convenience functions. What we do with those functions is out of scope for
> this NEP and would be talked about it another NEP fully introducing the new
> system.
> >
> > To quote you from one of the first emails in this thread: "
> > I deliberately left it out of this one as it may, depending on our
> choices, impinge upon the design of the new PRNG subsystem, which I
> declared out of scope for this NEP. I have ideas (besides the glib "Let
> them eat AttributeErrors!"), and now that I think more about it, that does
> seem like it might be in scope just like the discussion of freezing
> RandomState and StableRandom are. But I think I'd like to hold that thought
> a little bit and get a little more screaming^Wfeedback on the core proposal
> first. I'll return to this in a few days if not sooner.
> > "
> >
> > So consider this some screaming^Wfeedback:)
>
> Ahem. Yes, I just remembered I said that. :-) But still, there will be
> lots of options about what to do with np.random.*, whatever proposal we go
> with. It doesn't really impose constraints on the core proposals.
>
> >> >> To properly use pseudorandom numbers, I need to instantiate a PRNG
> and thread it through all of the code in my program: both the parts that I
> write and the third party libraries that I don't write.
> >> >>
> >> >> Generating test data for unit tests is separable, though. That's why
> I propose having a StableRandom built on the new architecture. Its purpose
> would be well-documented, and in my proposal is limited in features such
> that it will be less likely to be abused outside of that purpose. If you
> make it fully-featured, it is more likely to be abused by building library
> code around it. But even if it is so abused, because it is built on the new
> architecture, at least I can thread the same core PRNG state through the
> StableRandom distributions from the abusing library and use the better
> distributions class elsewhere (randomgen names it "Generator"). Just
> keeping RandomState around can't work like that because it doesn't have a
> replaceable core PRNG.
> >> >>
> >> >> But that does suggest another alternative that we should explore:
> >> >>
> >> >> The new architecture separates the core uniform PRNG from the wide
> variety of non-uniform probability distributions. That is, the core PRNG
> state is encapsulated in a discrete object that can be shared between
> instances of different distribution-providing classes. numpy.random should
> provide two such distribution-providing classes. The main one (let us call
> it ``Generator``, as it is called in the prototype) will follow the new
> policy: distribution methods can break the stream in feature releases.
> There will also be a secondary distributions class (let us call it
> ``LegacyGenerator``) which contains distribution methods exactly as they
> exist in the current ``RandomState`` implementation. When one combines
> ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the
> exact same stream as ``RandomState`` for all distribution methods. The
> ``LegacyGenerator`` methods will be forever frozen.
> ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with
> the MT19937 core PRNG, and whatever tricks needed to make
> ``isinstance(prng, RandomState)`` and unpickling work should be done. This
> way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be
> deprecated, becoming progressively noisier over a number of release cycles,
> in favor of explicitly instantiating ``LegacyGenerator``.
> >> >>
> >> >> ``LegacyGenerator`` CAN be used during this deprecation period in
> library and application code until libraries and applications can migrate
> to the new ``Generator``. Libraries and applications SHOULD migrate but
> MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test
> data for unit tests where cross-release stability of the streams is
> important. Test writers SHOULD consider ways to mitigate their reliance on
> such stability and SHOULD limit their usage to distribution methods that
> have fewer cross-platform stability risks.
> >>
> >> I would appreciate your consideration of this proposal. Does it address
> your concerns? It addresses my concerns with keeping around a
> fully-functional RandomState implementation.
> >
> > My concerns are:
> > 1. The amount of work caused by making libraries and end users migrate.
> > 2. That this is a backwards compatibility break, which will cause
> problems for users who relied on the old guarantees (the arguments in the
> NEP that the old guarantees weren't 100% watertight don't mean that
> backcompat doesn't matter at all).
> >
> > As far as I can tell, this new proposal doesn't deal with those concerns
> directly. What it does seem to do is making transitioning a bit easier for
> users that were already using RandomState instances.
>
> Let me drop the deprecation of the name RandomState. RandomState(int_seed)
> will forever and always create a backwards- and stream-compatible object.
> No one will have to migrate.
>
> How does that strike you?
>

Sounds good.

Cheers,
Ralf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20180610/fb590ed6/attachment-0001.html>