[Numpy-discussion] NEP: Random Number Generator Policy

Sun Jun 10 23:01:20 EDT 2018

On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >
> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern <robert.kern at gmail.com>
> wrote:
> >>
> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >>>
> >>> It may be worth having a look at test suites for scipy, statsmodels,
> scikit-learn, etc. and estimate how much work this NEP causes those
> projects. If the devs of those packages are forced to do large scale
> migrations from RandomState to StableState, then why not instead keep
> RandomState and just add a new API next to it?
> >>
> >> The problem is that we can't really have an ecosystem with two
> different general purpose systems.
> >
> > Can't = prefer not to.
>
> I meant what I wrote. :-)
>
> > But yes, that's true. That's not what I was saying though. We want one
> generic one, and one meant for unit testing only. You can achieve that in
> two ways:
> > 1. Change the current np.random API to new generic, and add a new
> RandomStable for unit tests.
> > 2. Add a new generic API, and document the current np.random API as
> being meant for unit tests only, for other usage <new API> should be
> preferred.
> >
> > (2) has a couple of pros:
> > - you're not forcing almost every library and end user out there to
> migrate their unit tests.
>
> But it has the cons that I talked about. RandomState *is* a fully
> functional general purpose PRNG system. After all, that's its current use.
> Documenting it as intended to be something else will not change that fact.
> Documentation alone provides no real impetus to move to the new system
> outside of the unit tests. And the community does need to move together to
> the new system in their library code, or else we won't be able to combine
> libraries together; these PRNG objects need to thread all the way through
> between code from different authors if we are to write programs with a
> controlled seed. The failure mode when people don't pay attention to the
> documentation is that I can no longer write programs that compose these
> libraries together. That's why I wrote "can't". It's not a mere preference
> for not having two systems to maintain. It has binary Go/No Go implications
> for building reproducible programs.
>

I strongly suspect you are right, but only because you're asserting "can't"
so heavily. I have trouble formulating what would go wrong in case there's
two PRNGs used in a single program. It's not described in the NEP, nor in
the numpy.random docs (those don't even have any recommendations for best
practices listed as far as I can tell - that needs fixing). All you explain
in the NEP is that reproducible research isn't helped by the current
stream-compat guarantee. So a bit of (probably incorrect) devil's advocate
reasoning:
- If there's no stream-compat guarantee, all a user can rely on is the
properties of drawing from a seeded PRNG.
- Any use of a PRNG in library code can also only rely on properties
- So now whether in a user's program libraries draw from one or two seeded
PRNGs doesn't matter for reproducibility, because those properties don't
change.

Also, if there is to be a multi-year transitioning to the new API, would
there be two PRNG systems anyway during those years?

> > - more design freedom for the new generic API. The current one is
> clearly sub-optimal; in a new one you wouldn't have to expose all the
> global state/functions that np.random exposes now. You could even restrict
> it to a single class and put that in the main numpy namespace.
>
> I'm not sure why you are talking about the global state and np.random.*
> convenience functions. What we do with those functions is out of scope for
> this NEP and would be talked about it another NEP fully introducing the new
> system.
>

To quote you from one of the first emails in this thread: "
I deliberately left it out of this one as it may, depending on our choices,
impinge upon the design of the new PRNG subsystem, which I declared out of
scope for this NEP. I have ideas (besides the glib "Let them eat
AttributeErrors!"), and now that I think more about it, that does seem like
it might be in scope just like the discussion of freezing RandomState and
StableRandom are. But I think I'd like to hold that thought a little bit
and get a little more screaming^Wfeedback on the core proposal first. I'll
return to this in a few days if not sooner.
"

So consider this some screaming^Wfeedback:)

>
> >> To properly use pseudorandom numbers, I need to instantiate a PRNG and
> thread it through all of the code in my program: both the parts that I
> write and the third party libraries that I don't write.
> >>
> >> Generating test data for unit tests is separable, though. That's why I
> propose having a StableRandom built on the new architecture. Its purpose
> would be well-documented, and in my proposal is limited in features such
> that it will be less likely to be abused outside of that purpose. If you
> make it fully-featured, it is more likely to be abused by building library
> code around it. But even if it is so abused, because it is built on the new
> architecture, at least I can thread the same core PRNG state through the
> StableRandom distributions from the abusing library and use the better
> distributions class elsewhere (randomgen names it "Generator"). Just
> keeping RandomState around can't work like that because it doesn't have a
> replaceable core PRNG.
> >>
> >> But that does suggest another alternative that we should explore:
> >>
> >> The new architecture separates the core uniform PRNG from the wide
> variety of non-uniform probability distributions. That is, the core PRNG
> state is encapsulated in a discrete object that can be shared between
> instances of different distribution-providing classes. numpy.random should
> provide two such distribution-providing classes. The main one (let us call
> it ``Generator``, as it is called in the prototype) will follow the new
> policy: distribution methods can break the stream in feature releases.
> There will also be a secondary distributions class (let us call it
> ``LegacyGenerator``) which contains distribution methods exactly as they
> exist in the current ``RandomState`` implementation. When one combines
> ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the
> exact same stream as ``RandomState`` for all distribution methods. The
> ``LegacyGenerator`` methods will be forever frozen.
> ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with
> the MT19937 core PRNG, and whatever tricks needed to make
> ``isinstance(prng, RandomState)`` and unpickling work should be done. This
> way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be
> deprecated, becoming progressively noisier over a number of release cycles,
> in favor of explicitly instantiating ``LegacyGenerator``.
> >>
> >> ``LegacyGenerator`` CAN be used during this deprecation period in
> library and application code until libraries and applications can migrate
> to the new ``Generator``. Libraries and applications SHOULD migrate but
> MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test
> data for unit tests where cross-release stability of the streams is
> important. Test writers SHOULD consider ways to mitigate their reliance on
> such stability and SHOULD limit their usage to distribution methods that
> have fewer cross-platform stability risks.
>
> I would appreciate your consideration of this proposal. Does it address
> your concerns? It addresses my concerns with keeping around a
> fully-functional RandomState implementation.
>

My concerns are:
1. The amount of work caused by making libraries and end users migrate.
2. That this is a backwards compatibility break, which will cause problems
for users who relied on the old guarantees (the arguments in the NEP that
the old guarantees weren't 100% watertight don't mean that backcompat
doesn't matter at all).

As far as I can tell, this new proposal doesn't deal with those concerns
directly. What it does seem to do is making transitioning a bit easier for
users that were already using RandomState instances.

Cheers,
Ralf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20180610/cf674237/attachment-0001.html>