[Numpy-discussion] SeedSequence.spawn()

Robert Kern robert.kern at gmail.com
Fri Aug 27 10:59:52 EDT 2021


joblib is a library that uses clever caching of function call results to
make the development of certain kinds of data-heavy computational pipelines
easier. In order to derive the key to be used to check the cache, joblib
has to look at the arguments passed to the function, which may
involve usually-nonhashable things like large numpy arrays.

  https://joblib.readthedocs.io/en/latest/

So they constructed joblib.hash() which basically takes the arguments,
pickles them into a bytestring (with some implementation details), then
computes an MD5 hash on that. It's probably overkill for your keys, but
it's easily available and quite generic. It returns a hex-encoded string of
the 128-bit MD5 hash. `int(..., 16)` parses that string as a base-16
number; since a hex digest carries no sign, the result is always a
non-negative (almost-certainly positive!) integer that can be fed into
SeedSequence.
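
To make the conversion concrete, here is a rough sketch using only the
standard library. The `stronghash()` below is a hypothetical stand-in that
follows the same pickle-then-MD5 idea; it is not joblib's actual
implementation, and pickling is only assumed to be stable for simple keys
such as tuples of ints and strings.

```python
import hashlib
import pickle


def stronghash(key):
    """Hypothetical stand-in for int(joblib.hash(key), 16)."""
    # Pickle the key to bytes; the protocol is pinned so the bytes do not
    # change if Python's default protocol changes.
    payload = pickle.dumps(key, protocol=4)
    # 128-bit MD5 digest as a 32-character hex string.
    hexdigest = hashlib.md5(payload).hexdigest()
    # A hex string has no sign, so the parsed integer is always >= 0.
    return int(hexdigest, 16)


value = stronghash(("model", 3))
assert 0 <= value < 2**128
```

MD5 is used here only to derive a well-mixed key, not for anything
security-related.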

On Fri, Aug 27, 2021 at 5:03 AM Stig Korsnes <stigkorsnes at gmail.com> wrote:

> Thank you Robert!
> This scheme fits perfectly into what I'm trying to accomplish! :) The
> "smooshing" of ints by supplying a list of ints had eluded me. Thank you
> also for the pointer about built-in hash(). I would not be able to rely on
> it anyway, because it can return negative ints, and SeedSequence requires
> non-negative ones. If you have a minute to spare: Could you briefly
> explain "int(joblib.hash(key), 16)"
> <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>,
> and would this always return non-negative integers?
> Thanks again!
>
> On Thu, Aug 26, 2021 at 22:59, Robert Kern <robert.kern at gmail.com> wrote:
>
>> On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes <stigkorsnes at gmail.com>
>> wrote:
>>
>>> Hi,
>>> Is there a way to uniquely spawn child seeds?
>>> I'm doing Monte Carlo analysis, where I have n random processes, each
>>> with their own generator.
>>> All process models instantiate a generator with default_rng(), i.e.
>>> ss = SeedSequence(); cs = ss.spawn(n), using cs[i] for process i. Now,
>>> the problem I'm facing is that the results for an individual process
>>> depend on the order of process initialization and on the number of
>>> processes used.
>>> However, if I could spawn children with a unique identifier, I would be
>>> able to reproduce my individual results without having to pickle/log
>>> states. For example, all my models have an id (tuple) field which is
>>> hashable.
>>> If I had the ability to SeedSequence(x).spawn([objects]) where objects
>>> support hash(object), I would have reproducibility for all my processes. I
>>> could do without the spawning, but then I would probably lose independence
>>> when I use multiprocessing? Is there a way to achieve my goal in the
>>> current version 1.21 of numpy?
>>>
>>
>> I would probably not rely on `hash()` as it is only intended to be pretty
>> good at getting distinct values from distinct inputs. If you can combine
>> the tuple objects into a string of bytes in a reliable, collision-free way
>> and use one of the cryptographic hashes to get them down to a 128-bit
>> number, that'd be ideal. `int(joblib.hash(key), 16)`
>> <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>
>> should do nicely. You can combine that with your main process's seed
>> easily. SeedSequence can take arbitrary amounts of integer data and smoosh
>> them all together. The spawning functionality builds off of that, but you
>> can also just manually pass in lists of integers.
>>
>> Let's call that function `stronghash()`. Let's call your main process
>> seed number `seed` (this is the thing that the user can set on the
>> command-line or something you get from `secrets.randbits(128)` if you need
>> a fresh one). Let's call the unique tuple `key`. You can build the
>> `SeedSequence` for each job according to the `key` like so:
>>
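>> from numpy.random import SeedSequence
>>
>> root_ss = SeedSequence(seed)
>> for key, data in jobs:
>>     # Job-specific entropy first, then the program-wide seed.
>>     child_ss = SeedSequence([stronghash(key), seed])
>>     submit_job(key, data, seed=child_ss)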
>>
>> Now each job will get its own unique stream regardless of the order in
>> which jobs are assigned. When the user reruns it with the same root `seed`, they
>> will get the same results. When the user chooses a different `seed`, they
>> will get another set of results (this is why you don't want to just use
>> `SeedSequence(stronghash(key))` all by itself).
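
A small check of both properties, as a sketch under stated assumptions:
`stronghash()` here is a hypothetical stand-in built on `hashlib` and
`repr()` (adequate only for simple keys), and the keys and seed values are
made up for illustration.

```python
import hashlib

from numpy.random import SeedSequence, default_rng


def stronghash(key):
    # Hypothetical stand-in for int(joblib.hash(key), 16); repr() is only
    # adequate for simple keys such as tuples of ints and strings.
    return int(hashlib.md5(repr(key).encode("utf-8")).hexdigest(), 16)


def first_draw(key, seed):
    # Job-specific entropy first, then the program-wide seed.
    child_ss = SeedSequence([stronghash(key), seed])
    return default_rng(child_ss).random()


seed = 12345  # hypothetical root seed
keys = [("a", 1), ("b", 2), ("c", 3)]

forward = {k: first_draw(k, seed) for k in keys}
backward = {k: first_draw(k, seed) for k in reversed(keys)}
other_seed = {k: first_draw(k, seed + 1) for k in keys}

assert forward == backward    # creation order does not matter
assert forward != other_seed  # a new root seed gives new results
```
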
>>
>> I put the job-specific seed data ahead of the main program's seed to be
>> on the super-safe side. The spawning mechanism will append integers to the
>> end, so there's a super-tiny chance somewhere down a long line of
>> `root_ss.spawn()`s that there would be a collision (and I mean
>> super-extra-tiny). But best practices cost nothing.
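
To see what spawning appends, one can inspect the `entropy` and `spawn_key`
attributes of spawned children. This is only an illustration with a made-up
root seed: each child keeps the parent's entropy and gets one more integer
on the end of its `spawn_key`.

```python
from numpy.random import SeedSequence

root_ss = SeedSequence(12345)  # hypothetical root seed
children = root_ss.spawn(3)
grandchild = children[1].spawn(1)[0]

# Children keep the parent's entropy; only spawn_key grows.
print(children[1].entropy, children[1].spawn_key)  # 12345 (1,)
print(grandchild.spawn_key)                        # (1, 0)
```

Putting the job hash in front of `seed` keeps it away from the positions
that spawning extends.
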
>>
>> I hope that helps and is not too confusing!
>>
>> --
>> Robert Kern
>>
>

-- 
Robert Kern