Unsorting (randomizing) a sequence

Aahz Maruch aahz at netcom.com
Sun Aug 22 11:11:07 EDT 1999


In article <000401beebac$7a89b520$0c2d2399 at tim>,
Tim Peters <tim_one at email.msn.com> wrote:
>[Tim, on the difficulty of generating random permutations, because
> n! quickly grows larger than a given random-# generator's period]
>
>[Aahz Maruch]
>> What happens if you re-seed?
>
>By what means?  If you have a good source of random bits to do the reseeding
>with, save yourself the expense of calling random() and just use them
>directly <0.1 wink>.  If your seed generation isn't random, at best you can
>boost the period of the combined generation scheme; at worst, damage the
>pseudo-randomness of both streams.  "It depends", and on everything.

I'm just reseeding with the standard time.time() mechanism.

>> I'm curious because I just wrote some code that re-seeds every
>> hundred calls to random().  In my case, though, it's necessary
>> because I actually don't care that much about randomness per se;
>> I'm using it to generate unique numbers (with duplicate detection),
>> and I need to ensure that if I ever do happen on the same random
>> number sequence as a previous run of the process, it gets aborted
>> reasonably quickly.
>
>Probably better to skip the reseeding, saving random()'s state at
>the end of one run and restoring it at the start of the next --
>random.random()'s period is in the neighborhood of a trillion; it's
>not going to repeat before your current employer goes out of business
><wink>.

That would be a great idea if runs "ended"; however, this is intended to
be a daemon-style process, and it crashes *much* more often than it
ends.  :-(  It certainly wasn't worth the time to create a full-blown
save/restore mechanism for a quick hack.
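
For the record, it's only a few lines if your random module grows a
getstate()/setstate() pair (mine hasn't, so treat this as a sketch;
the STATE_FILE name is made up):

    import os, pickle, random

    STATE_FILE = "random_state.pickle"   # hypothetical location

    def restore_state():
        # Pick up where the previous run's generator left off.
        if os.path.exists(STATE_FILE):
            f = open(STATE_FILE, "rb")
            random.setstate(pickle.load(f))
            f.close()

    def save_state():
        # The hard part for a crashy daemon: this has to actually run
        # at shutdown (or after every draw) to be any use.
        f = open(STATE_FILE, "wb")
        pickle.dump(random.getstate(), f)
        f.close()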

The reseeding is just a hack on top of a hack, a backstop against the
possibility that time.time() at the start of a new run would duplicate
the seed of a previous run -- I didn't want to see the process spinning
its wheels for however long it took to work through the sequence to the
old ending point.
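
Concretely, the backstop looks something like this (a sketch in terms
of the random module; the function and constant names are mine):

    import random, time

    RESEED_EVERY = 100   # reseed after this many draws

    _calls = 0

    def backstopped_random():
        # Mix fresh wall-clock entropy back in every RESEED_EVERY
        # calls, so two runs that happened to start from the same
        # time.time() seed diverge quickly instead of replaying the
        # whole old sequence.
        global _calls
        if _calls % RESEED_EVERY == 0:
            random.seed(time.time())
        _calls = _calls + 1
        return random.random()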

>But, so far, you haven't said anything that explains why you can't just
>repeatedly add 1 to a global counter!  Guaranteed unique without bothering
>to check for duplicates.  Something else must be driving this.

I don't know what other issues came up, but there's the problem that
currently at least two different processes generate these "serial
numbers".  We could have used the database to store the IDs, of course,
but then you run into the locking problem.
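
(If we'd gone the counter route, the cross-process collision could
probably have been dodged by tagging each counter with the process id.
A sketch, with made-up bit widths:

    import os

    _counter = 0

    def next_serial():
        # High bits: a per-process tag; low bits: Tim's plain
        # incrementing counter.  The low field wraps after 2**20
        # draws, and pids get recycled, so the widths here are only
        # illustrative.
        global _counter
        serial = (os.getpid() << 20) | (_counter & 0xFFFFF)
        _counter = _counter + 1
        return serial

It wouldn't have helped with the next issue, though.)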

Oh, yeah, I remember another issue.  There's another place where we
expose a different set of serial numbers to the public, and we did *not*
want to give anyone a way to figure out where they were in the sequence.
Until I created this hack, both places used the same serial-generation
code.

>> (This is a temporary hack to deal with an unfortunate choice in
>> hashing functions until we can switch to some code that uses SQL
>> Server GUIDs.  And, yes, I did find out about Python's hash()
>> function *after* I wrote this....  <sigh>)
>
>Check out the md5 module for a much stronger (& much slower -- win-win
><wink>) 128-bit hash; e.g.,

Yup.  I believe that we initially wanted to use 32-bit ints to make
accessing the database easier; part of the reason we're switching over
is that we didn't realize just how quickly we'd fill up the 32-bit ID
space.  I think I'll suggest to my coworkers that MD5 is more portable
than GUIDs, even though it'll be a bit slower.
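
For reference, the md5 usage Tim was pointing at is something like this
(my example, not the one he snipped):

    import md5   # spelled hashlib.md5 in later Pythons

    def hash128(s):
        # 128-bit digest: 16 raw bytes instead of a 32-bit int, so the
        # space fills up a lot more slowly.
        return md5.new(s).digest()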


Here's another question: I know the *period* of whrandom is very long,
but what is its granularity/precision?  I wasn't sure of the answer, so
I partitioned the 32-bit space into a thousand chunks, making two calls
to random().
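
What I did looks roughly like this (a reconstruction; CHUNKS and the
helper name are mine):

    import random

    CHUNKS = 1000
    CHUNK_SIZE = (2**32) // CHUNKS   # inexact: 1000 doesn't divide 2**32

    def rand32():
        # First call picks one of a thousand chunks, second call picks
        # an offset inside that chunk, so neither call has to supply
        # more precision than random() may actually have.
        chunk = int(random.random() * CHUNKS)
        offset = int(random.random() * CHUNK_SIZE)
        return chunk * CHUNK_SIZE + offset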
--
                      --- Aahz (@netcom.com)

Androgynous poly kinky vanilla queer het    <*>      http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6  (if you want to know, do some research)



