[ python-Bugs-917055 ] add a stronger PRNG

Thu Mar 25 03:25:51 EST 2004

Bugs item #917055, was opened at 2004-03-15 21:46
Message generated for change (Comment added) made by rhettinger
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=917055&group_id=5470

Category: Python Library
Group: Feature Request
Status: Open
Resolution: None
Priority: 5
Submitted By: paul rubin (phr)
Assigned to: Raymond Hettinger (rhettinger)
Summary: add a stronger PRNG

Initial Comment:
The default Mersenne Twister algorithm in the Random
module is very fast but makes no serious attempt to
generate output that stands up to adversarial analysis.
 Besides cryptography applications, this can be a
serious problem in areas like computer games.  Sites
like www.partypoker.com routinely run online
tournaments with prize funds of 100K USD or more. 
There&#039;s big financial incentives to find ways of
guessing your opponent&#039;s cards with better than random
chance probability.  See bug #901285 for some
discussion of possible correlations in Mersenne
Twister&#039;s output.

Teukolsky et al discuss PRNG issues at some length in
their book "Numerical Recipes".  The original edition
of Numerical Recipes had a full blown version of the
FIPS Data Encryption Standard implemented horrendously
in Fortran, as a way of making a PRNG with no easily
discoverable output correlations.  Later editions
replaced the DES routine with a more efficient one
based on similar principles.

Python already has an SHA module that makes a dandy
PRNG.  I coded a sample implementation:

http://www.nightsong.com/phr/python/sharandom.py

I&#039;d like to ask that the Python lib include something
like this as an alternative to MT.  It would be similar
to the existing whrandom module in that it&#039;s an
alternative subclass to the regular Random class.  The
existing Random module wouldn&#039;t have to be changed.

I don&#039;t propose directly including the module above,
since I think the Random API should also be extended to
allow directly requesting pseudo-random integers from
the generator subclass, rather than making them from
floating-point output.  That would allow making the
above subclass work more cleanly.  I&#039;ll make a separate
post about this, but first will have to examine the
Random module source code.

----------------------------------------------------------------------

>Comment By: Raymond Hettinger (rhettinger)
Date: 2004-03-25 03:25

Message:
Logged In: YES 
user_id=80475

1.  We don't care about non IEEE.

2.  One-way is one-way, so double hashing is unnecessary. 
Also, I've fielded bug reports from existing user apps that
re-seed very frequently (multiple times per second).

3.  The implementations reflect the results of timing
experiments which showed that the fastest intermediate
representation was hex.

4.  ['0x0'] is necessary to hanlde the case where n==0. 
int('', 16) raises a ValueError while int('0x0', 16) does not.

5.  random() and getrandbits() do not have to go through the
same intermediate steps (it's okay for one to use hex and
the other to use str) -- speed and space issues dominate. 
0x comes up enough in Python, there is little use in tossing
it away for an obscure, hypothetical micro-controller
implementation.

6.  Leaving cnt out of the s1 computation guarantees that it
will never track the updates of s0 -- any syncronization
would be a disaster.  Adding a count or some variant smacks
of desperation rather than reliance on proven hash properties.

7.  The function is already 100 times slower than MT. 
Adding locks will make the situation worse.  It is better to
simply document it as being non-threadsafe.  

Look at back at the mt/sha version.  Its code is much
cleaner, faster, and threadsafe.  It goes a long way towards
meeting your 
request and serving as an alternate generator to validate
simulation results.

If we use the sha/sha version, I'm certain that we will be
fielding bug reports on this through 2010.  It is also
sufficiently complex that it will spawn lengthy, wasteful
discussions and it will create a mine-field for future
maintainers. 

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2004-03-25 00:15

Message:
Logged In: YES 
user_id=72053

I'm not sure why the larger state space of mt/sha1 is
supposed to be an advantage, but the other points are
reasonable.  I like the new sha1/sha1 implementation except
for a few minor points (below).  I made the old version
hopefully threadsafe by adding a threading.Lock() object to
the instance and locking it on update. I generally like your
version better so maybe the lock can be added.  Of course
that slows it down even further, but in the context of an
interpreted Python program I feel that the generator is
still fast enough compared to the other stuff the program is
likely to be doing.

Re the new attachment, some minor issues:

1) The number 2.0**-160 is < 10**-50.  This is a valid IEEE
double but on some non-IEEE machines it may be a floating
underflow or even equal to zero.  I don't know if this matters.

2) Paranoia led me to hash the seed twice in the seed
operation in my version, to defeat unlikely
message-extension attacks against the hash function.  I
figured reseeding is infrequent enough that an extra hash
operation doesn't matter.

3) Storing s1 as a string of 40 hex digits in SHARandom2
means that s1+s2 is 60 characters, which means hashing it
will need two sha1 compression operations, slowing it down some.

4) the intiialization of ciphertxt to ["0x0"] instead of []
doesn't seem to do anything useful.  int('123abc', 16) is
valid without the 0x prefix.

5) random() uses hex(cnt) while getrandbits uses str(cnt)
(converting to decimal instead of hex).  I think it's better
to use hex and remove the 0x prefix from the output, which
is cleanest, and simpler to implement on some targets
(embedded microcontroller).  The same goes for the %s
conversion in jumpahead (my version also uses %s there).

6) It may be worthwhile to include cnt in both the s0 and s1
hash updates.  That guarantees the s1 hash never gets the
same input twice.

7) The variable "s1" in getrandbits (instead of self.s1) is
set but never used.

Note in my version of sharandom.py, I didn't put a thread
lock around the tuple assignment in setstate().  I'm not
sure if that's safe or not.  But it looks to me like
random.py in the CVS does the same thing, so maybe both are
unsafe.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-03-24 06:31

Message:
Logged In: YES 
user_id=80475

I attached a new version of the mt/sha1 combination.

Here are the relative merits as compared sha1/sha1 approach:
* simpiler to implement/maintain since state tracking is builtin
* larger state space (2**19937 vs 2**160)
* faster
* threadsafe

Favoring sha1/sha1:
* uses only one primitive
* easier to replace in situations where MT is not available

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2004-03-21 19:56

Message:
Logged In: YES 
user_id=72053

Two small corrections to below: 1) "in favor of an entropy"
is an editing error--the intended meaning should be obvious.
 2) I meant Bryan Mongeau, not Eric Mongeau.  Bryan's lib is
at <http://eevolved.com/cryptkit/>.  Sorry for any confusion.

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2004-03-21 19:49

Message:
Logged In: YES 
user_id=72053

I'm very much in favor of an entropy and Bram's suggested
interface of entropy(x) supplying x bytes of randomness is
fine.  Perhaps it should really live in a cryptography API
rather than in the Random API, but either way is ok.  Mark
Moraes wrote a Windows module that gets random numbers from
the Windows CAPI; I put up a copy at

     http://www.nightsong.com/phr/python/winrand.module

For Linux, Cygwin, and *BSD (including MacOS 10, I think),
just read /dev/urandom for random bytes.  However, various
other systems (I think including Solaris) don't include
anything like this.  OpenSSL has an entropy gathering daemon
that might be of some use in that situation.  There's also
the Yarrow generator (http://www.schneier.com/yarrow.html)
and Eric Mongeau(?) wrote  a pure-Python generator a while
back that tried to gather entropy from thread racing,
similar to Java's default SecureRandom class (I consider
that method to be a bit bogus in both Python and Java).

I think, though, simply supporting /dev/*random and the
Windows CAPI is a pretty good start, even if other OS's
aren't supported.  Providing that function in the Python lib
will make quite a few people happy.  A single module
integrating both methods would be great.  I don't have any
Windows dev tools so can't test any wrappers for Mark
Moraes's function but maybe one of you guys can do it.

I'm not too keen on the md5random.py patch for reasons
discussed in the c.l.py thread.  It depends too deeply on
the precise characteristics of both md5 and Mersenne
Twister.  I think for a cryptography-based generator, it's
better to stick to one cryptography-based primitive, and to
use sha instead of md5.  That also helps portability since
it means other environments (maybe including Jython) can
reproduce the PRNG stream without having to re-implement MT,
as long as they have SHA (which is a US federal standard).

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-03-21 14:50

Message:
Logged In: YES 
user_id=80475

Bram, if you have a patch, I would be happy to look at it. 
Please make it as platform independent as possible (its okay
to have it conditionally compile differently so long as the
API stays the same).  
Submit it as a separate patch and assign to me -- it doesn't
have to be orginal, you can google around to determine prior
art.

----------------------------------------------------------------------

Comment By: Bram Cohen (bram_cohen)
Date: 2004-03-21 14:34

Message:
Logged In: YES 
user_id=52561

The lack of a 'real' entropy source is the gap which can't
be fixed with an application-level bit of code. I think
there are simple hooks for this on all systems, such as
/dev/random on linux, but they aren't cross platform. A
unified API which always calls the native entropy hook would
be a very good thing.

An example of a reasonable API would be to have a module
named entropy, with a single function entropy(x) which
returns a random string of length x. This is a problem which
almost anyone writing a security-related application runs
into, and lots of time is wasted writing dubious hacks to
harvest entropy when a single simple library could magically
solve it the right way.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-03-21 12:33

Message:
Logged In: YES 
user_id=80475

Attaching my alternative.  If it fulfills your use case, let
me know and I'll apply it.

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2004-03-18 03:51

Message:
Logged In: YES 
user_id=72053

Updated version of sharandom.py is at same url.  It folds a
counter into the hash and also includes a getrandbits method.

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2004-03-18 01:27

Message:
Logged In: YES 
user_id=72053

I don't mind dropping the time() auto-seed but I think that
means eliminating any auto-seed, and requiring a
user-supplied seed.  

There is no demonstrable minimum period for the SHA-OFB and
it would be bad if there was, since it would then no longer
act like a PRF.  Note that the generator in the sample code
actually comes from two different SHA instances and thus its
expected period is about 2**160.  Anyway, adding a simple
counter (incrementing by 1 on every SHA call) to the SHA
input removes any lingering chance of a repeating sequence.
 I'll update the code to do that.  It's much less ugly than
stirring in Mersenne Twister output.

I don't have Python 2.4 yet but will look at it when I get a
chance.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-03-17 06:27

Message:
Logged In: YES 
user_id=80475

It has some potential for being added.  That being said, try
to avoid a religious fervor.

I would like to see considerably more discussion and
evolution first.

The auto-seeding with time *must* be dropped -- it is not
harmonious with the goal of creating a secure random
sequence.  It is okay for the a subclass to deviate in this way.

Also, I was soliciting references stronger than (I don't
know of any results ... It is generally considered ... ). 
If we put this in, people are going to rely on it.  The docs
*must* include references indicating the strengths and
weaknesses of the approach.  It should also concisely say
why it works (a summary proof that makes it clear how a
one-way digest function can be tranformed into a sequence
generator that is cryptographicly strong to both the left
and right with the latter being the one that is not obvious).

Not having a demonstrable minimum period is also bad. 
Nothing in the  discussion so far precludes the existence of
a bad seed that has a period of only 1 or 2.  See my
suggestion on comp.lang.python for a means of mitigating
this issue.

With respect to the randint question, be sure to look at the
current Py2.4 source for random.py.  The API is expanded to
include and an option method, getrandbits().  That in turn
feeds the other integer methods without the int to float to
int dance.

Upon further consideration, I think the export control
question is moot since we're using an existing library
function and not really expressing new art.

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2004-03-17 04:05

Message:
Logged In: YES 
user_id=72053

There are no research results that I know of that can
distinguish the output of sha1-ofb from random output in any
practical way.  To find such a distinguisher would be a
significant result.  It's a safe bet that cryptographers
have searched for distinguishers, though I don't promise to
have heard of every result that's been published.  I'll ask
on sci.crypt if anyone else has heard of such a thing, if
you want.  SHA1-HMAC is generally considered to be
indistinguishable from a PRF (pseudorandom function, i.e. a
function selected at random from the space of all functions
from arbitrary strings to 160-bit outputs; this term is from
concrete security theory).  MD5 is deprecated these days and
there's no point to using it instead of sha1 for this.

I'm not sure what happens if randint is added to the API. 
If you subclass Random and don't provide a randint method,
you inherit from the base class, which can call
self.random() to get floats to make the ints from.

US export restrictions basically don't exist any more.  In
principle, if you want to export something, you're supposed
to send an email to an address at the commerce department,
saying the name of the program and the url where it can be
obtained and a few similar things.  In practice, email to
that address is ignored, they never check anything.  I heard
the address even stopped working for a while, though they
may have fixed it since then.  See
http://www.bxa.doc.gov/Encryption/ for info.  I've emailed
notices to the address a few times and never heard back
anything.  Anyway, I don't think this should count as
cryptography; it's simply using a cryptographic hash
function as an PRNG to avoid the correlations in other
PRNG's; scientific rationale for doing that is given in the
Numerical Recipes book mentioned above.

The code that I linked uses the standard API but I wonder if
the floating point output is optimally uniform, i.e. the
N/2**56 calculation may not be exactly the right thing for
an ieee float64.

Using the time of day is what the Random docs say to do by
default.  You're correct that any security application needs
to supply a higher entropy seed.

I would like it very much if the std lib included a module
that read some random bytes from the OS for OS's that
support it.  That means reading /dev/urandom on recent
Un*x-ish systems or Cygwin, or calling CryptGenRandom on
Windows.  Reading /dev/urandom is trivial, and there's a guy
on the pycrypt list who wrote a Windows function to call
CryptGenRandom and return the output through the Python API.
 I forwarded the function to Guido with the author's
permission but nothing seemed to happen with it.  However,
this gets away from this sharandom subclass.  I'd like to
make a few more improvements to the module but after that I
think it can be dropped into the lib.  Let me know what you
think.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-03-16 06:52

Message:
Logged In: YES 
user_id=80475

One other thought:  if cryptographic strength is a goal, then 
seeding absolutely should require a long seed (key) as an 
input and the time should *never* be used.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-03-16 06:49

Message:
Logged In: YES 
user_id=80475

It would have been ideal if the Random API had been 
designed with an integer generator at the core and floating 
point as a computed value, but that it the way it has been for 
a long time and efforts to switch it over would likely result in 
either incompatibility with existing subclasses or introducing 
new complexity (making it even harder to subclass).  I think 
the API should be left alone until Py3.0.

The attached module would make a good recipe on ASPN 
where improvements and critiques can be posted.  Do you 
have links to research showing that running SHA-1 in a 
cipher block feedback mode results in a cryptographically 
strong random number generator -- the result seems likely, 
but a research link would be great.  Is there a link to 
research showing the RNG properties of the resulting 
generator (period, equidistribution, passing tests for 
randomness, etc)?  Also, is there research showing the 
relative merits of this approach vs MD5, AES, or DES?

If something like this gets added to the library, I prefer it to 
be added to random.py using the existing API.  Adding yet 
another random module would likely do more harm than 
good.

One other question (I don't know the answer to this):  would 
including a cryptographically strong RNG trigger US export 
restrictions on the python distribution?

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=917055&group_id=5470