Generating a unique identifier

Fri Sep 7 22:17:44 EDT 2007

Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
> > unique_id = itertools.count(1234567890)
> 
> Sweet!
> 
> I really must make itertools second-nature. I always forget it.

Beware that count() output (like enumerate) is always <= sys.maxint.
It raises an exception if it overflows.  IMO this is a bug.

> > def unique_id():
> >    return os.urandom(10).encode('hex')
> 
> Any time I see something using a random number to generate IDs, I worry 
> about collisions. Am I being paranoid? (But even paranoids write code 
> with bugs...)

Well, the idea is to make the string long enough (I shouldn't have
picked 10 bytes) to make the probability of collisions acceptably low.
The probability is about exp(-k**2/(2*N)) where k is the number of
id's you use and N is the number of possible ID's.  So with
os.urandom(20), if you use 1 billion (= approx 2**30) id's, 
the probability is about exp(-(2**60 / 2*2**160)) = 1/exp(2**101)
which is extremely small.  Using random strings is probably safer
from collisions than maintain keep distinct initial counters
across multiple runs of a program, on multiple computers, etc.

The classic example is ethernet cards.  Each one is supposed to have a
unique hardware 48-bit MAC address, with routers etc. relying on the
uniqueness.  Some organization is supposed to assign a unique code to
each manufacturer, and then the manufacturer uses that code as a
prefix and assigns addresses in sequence within that space.  That
works fine until sales go up, manufacturers start opening multiple
factories operating out of the same space, low-rent manufacturers
start making up their own codes, etc.  So companies that buy large
numbers of cards often find multiple cards with the same address,
causing various hassles.  If they just assigned 128-bit MAC addresses
at random it's very unlikely that there would ever be a collision.

> Here's something which is a little less predictable than a straight 
> counter:

It's still very easy to generate valid id's from that, or guess the
next one with non-negligible probability.  IMO it's almost never worth
messing with a "little less predictable".  If you're trying to prevent
an actual opponent from guessing something, use full-strength
cryptography.

You could try something like this:

    import sha, os, itertools

    radix = 2**32   # keep this == 2**(some even number <= 80)
    secret_key = os.urandom(20)

    def prp(key, n):
       # pseudo-random permutation (8-round Feistel network)
       # transform n to a unique number < radix**2
       assert 0 <= n < radix**2

       def F(i,x):
         return int(sha.new('%s,%x,%x'%(key,i,x)).hexdigest(), 16) % radix

       a,b = divmod(n, radix)
       for i in xrange(8):
         a ^= F(i,b)
         a,b = b,a   
       return radix*a + b

    unique_id = (prp(secret_key, i) for i in itertools.count())

It should be pretty secure and (unless I made an error) is a 
permutation from [0:radix**2] to [0:radix**2], similar to DES.
It's invertible if you know the secret key (I'll leave that as an
exercise).  8 rounds is probably overkill for radix 2**32.