Generate unique ID for URL

Roy Smith roy at panix.com
Tue Nov 13 20:39:04 EST 2012


In article <0692e6a2-343c-4eb0-be57-fe5c815efb99 at googlegroups.com>,
 Richard <richardbp at gmail.com> wrote:

> Hello,
> 
> I want to create a URL-safe unique ID for URLs.
> Currently I use:
> url_id = base64.urlsafe_b64encode(url)
> 
> >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html')
> 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s'
> 
> I would prefer more concise IDs. 
> What do you recommend?  Compression?

If you're generating random id strings, there are only two ways to make 
them shorter.  Either encode fewer bits of information, or encode them 
more compactly.

Let's start with the second one.  You're already using base64, so you're 
getting 6 bits per character.  You can do a little better than that, but 
not much.  The URL-safe set is the 94 printable ASCII characters, minus 
the 15 or so that are reserved or otherwise unsafe in a URL, leaving you 
around 79.  That gets you up to about 6.3 bits per character, but that's 
it.  For the complexity this would add, it's probably not worth it.
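
For what it's worth, here's roughly what squeezing out those extra 
fractional bits would look like (a sketch only; the 79-character 
alphabet is my reading of RFC 3986's path-segment characters, and 
encode_base_n is a made-up helper):

import math

# Characters legal in a URL path segment per RFC 3986:
# 66 unreserved + 11 sub-delims + ':' and '@' = 79 total.
ALPHABET = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            'abcdefghijklmnopqrstuvwxyz'
            '0123456789'
            "-._~!$&'()*+,;=:@")

def encode_base_n(data):
    # Treat the bytes as one big integer, then write it out in
    # base 79.  Leading zero bytes are dropped, which is fine
    # for a sketch like this.
    n = 0
    for byte in bytearray(data):  # works on Python 2 str and Python 3 bytes
        n = n * 256 + byte
    chars = []
    while n:
        n, rem = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[rem])
    return ''.join(reversed(chars))

print(math.log(len(ALPHABET), 2))  # ~6.30 bits/char, vs 6.0 for base64

On a 16-byte MD5 digest that's 21 characters instead of base64's 22 
(24 with padding), which is why it's probably not worth the bother.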

The next step is to reduce the number of bits you are encoding.  You 
said in another post that "1 collision in 10 million hashes would be 
tolerable".  So, naively, you need:

>>> import math
>>> math.log(10*1000*1000, 2)
23.25349666421154

24 bits worth of key.  Base64 encoded, that's only 4 characters.  
Except that's not how collision probabilities work: by the birthday 
problem, collisions become likely once you've generated roughly the 
square root of the keyspace, so you need about twice that many bits.  
Something like 48 or 64 bits is the right ballpark.  Certainly not the 
264 bits (44 characters times 6 bits) you're encoding with your example 
above.
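
To put actual numbers on that, here's the standard birthday-problem 
approximation (an illustrative sketch; collision_probability is my own 
made-up helper):

import math

def collision_probability(n_items, bits):
    # Birthday-problem approximation: chance of at least one collision
    # among n_items values drawn uniformly at random from 2**bits.
    return 1 - math.exp(-float(n_items) * (n_items - 1) / (2 * 2.0 ** bits))

for bits in (24, 32, 48, 64):
    print(bits, collision_probability(10 * 1000 * 1000, bits))
# 24 or 32 bits: a collision is a near-certainty.
# 48 bits: about a 16% chance of at least one collision.
# 64 bits: about 3 in a million.

So 48 bits is fine if a stray collision really is tolerable, and 64 
bits if it isn't.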

So, something like:

import base64
import hashlib  # hashlib replaces the long-deprecated md5 module

digest = hashlib.md5(b'docs.python.org/library/uuid.html').digest()
hash64 = base64.urlsafe_b64encode(digest)
url_id = hash64[:8]  # 8 chars = 48 bits; or 12 for 72, or whatever
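
And a quick empirical sanity check at 48 bits (the million synthetic 
URLs here are invented purely for the test):

import base64
import hashlib

seen = set()
collisions = 0
for i in range(1000 * 1000):
    url = 'example.com/page/%d' % i  # synthetic test URL
    digest = hashlib.md5(url.encode('ascii')).digest()
    url_id = base64.urlsafe_b64encode(digest)[:8]  # 8 chars = 48 bits
    if url_id in seen:
        collisions += 1
    seen.add(url_id)
print(collisions)  # expected count is ~0.002, so almost certainly 0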

But, I still don't really understand your use case.  You've already 
mentioned the following requirements:

"just be used internally for quick lookups, not exposed publicly"
"URL-safe"
"unique"
"1 collision in 10 million hashes would be tolerable"
"one way encoding would be fine"
"performed millions of times so ideally efficient"

but haven't really explained what it is that you're trying to do.

If they're not going to be exposed publicly, why do you care if they're 
URL-safe?

What's wrong with just using the URLs directly as dictionary keys and 
not worrying about it until you've got some hard data showing that this 
is not sufficient?
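
Put another way: Python's dict already hashes its keys internally, so 
the no-hashing-layer version is simply (a trivial sketch):

# Let dict do the hashing; no intermediate ID needed.
lookups = {}
lookups['docs.python.org/library/uuid.html'] = 'whatever you need to store'
print(lookups['docs.python.org/library/uuid.html'])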


