Generating valid identifiers

Ian Kelly ian.g.kelly at gmail.com
Thu Jul 26 16:00:58 EDT 2012


On Thu, Jul 26, 2012 at 1:28 PM, Ian Kelly <ian.g.kelly at gmail.com> wrote:
> The odds of a given pair of identifiers having the same digest to 10
> hex digits are 1 in 16^10, or approximately 1 in a trillion.  If you
> bought one lottery ticket a day at those odds, you would win
> approximately once every 3 billion years.  But it's not enough just to
> have a hash collision, they also have to match exactly in the first 21
> (or 30, or whatever) characters of their actual names, and they have
> to both be long enough to invoke the truncating scheme in the first
> place.
>
> The Oracle backend for Django uses this same approach with an MD5 sum
> to ensure that identifiers will be no more than 30 characters long (a
> hard limit imposed by Oracle).  It actually truncates the hash to 4
> digits, though, not 10.  This hasn't caused any problems that I'm
> aware of.

As a side note, the question of how likely it is to have at least one
hash collision among multiple tables is known as the birthday problem.
At 4 hex digits there are 65536 possible digests, and it turns out that
at 302 tables there is a >50% chance that at least one pair of those
names has the same 4-digit digest.  That doesn't mean you should be
concerned if you
have 302 tables in your Django Oracle database, though, because those
colliding tables also have to match completely in the first 26
characters of their generated names, which is not that common.  If a
collision ever did occur, the resolution would be simple: manually set
the name of one of the offending tables in the model definition.
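
To make the "first 26 characters" point concrete, here is a rough
sketch of the kind of truncation scheme under discussion.  The function
name truncate_name and its parameters are made up for illustration;
this is not Django's actual code, only an approximation of the idea
(keep the head of the name, replace the tail with a few hex digits of
an MD5 sum).

```python
import hashlib


def truncate_name(name, max_length=30, hash_length=4):
    """Hypothetical helper: shorten `name` to at most `max_length` chars.

    Names that already fit are returned unchanged.  Longer names keep
    their first (max_length - hash_length) characters and get the first
    `hash_length` hex digits of the MD5 digest of the full name appended.
    """
    if len(name) <= max_length:
        return name
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()[:hash_length]
    return name[:max_length - hash_length] + digest
```

Two long names only end up identical under this sketch if they agree in
the first 26 characters and also in the 4-digit digest, which is why
the raw birthday figure overstates the practical risk.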

With 16 ** 10 possible digests, the probability of collision hits 50%
at 1234605 tables.
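
For anyone who wants to check these figures, here is a minimal sketch.
The function names are my own invention.  The first function multiplies
out the exact probability, assuming digests are uniformly distributed;
the second uses the standard square-root approximation for the 50%
point, which may not be exactly how the numbers above were obtained but
agrees with them.

```python
import math


def collision_probability(n, digests):
    """Chance that at least two of n uniformly random digests coincide."""
    p_unique = 1.0
    for i in range(n):
        # The (i+1)-th name must avoid the i digests already taken.
        p_unique *= (digests - i) / digests
    return 1.0 - p_unique


def tables_for_even_odds(digests):
    """Approximate smallest n at which the collision chance reaches 50%.

    Solves n * (n - 1) / (2 * digests) = ln 2 for n and rounds up.
    """
    return math.ceil(0.5 + math.sqrt(0.25 + 2 * digests * math.log(2)))


if __name__ == "__main__":
    for hex_digits in (4, 10):
        digests = 16 ** hex_digits
        print(hex_digits, "hex digits:", tables_for_even_odds(digests))
```

For 4 hex digits the exact product can be checked directly on either
side of the threshold with collision_probability(301, 16 ** 4) and
collision_probability(302, 16 ** 4).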


