hash()

Tue Dec 6 10:44:07 EST 2005

Tim Peters wrote:
> [John Marshall]

> Second, what are your assumptions about (a) the universe of strings;
> and, (b) the hash function?

My assumptions are:
(a) valid and "reasonable" pathnames (e.g., 1024
     characters long)
(b) just the builtin hash().

The goal is to be able to keep some meta-data
for each file/directory of a directory hierarchy
in a separate directory: one meta-data file per
file/directory. I am interested in getting an
idea of how often a collision will occur and
require a second level of checking.

 From an article I read, the Reiser-fs keeps acl
information in a truly hidden directory and
uses the inode number of the corresponding
fs object to maintain correspondence between
the two. Of course, you get unique numbers
this way. But in my case, I couldn't necessarily
guarantee this because that would mean tracking
the file and inode number in a way that is closer
to the OS than I am.

> Assuming a finite universe of strings (also finite ;-)), and a hash
> function that returns each of its H possible results "at random"
> (meaning that there's no algorithmic way to predict any bit of the
> hash output short of running the hash function), then the probability
> that two distinct strings have the same hash is 1/H.  It doesn't
> matter to this outcome whether one input is the reversal of the other.
> 
> 
>>My goal is to uniquely identify multicharacter strings,
> 
> 
> Unclear what that means.  Obviously, if your string universe contains
> more than H strings, it's impossible for any hash function with H
> possible values to return a different hash value each input.

Right. I should've/could've looked into the hash
function used. I suppose my question was how
well does the hash() distribute to avoid collisions?
I.e., are some hash values more likely to come up
than others?

>>all of which begin with "/" and never end with "/".
>>Therefore, st != st[::-1].
> 
> 
> Is it simply
> because this condition eliminates palindromes from your input
> universe?

That is a benefit but it is more a matter of the
string input to hash() always being an absolute
pathname. Also, not knowing how the hash() works,
I was wondering if such a characteristic actually
improved the results at all.

> Anyway, to be concrete, using CPython's hash function on a 32-bit box,
> H = 2**32-1.  Call a string `s` bad iff:
> 
>     s[0] == "/" and s[-1] != "/" and hash(s) == hash("".join(reversed(s)))
> 
> Then there are no bad strings of length 1, 2, 3, or 4.  There are 4
> bad strings of length 5:
> 
> '/\xde&\xf6C'
> '/\xca\x0e\xfaC'
> '/\xc4\x06\xfcC'
> '/\xad\xd6\x01\xd6'
> 
> I didn't think about this -- I just wrote a little program to try all
> ~= 4 billion such strings.  So if your universe is the set of all
> 5-character 8-bit strings that start with '/' but don't end with '/',
> and you pick inputs uniformly at random from that universe, the chance
> of a hash match between a string and its reversal is
> 
>     4 / (256**3 * 255)
> 
> or a little less than 1 in a billion.  For a truly random hash
> function with a 32-bit output, the chance would be a little less than
> 1 in 4 billion.
> 
> It would be mildly surprising if those odds got worse as string length
> increased.  The md5 and sha hashes have much larger H, and were
> designed for (among other things) good collision resistance.

Your answer was what I was hoping to find out.

Thanks a lot to you and others.

John