question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

Fri Aug 7 15:15:56 EDT 2009

László Sándor wrote:
> Thank you, Tim. My comments are below.
> 
> On 2009-08-07 13:19:47 -0400, Tim Chase <python.list at tim.thechases.com> 
> said:
> 
>>> After I have written a short Python script that hashes my textfile 
>>> line by
>>> line and collects the numbers next to the original, I checked what I 
>>> got.
>>> Instead of getting around 25% in each treatment, the range is 
>>> 17.8%-31.3%.
>>
>>
>> That sounds suspiciously like 25% with a +/- 7% fluctuation one might 
>> expect to see from non-random source data.
> 
> 
> Could you help me where this range comes from? (If all my lines were 
> "42", I wouldn't hit this range. So it cannot be a rule. Right?)
> 
>>
>> Remember that your outputs are driven purely by your inputs in a 
>> deterministic fashion -- if your inputs are purely random, then your 
>> outputs should more closely match your expected bin'ing.  If your 
>> inputs aren't random, you get a taste of your own medicine ("my file 
>> has just the number 42 on every line...why isn't my output random?").  
>> And randomness-of-hash-output is a red herring since hashing is *not* 
>> random.
> 
> 
> Thanks, I tried to be correct with "pseudo-random". I understand that 
> everything is dependent on input. I want it to be the case. However, I 
> hoped that good hashes produce random-looking output from input with 
> very little variation. It would be strange if I could not get more than 
> 18% of lines with a remainder of 3 (after division by 4), whatever hash 
> I try just because these are names of people.
> 
>>
>> Your input is also finite -- an aspect which leaves you a far cry from 
>> the full hash-space.  If an md5 has 32 bytes (256 bits) of data, your 
>> input would have to cover 2**256 possible inputs to see the full 
>> profile of your outputs.  That's a lot of input :)
>>
>> -tkc
> 
> 
> OK, I understand. Could anyone suggest a better way to do this, then?
> 
> (Recap: random-looking, close-to uniform assignment of one number out of 
> four possibilities to strings.)
> 
> Thanks, everyone.
> 
> Laszlo
> 

Well,  a very simplistic method is to use the length of the input string 
modulus four.  If the names have decently "random" lengths, that may work.

~Ethan~