question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

Fri Aug 7 13:19:47 EDT 2009

> After I have written a short Python script that hashes my textfile line by
> line and collects the numbers next to the original, I checked what I got.
> Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.

That sounds suspiciously like 25% with a +/- 7% fluctuation one 
might expect to see from non-random source data.

Remember that your outputs are driven purely by your inputs in a 
deterministic fashion -- if your inputs are purely random, then 
your outputs should more closely match your expected binning.  If 
your inputs aren't random, you get a taste of your own medicine 
("my file has just the number 42 on every line...why isn't my 
output random?").  And randomness-of-hash-output is a red herring 
since hashing is *not* random.
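
A minimal sketch of the point, not your actual script: I'm assuming 
Python 3, md5 via hashlib, and that a "treatment" is the first digest 
byte modulo 4. The name bin_lines and both sample inputs are made up 
for illustration.

```python
import hashlib
from collections import Counter

def bin_lines(lines, bins=4):
    """Count how many lines fall in each bin, using the first md5 byte."""
    counts = Counter()
    for line in lines:
        digest = hashlib.md5(line.encode("utf-8")).digest()
        # digest[0] is an int in range(256); 256 divides evenly by 4
        counts[digest[0] % bins] += 1
    return counts

# Hypothetical inputs: the same value on every line vs. distinct lines
constant = ["42"] * 1000
distinct = [str(n) for n in range(1000)]

print(bin_lines(constant))  # every line lands in one bin
print(bin_lines(distinct))  # spread across bins, not exactly 250 each
```

Same input always gives the same bin, so repeated lines pile up in 
one place, and even all-distinct lines only come out approximately 
even.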

Your input is also finite -- an aspect which leaves you a far cry 
from the full hash-space.  An md5 has 16 bytes (128 bits) of 
data, so your input would have to cover 2**128 possible inputs to 
see the full profile of your outputs.  That's a lot of input :)
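
If you want to check the size yourself, hashlib reports it. A quick 
sketch, assuming Python 3; note that the 32 often quoted for md5 is 
the length of the hex string, not the number of bytes.

```python
import hashlib

digest_bytes = hashlib.md5().digest_size
hex_chars = len(hashlib.md5(b"42").hexdigest())

print(digest_bytes, "bytes =", digest_bytes * 8, "bits")
print(hex_chars, "hex characters")
print(2 ** (digest_bytes * 8), "possible digests")
```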

-tkc