unicode and hashlib

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Sat Nov 29 11:29:01 EST 2008


On Sat, 29 Nov 2008 06:51:33 -0800, Jeff H wrote:

> Actually, what I am surprised by is the fact that hashlib cares at all
> about the encoding.  An MD5 hash can be produced for an .iso file, which
> means it can handle bytes, so why does it care what it is being fed, as
> long as there are bytes?

But you don't have bytes, you have a `unicode` object.  The internal byte
representation is implementation-specific and not your business.

>  I would have assumed that it would take
> whatever was fed to it, view it as a byte array, and then hash it.

How?  There is no (sane) way to get at the internal byte representation.  
And that byte representation might contain things like pointers to memory 
locations that are different for two `unicode` objects which compare 
equal, so you would get different hash values for objects that otherwise 
look the same from the Python level.  Not very useful.
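
To make that concrete (a small sketch, not from the original exchange; the
encodings chosen are just illustrative): the only bytes you can get at from
Python code come from an explicit encoding, and the digest depends on which
encoding you pick.

    import hashlib

    text = u'Ol\xe9'  # one unicode object, hashed via two different encodings
    print(hashlib.md5(text.encode('utf-8')).hexdigest())
    print(hashlib.md5(text.encode('utf-16-le')).hexdigest())
    # The two digests differ, because the encoded byte sequences differ.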

> You can read a binary file and hash it
>   print md5.new(file('foo.iso').read()).hexdigest()
> What do I need to do to tell hashlib not to try and decode, just treat
> the data as binary?

It's not about *de*coding, it is about *en*coding your `unicode` object 
so you get bytes to feed to the MD5 algorithm.
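
Something along these lines should do it (a sketch; UTF-8 here is just one
reasonable choice of encoding, pick whatever matches your data):

    import hashlib

    data = u'whatever text you want to hash'
    digest = hashlib.md5(data.encode('utf-8')).hexdigest()
    print(digest)

As long as you always encode with the same codec, equal strings give you
reproducible digests.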

Ciao,
	Marc 'BlackJack' Rintsch


