unicode and hashlib

Mon Dec 1 08:53:51 EST 2008

Jeff H wrote:
> [...] So once I have character strings transformed
> internally to unicode objects, I should encode them in 'utf-8' before
> attempting to do things that guess at the proper way to encode them
> for further processing.(i.e. hashlib)

It looks like hashlib in Python 3 will not even attempt to digest a 
unicode (str) object. Trying to hash 'abcdefg' in Python 3.0rc3 I get:

   TypeError: object supporting the buffer API required

I think that's good behavior, except that the error message is likely to 
send beginners to look up the obscure buffer interface before they find 
they just need mystring.encode('utf8') or bytes(mystring, 'utf8').
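
For what it's worth, here is a minimal sketch of the failure and the fix 
under Python 3. The variable names are mine, and the exact wording of the 
TypeError varies between Python versions, so I only catch the exception 
type:

```python
import hashlib

text = 'abcdefg'

# Passing a str straight to hashlib raises TypeError in Python 3,
# because the hash functions operate on bytes, not on characters.
try:
    hashlib.md5(text)
    raised = False
except TypeError:
    raised = True

# Either spelling turns the str into UTF-8 bytes first.
digest_a = hashlib.md5(text.encode('utf8')).hexdigest()
digest_b = hashlib.md5(bytes(text, 'utf8')).hexdigest()

print(raised, digest_a == digest_b)
```

Both spellings produce the same bytes, so the two digests agree.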

> >>> a='André'
> >>> b=unicode(a,'cp1252')
> >>> b
> u'Andr\xc3\xa9'
> >>> hashlib.md5(b.encode('utf-8')).hexdigest()
> 'b4e5418a36bc4badfc47deb657a2b50c'
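
As a rough illustration of why the choice of codec matters here (a 
Python 3 sketch, with names of my own choosing): the same text encodes 
to different bytes under different codecs, so the digests differ, and 
decoding UTF-8 bytes as cp1252 gives the kind of mangled string seen in 
the session above.

```python
import hashlib

name = 'André'

# The same characters become different byte sequences per codec.
utf8_bytes = name.encode('utf-8')      # 'é' becomes two bytes, C3 A9
cp1252_bytes = name.encode('cp1252')   # 'é' becomes one byte, E9

utf8_digest = hashlib.md5(utf8_bytes).hexdigest()
cp1252_digest = hashlib.md5(cp1252_bytes).hexdigest()

# Decoding UTF-8 bytes with the wrong codec yields mojibake.
mangled = utf8_bytes.decode('cp1252')

print(utf8_digest != cp1252_digest, repr(mangled))
```

So whatever hashes the string and whatever verifies the hash need to 
agree on one encoding.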

Incidentally, MD5 has fallen to practical collision attacks and SHA-1 is 
falling. Python's hashlib also includes the stronger SHA-2 family 
(sha224, sha256, sha384, sha512).
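
Switching is a one-word change. A small sketch, assuming Python 3 and 
UTF-8 as the agreed encoding:

```python
import hashlib

data = 'André'.encode('utf-8')

# The SHA-2 constructors take bytes just as md5() does.
sha256_hex = hashlib.sha256(data).hexdigest()
sha512_hex = hashlib.sha512(data).hexdigest()

print(len(sha256_hex), len(sha512_hex))
```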

-- 
--Bryan