unicode and hashlib

Scott David Daniels Scott.Daniels at Acm.Org
Mon Dec 1 19:35:44 EST 2008


Bryan Olson wrote:
> ... I think that's good behavior, except that the error message is likely
> to send beginners to look up the obscure buffer interface before they find 
> they just need mystring.decode('utf8') or bytes(mystring, 'utf8').
Oops, careful here (I made this mistake once in this thread as well). 
You _encode_ from unicode to bytes (and _decode_ from bytes to unicode).
The code you quoted doesn't run.  This does (the encode calls work, the
decode calls fail):

 >>> a = 'Andr\xe9'
 >>> b = unicode(a, 'cp1252')
 >>> b.encode('utf-8')
'Andr\xc3\xa9'
 >>> b.decode('utf-8')

Traceback (most recent call last):
   File "<pyshell#19>", line 1, in <module>
     b.decode('utf-8')
   File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)

 >>> import hashlib
 >>> hashlib.md5(b.encode('utf-8')).hexdigest()
'45f1deffb45a5f6c2380a4cee9b3e452'

 >>> hashlib.md5(b.decode('utf-8')).hexdigest()

Traceback (most recent call last):
   File "<pyshell#21>", line 1, in <module>
     hashlib.md5(b.decode('utf-8')).hexdigest()
   File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)
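
For comparison, here is a rough sketch of the same point in Python 3
terms. It assumes a Python 3 interpreter, where str is unicode text and
hashlib accepts only bytes. The helper name md5_hex is made up for
illustration and is not part of hashlib.

```python
import hashlib


def md5_hex(text, encoding='utf-8'):
    """Encode unicode text to bytes, then return the MD5 hex digest.

    md5_hex is a hypothetical helper for this example only.
    """
    return hashlib.md5(text.encode(encoding)).hexdigest()


name = 'Andr\xe9'              # unicode text (str in Python 3)
data = name.encode('utf-8')    # encode: text -> bytes
round_trip = data.decode('utf-8')  # decode: bytes -> text

# Passing the text itself to hashlib raises TypeError in Python 3,
# because hashing is defined on bytes, not on unicode text.
try:
    hashlib.md5(name)
    rejected_text = False
except TypeError:
    rejected_text = True

digest = md5_hex(name)
```

The direction is the thing to remember: encode before hashing, decode
only when turning bytes back into text.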


> Incidentally, MD5 has fallen and SHA-1 is falling. Python's hashlib also 
> includes the stronger SHA-2 family.

Well, the choice of hash always depends on the app.
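
If the app does call for one of the SHA-2 hashes, the change on the
hashlib side is small. A minimal sketch, again assuming Python 3; the
helper name digest_hex and its defaults are invented for this example:

```python
import hashlib


def digest_hex(text, algorithm='sha256', encoding='utf-8'):
    """Hash unicode text with a named hashlib algorithm.

    digest_hex is a hypothetical helper; hashlib.new() looks up the
    hash constructor by name, so the algorithm can be picked per app.
    """
    h = hashlib.new(algorithm)
    h.update(text.encode(encoding))  # text must be encoded to bytes first
    return h.hexdigest()


sha256_digest = digest_hex('Andr\xe9')
md5_digest = digest_hex('Andr\xe9', algorithm='md5')
```

Taking the algorithm as a parameter keeps the encode step in one place
while leaving the choice of hash to the caller.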


-Scott


