unicode and hashlib

Sat Nov 29 21:54:10 EST 2008

On Nov 29, 12:23 pm, Scott David Daniels <Scott.Dani... at Acm.Org>
wrote:
> Scott David Daniels wrote:
>
> ...
>
> > If you now, and for all time, decide that the only source you will take
> > is cp1252, perhaps you should decode to cp1252 before hashing.
>
> Of course my dyslexia sticks out here as I get encode and decode exactly
> backwards -- Marc 'BlackJack' Rintsch has it right.
>
> Characters (a concept) are "encoded" to a byte format (representation).
> Bytes (a precise representation) are "decoded" to characters (a format
> with semantics).
>
> --Scott David Daniels
> Scott.Dani... at Acm.Org

Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that hashlib was trying to encode (not decode) my unicode object
as 'ascii' (my default encoding), and since the object contained
characters above 127 - shhh'boom.  So once I have character strings
transformed internally to unicode objects, I should encode them
explicitly as 'utf-8' before handing them to anything that would
otherwise guess at how to encode them (e.g. hashlib).
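
For what it's worth, the failure can be reproduced without hashlib in
the picture at all. A minimal sketch: the helper name md5_of_text is
made up for illustration, and the u''/b'' literals assume Python 2.6+
or 3.3+.

```python
import hashlib

def md5_of_text(text, encoding='utf-8'):
    """Hash text by explicitly encoding it first (hypothetical helper)."""
    return hashlib.md5(text.encode(encoding)).hexdigest()

name = u'Andr\xe9'  # 'André' as a unicode object

# Encoding as ASCII fails because U+00E9 is outside the 0-127 range.
try:
    name.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Explicit utf-8 means nothing downstream has to guess an encoding.
digest = md5_of_text(name)
```

Choosing the encoding at the call site is the whole point: the same
unicode object always yields the same bytes, and so the same digest.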

>>> import hashlib
>>> a='André'
>>> b=unicode(a,'cp1252')
>>> b
u'Andr\xc3\xa9'
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'b4e5418a36bc4badfc47deb657a2b50c'
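
One thing the repr above hints at: u'Andr\xc3\xa9' has two characters
after 'Andr' (Ã and ©), whereas a correctly decoded 'André' would show
as u'Andr\xe9'. That usually means the bytes typed at the prompt were
utf-8 rather than cp1252. A sketch of the difference, assuming a utf-8
terminal (an assumption, not something the session itself confirms):

```python
# Assumed: the bytes a utf-8 terminal sends for 'André'.
raw = b'Andr\xc3\xa9'

# Wrong codec: each of the two trailing bytes becomes its own character.
as_cp1252 = raw.decode('cp1252')

# Right codec: the two trailing bytes become the single character U+00E9.
as_utf8 = raw.decode('utf-8')

# The mis-decoded string is one character longer than the correct one.
lengths = (len(as_cp1252), len(as_utf8))
```

If that assumption holds, the digest above is for the mis-decoded
text, and decoding with 'utf-8' instead would give a different one.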

Scott then points out that utf-8 is probably superior (for use within
the code I control) to utf-16 and utf-32, which each come in two
byte-order variants (big-endian and little-endian), and which one you
get by default can depend on installed software and/or processors.
utf-8, unlike -16/-32, has no byte-order variants, so it stays reliable
and reproducible irrespective of software or hardware.
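
A small sketch of why that matters for hashing, using Python's
standard codec names ('utf-16-le', 'utf-16-be', and plain 'utf-16',
which prepends a byte-order mark and uses the machine's native order):

```python
import hashlib

text = u'Andr\xe9'

utf8 = text.encode('utf-8')          # one possible byte sequence
utf16_le = text.encode('utf-16-le')  # little-endian, no BOM
utf16_be = text.encode('utf-16-be')  # big-endian, no BOM
utf16_native = text.encode('utf-16') # BOM + whatever order this machine uses

# Same characters, different bytes, therefore different digests.
digest_le = hashlib.md5(utf16_le).hexdigest()
digest_be = hashlib.md5(utf16_be).hexdigest()
variants_differ = digest_le != digest_be
```

Naming the byte order explicitly would also make utf-16 reproducible,
but with utf-8 there is simply nothing to get wrong.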

decode vs encode
You decode from a character set (bytes) to a unicode object
You encode from a unicode object to a specified character set (bytes)
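
As a round trip, that looks roughly like this (a sketch; the b''/u''
literals assume Python 2.6+ or 3.3+, and the byte value assumes the
source really is cp1252):

```python
raw = b'Andr\xe9'             # 'André' as cp1252 bytes (assumed source encoding)

text = raw.decode('cp1252')   # decode: bytes -> unicode object
utf8 = text.encode('utf-8')   # encode: unicode object -> utf-8 bytes
back = text.encode('cp1252')  # encode: unicode object -> cp1252 bytes again
```

Decoding and then re-encoding with the same codec returns the original
bytes; encoding with a different codec gives a different byte string
for the same characters.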

Please correct me if you see something wrong and thank you for your
advice and direction.

u'unicordial-ly yours. ;)'
Jeff