unicode and hashlib

Scott David Daniels Scott.Daniels at Acm.Org
Sat Nov 29 12:27:40 EST 2008


Jeff H wrote:
> ...
> Actually, what I am surprised by, is the fact that hashlib cares at
> all about the encoding.  A md5 hash can be produced for an .iso file
> which means it can handle bytes, why does it care what it is being
> fed, as long as there are bytes.  I would have assumed that it would
> take whatever was feed to it and view it as a byte array and then hash
> it.  You can read a binary file and hash it
>   print md5.new(file('foo.iso').read()).hexdigest()
> What do I need to do to tell hashlib not to try and decode, just treat
> the data as binary?

If you do not care about portability or reproducibility, you can just go
with the bytes you get to most easily.

To take your example:
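     import hashlib
     with open('foo.iso', 'r') as src:    # text mode
         print hashlib.md5(src.read()).hexdigest()

will print different things on Linux and Windows, because text mode on
Windows translates line endings as the file is read.

     with open('foo.iso', 'rb') as src:   # binary mode: raw bytes
         print hashlib.md5(src.read()).hexdigest()

should print the same thing on both; hashing does not magically allow
you to stop thinking.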
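
For a file as large as an .iso you probably do not want to read it all in
one gulp.  Here is a minimal sketch of the binary-mode approach, written
for Python 3; the name file_md5 and the chunk size are my own choices for
illustration, not anything hashlib prescribes:

```python
import hashlib

def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file's raw bytes (hypothetical helper)."""
    digest = hashlib.md5()
    # 'rb' turns off newline translation, so every OS sees the same bytes.
    with open(path, 'rb') as src:
        # Feed the hash in fixed-size chunks so a large file need not fit in memory.
        for chunk in iter(lambda: src.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5('foo.iso') should give the same digest as hashing the
whole file's bytes at once.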

If you decide, now and for all time, that the only source you will take
is cp1252, perhaps you should encode to cp1252 before hashing.
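
To see why the choice of encoding has to be pinned down, here is a small
sketch (Python 3, where str is Unicode text) that hashes the same text
under two codecs.  The helper name text_md5 and the sample word are
assumptions made for the example:

```python
import hashlib

def text_md5(text, encoding):
    """Encode text with an explicit codec, then hash the bytes (hypothetical helper)."""
    return hashlib.md5(text.encode(encoding)).hexdigest()

word = 'caf\u00e9'  # "cafe" with an acute accent on the e

# The same text becomes different bytes under each codec:
#   cp1252 -> b'caf\xe9'      utf-8 -> b'caf\xc3\xa9'
cp1252_digest = text_md5(word, 'cp1252')
utf8_digest = text_md5(word, 'utf-8')

# Different bytes in, different digest out.
print(cp1252_digest != utf8_digest)
```

Whichever codec you pick, everyone who needs to reproduce the hash has
to pick the same one.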

Even if you have Unicode, you can have alternative Unicode expressions
of the same "characters," so you may want to convert the Unicode to a
"Normalized Form" of Unicode before encoding to bytes.  The major
candidates for that are NFC, NFD, NFKC, and NFKD; see:
     http://unicode.org/reports/tr15/
Again, once you have chosen your normalized form (or decided to skip the
normalization step), I'd suggest going to UTF-8 (which is pretty
unambiguous) and then hashing the result.  The problem with another choice
is that UTF-16 comes in two flavors (UTF-16BE and UTF-16LE); UTF-32 also
has two flavors (UTF-32BE and UTF-32LE), and whatever your current
Python, you may well switch between UTF-16 and UTF-32 internally at some
point as you do regular upgrades (or BE vs. LE if you switch CPUs).
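
A rough sketch of that normalize, encode, hash pipeline, again for
Python 3.  The function name normalized_md5 and the default of NFC are
my assumptions; use whichever form you settled on:

```python
import hashlib
import unicodedata

def normalized_md5(text, form='NFC'):
    """Normalize text, encode it as UTF-8, and hash the bytes (hypothetical helper)."""
    normalized = unicodedata.normalize(form, text)
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()

composed = '\u00e9'     # e-acute as a single code point
decomposed = 'e\u0301'  # 'e' followed by a combining acute accent

# The two strings differ as code point sequences...
print(composed == decomposed)
# ...but hash identically once both are normalized to the same form.
print(normalized_md5(composed) == normalized_md5(decomposed))

# UTF-16 has no single byte sequence for a character: the order depends on the flavor.
print('A'.encode('utf-16-be'), 'A'.encode('utf-16-le'))
```

Skip the normalize step and the two spellings of the same character
hash differently, which is the trap described above.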

--Scott David Daniels
Scott.Daniels at Acm.Org



