Calculate sha1 hash of a binary file

John Krukoff jkrukoff at ltgc.com
Wed Aug 6 15:56:09 EDT 2008


On Wed, 2008-08-06 at 12:31 -0700, LaundroMat wrote:
> Hi -
> 
> I'm trying to calculate unique hash values for binary files,
> independent of their location and filename, and I was wondering
> whether I'm going in the right direction.
> 
> Basically, the hash values are calculated thusly:
> 
> f = open('binaryfile.bin')
> import hashlib
> h = hashlib.sha1()
> h.update(f.read())
> hash = h.hexdigest()
> f.close()
> 
> A quick try-out shows that effectively, after renaming a file, its
> hash remains the same as it was before.
> 
> I have my doubts however as to the usefulness of this. As f.read()
> does not seem to read until the end of the file (for a 3.3MB file only
> a string of 639 bytes is being returned, perhaps a 00-byte counts as
> EOF?), is there a high danger of collisions?
> 
> Are there better ways of calculating hash values of binary files?
> 
> Thanks in advance,
> 
> Mathieu

Looks like you're doing the right thing from here. file.read() with no
size parameter always reads to the end of the file (for completeness,
I'll mention that the documentation warns this is not the case if the
file is in non-blocking mode, which yours is not).

Python never treats null bytes as special in strings, so no, you're not
getting an early EOF due to that. 
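
If you want to check this yourself, here is a minimal sketch. The
temporary file and its contents are made up for illustration; the point
is only that data containing null bytes comes back complete when the
file is read in binary mode.

```python
import os
import tempfile

# Hypothetical test data: 8 bytes with nulls in the middle, repeated.
payload = b"abc\x00\x00def" * 1000

# Write the payload to a temporary file, then read it back.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        data = f.read()  # no size argument: read until EOF
finally:
    os.remove(path)

# True if every byte, nulls included, survived the round trip.
read_matches = (data == payload)
print(len(data), read_matches)
```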

I wouldn't worry about your hashing code; that looks fine. If I were
you, I'd try to figure out what's going wrong with your file handles. I
suspect that wherever you saw your short read, you were either not
opening the file in the correct mode ('rb' for binary data) or did not
rewind the file (with file.seek(0)) after having previously read data
from it.
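
To illustrate the rewind problem, here is a small sketch that uses an
in-memory io.BytesIO object as a stand-in for a real file (an
assumption made to keep the example self-contained). A second read()
on the same handle returns nothing until the position is reset.

```python
import io

# In-memory stand-in for a file opened in binary mode (made-up data).
f = io.BytesIO(b"\x00\x01\x02\x03" * 256)

first = f.read()    # consumes everything; position is now at EOF
second = f.read()   # nothing left to read, so this is empty
f.seek(0)           # rewind to the start of the data
third = f.read()    # the full contents again

print(len(first), len(second), len(third))
```

If the hash was computed from a read like the second one, it would be
the hash of an empty string, not of the file.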

The hashing logic above is fine as is; I can't see any problems with
it.
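
For larger files, one possible variant opens the file explicitly in
binary mode and feeds it to the hash in chunks, so the whole file never
has to sit in memory. This is only a sketch: the function name
sha1_of_file and the 64 KiB chunk size are my own choices, not anything
from the code above. The 'rb' matters mostly on Windows, where Python 2
text mode stops reading at a Ctrl-Z (0x1A) byte.

```python
import hashlib

def sha1_of_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of the file at `path`."""
    h = hashlib.sha1()
    # "rb" = binary mode, so bytes are read exactly as stored on disk.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty result means end of file
                break
            h.update(chunk)
    return h.hexdigest()
```

Calling sha1_of_file('binaryfile.bin') should give the same digest as
hashing the output of a single full read() in binary mode, since
successive update() calls are equivalent to one call on the
concatenated data.
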
-- 
John Krukoff <jkrukoff at ltgc.com>
Land Title Guarantee Company



