BUG?: sha returns same crc for diff. files!!

Thomas Weholt thomas at cintra.no
Fri Sep 15 03:28:19 EDT 2000


Hi,

I'm using SHA to generate crc codes in a large project, but I've already
discovered equal crc codes for different files, even when the sizes differ. How
can this be? Someone told me that the chance of getting the same crc code
for two different files is very slim. Why does this happen? Isn't sha the
module to use for this? Is md5 better? I need to be sure that the crc code
is unique for that file, or at least that there is a very good chance that it
is unique. Just testing it on a few thousand files returned at least one
"collision". It seems to happen only with jpg images, at least so far.
This is the code I used:

import os
import sha

filename1 = 'K:\\KimIglinsky-0101-1.jpg'
filename2 = 'K:\\ShirleyMallman-1216-1.jpg'
size1 = os.stat(filename1)[6]
size2 = os.stat(filename2)[6]
crc1 = sha.sha(open(filename1).read()).hexdigest()
crc2 = sha.sha(open(filename2).read()).hexdigest()

print filename1, crc1, size1
print filename2, crc2, size2
print "CRC1 == CRC2 : ", crc1 == crc2, "Size1 == Size2:", size1 == size2

This is the output:

K:\KimIglinsky-0101-1.jpg 9486845232ae19c8fc1f9dc10d65ae2f4ac4d95e 158275
K:\ShirleyMallman-1216-1.jpg 9486845232ae19c8fc1f9dc10d65ae2f4ac4d95e 161972
CRC1 == CRC2 :  1 Size1 == Size2: 0

The output clearly shows that the sizes are different but the crc is the same.

Do I need to switch to a different module? Any comments?
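
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: the K:\ paths suggest Windows, where open(filename) without a mode
opens the file in text mode, and on the Python versions of that era a text-mode
read stops at the first Ctrl-Z byte (0x1A). Two JPEGs that share the same
leading bytes up to such a byte would then hash identically even though their
sizes differ. The sketch below reads in binary mode instead and also returns
the number of bytes hashed, so it can be compared against the size reported by
os.stat. It uses hashlib.sha1 from current Python as a stand-in for sha.sha;
the helper name file_sha1 and the chunk_size parameter are made up for
illustration.

```python
import hashlib


def file_sha1(path, chunk_size=65536):
    """Return (hex digest, bytes hashed) for the file at path.

    Hypothetical helper for illustration; hashlib.sha1 stands in for the
    old sha.sha used in the code above.
    """
    digest = hashlib.sha1()
    total = 0
    # "rb" opens the file in binary mode: no newline translation and no
    # special handling of the Ctrl-Z (0x1A) byte.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            total += len(chunk)
    return digest.hexdigest(), total
```

If the byte count returned here matches the size from os.stat but the count
from a plain open(filename).read() does not, the text-mode read was cut short,
and the identical digests came from identical file prefixes rather than from a
real SHA collision.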

Thomas
