BUG? sha-module returns same crc for different files

Kees de Laat kdelaat at compuserve.com
Fri Sep 15 14:48:16 EDT 2000


Thomas Weholt <thomas at cintra.no> wrote in message
8psovc$e9o$1 at oslo-nntp.eunet.no...
> Hi,
>
> I'm using SHA to generate crc codes in a large project, but I've already
> discovered equal crc codes for different files, even though the sizes differ.
> How can this be? Someone told me that the chance of getting the same crc code
> for two different files is very slim. Why does this happen? Isn't sha the
> module to use for this? Is md5 better? I need to be sure that the crc code
> is unique for that file, or at least that there is a very good chance that
> the file is unique. Just testing it on a few thousand files returned at least
> one "collision". It seems to only happen with jpg images, or at least so far.
> This is the code I used :
>
> filename1 = 'K:\\KimIglinsky-0101-1.jpg'
> filename2 = 'K:\\ShirleyMallman-1216-1.jpg'
> size1 = os.stat(filename1)[6]
> size2 = os.stat(filename2)[6]
> crc1 = sha.sha(open(filename1).read()).hexdigest()
> crc2 = sha.sha(open(filename2).read()).hexdigest()
>
> print filename1, crc1, size1
> print filename2, crc2, size2
> print "CRC1 == CRC2 : ", crc1 == crc2, "Size1 == Size2:", size1 == size2
>
> This is the output :
>
> K:\KimIglinsky-0101-1.jpg 9486845232ae19c8fc1f9dc10d65ae2f4ac4d95e 158275
> K:\ShirleyMallman-1216-1.jpg 9486845232ae19c8fc1f9dc10d65ae2f4ac4d95e 161972
> CRC1 == CRC2 :  1 Size1 == Size2: 0
>
> The output clearly says the size is different, but the crc the same.
>
> Do I need to switch to a different module? Any comments?
>
> NB! If this message has been posted twice I apologize. The first posting
> didn't seem to pop up in the group at all.
>
> Thomas
>
>

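Perhaps the files should be opened in binary mode. On Windows, a text-mode
read translates line endings and stops at the first Ctrl-Z (0x1A) byte, so
possibly only the identical start of each JPEG is being hashed. Try: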

  crc1 = sha.sha(open(filename1, 'rb').read()).hexdigest()
  crc2 = sha.sha(open(filename2, 'rb').read()).hexdigest()
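
For reference, here is a minimal sketch of the same binary-mode read for a
current Python, where hashlib.sha1 takes the place of the old sha module. The
helper name file_sha1 and the chunk_size parameter are illustrative
assumptions, not something from the original post. It reads in chunks so that
large files need not be held in memory all at once.

```python
import hashlib


def file_sha1(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file's raw bytes.

    Hypothetical helper: the name and chunk_size default are illustrative.
    """
    digest = hashlib.sha1()
    # 'rb' gives the exact bytes on disk: no newline translation and no
    # early end-of-file at a Ctrl-Z byte.
    with open(path, 'rb') as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

With this, file_sha1(filename1) == file_sha1(filename2) should only hold if
the two files have identical contents, barring a genuine SHA-1 collision.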

Kees




