binary file comparison with the md5 module

Fredrik Lundh fredrik at pythonware.com
Wed Jun 13 15:02:04 EDT 2001


Christian Reyes wrote

> I'm trying to write a script that takes two binary files and returns whether
> or not their data is completely matching.
>
> One of my peers suggested that an efficient way to do this would be to run
> the md5 algorithm on each file and then compare the resultant output.

if you're comparing two binary files, that's not very efficient -- md5
has to read both files from start to end, even when they differ in the
first few bytes.  you really don't have to read the *entire* file to
figure out if there are any differences...

a better solution is to start by comparing the sizes (if they're different,
the files cannot possibly have the same content), and then read same-sized
chunks from both files.  as soon as two chunks differ, the files are
different.
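
a minimal sketch of that scheme, assuming Python 3.  the names
same_contents and chunk_size are made up for illustration and are not
part of any library:

```python
import os


def same_contents(path1, path2, chunk_size=8192):
    """Return True if the two files hold identical bytes."""
    # different sizes mean different contents; nothing needs to be read
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    # binary mode, so the bytes arrive unmodified
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            chunk1 = f1.read(chunk_size)
            chunk2 = f2.read(chunk_size)
            if chunk1 != chunk2:
                return False  # the first differing chunk settles it
            if not chunk1:
                return True  # both files ended without a difference
```

the chunk size is an arbitrary choice here; larger chunks mean fewer
reads, smaller chunks mean less data read before a difference is found.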

the filecmp module implements this scheme:

    import filecmp
    # file1 and file2 are the names of the two files to compare
    if filecmp.cmp(file1, file2, shallow=0):
        print("same contents")

(the shallow=0 flag makes sure that filecmp.cmp checks the contents
even if the size and modification time attributes happen to match)
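
to see what the flag changes, here is a small sketch, assuming Python 3.
the file names and the timestamp are invented for the demonstration.  it
builds two files with the same size and modification time but different
bytes, then compares them both ways:

```python
import filecmp
import os
import tempfile

# two made-up files: same size, different bytes
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "a.bin")
path_b = os.path.join(tmpdir, "b.bin")
with open(path_a, "wb") as f:
    f.write(b"spam")
with open(path_b, "wb") as f:
    f.write(b"eggs")

# force identical modification times so the os.stat() signatures match
os.utime(path_a, (1000000000, 1000000000))
os.utime(path_b, (1000000000, 1000000000))

# shallow comparison trusts the matching size and modification time
shallow_result = filecmp.cmp(path_a, path_b, shallow=True)
# deep comparison reads the files and compares the bytes
deep_result = filecmp.cmp(path_a, path_b, shallow=False)
```

with matching signatures the shallow call should report a match without
reading either file, while the deep call should notice the difference.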

> I tried opening the file with the built-in python open command, and then
> reading the contents of the file into a buffer.  But I think my problem is
> that when I read the binary file into a buffer, the contents get tweaked
> somehow.
>
> >>> x = open('d:\\binary.wav')

if you double-check the docs (look for "open" under builtin functions
in the library reference), you'll notice that Python opens files in text
mode by default.

to open a binary file, add "rb" as the second argument to open:

    >>> x = open('d:/binary.wav', 'rb')
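
a quick way to check that nothing gets tweaked in binary mode is a round
trip.  this is only a sketch, assuming Python 3; the file name and the
payload are made up:

```python
import os
import tempfile

# every possible byte value, including "\r", "\n" and 0x1a
payload = bytes(range(256))

path = os.path.join(tempfile.mkdtemp(), "roundtrip.bin")
with open(path, "wb") as f:   # "wb": write the bytes untouched
    f.write(payload)
with open(path, "rb") as f:   # "rb": read the bytes untouched
    data = f.read()

unchanged = (data == payload)
```

in text mode the result depends on the platform; on Windows, for
example, line endings are translated, which is enough to make two
buffers differ.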

hope this helps!

</F>




