Python SHA-1 as a method for unique file identification ? [help!]

Tim Peters tim.peters at gmail.com
Wed Jun 21 04:34:45 EDT 2006


[EP <eric.pederson at gmail.com>]
> This inquiry may either turn out to be about the suitability of the
> SHA-1 (160 bit digest) for file identification, the sha function in
> Python ... or about some error in my script

It's your script.  Always open binary files in binary mode.  It's a
disaster on Windows if you don't (if you open a file in text mode on
Windows, the OS pretends that EOF occurs at the first instance of byte
chr(26) -- this is an ancient Windows behavior that made an odd kind
of sense in the mists of history, and has persisted in worship of
Backward Compatibility even though the original reason for it went
away _long_ ago).
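
To see the damage concretely, here's a tiny sketch (the file name is
made up; run it on a Windows box).  Text mode hands back different
data than is actually on disk, which is how two files of different
sizes can end up hashing identically:  if they agree up to a chr(26),
text mode gives you the same truncated string for both, while
os.stat() still reports their true, different sizes.

    # Illustration only.  On Windows, text mode does not return the
    # bytes on disk:  '\r\n' collapses to '\n', and the old
    # stdio-based file objects stop dead at the first chr(26).
    with open("demo.bin", "wb") as f:
        f.write(b"line1\r\nline2\x1aline3")

    with open("demo.bin", "rb") as f:
        raw = f.read()
    with open("demo.bin", "r") as f:
        cooked = f.read()

    print(len(raw))       # 18 -- every byte that was written
    print(len(cooked))    # shorter on Windows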

...

> I am trying to reduce duplicate files in storage at home - I have a
> large number files (e.g. MP3s) which have been stored on disk multiple
> times under different names or on different paths.

...

> All seemed to be working until I examined my log files and found files
> with the same SHA digest had different sizes according to
> os.stat(fpath).st_size .  This is on Windows XP.
>
>     -  Am I expecting too much of SHA-1?

No.  SHA-1 should work well for this.

>     -  Is it that the os.stat data on Windows cannot be trusted?

It can be trusted to the extent that anything on Windows can be trusted ;-)

...
> def hashit(pth):
>     """Takes a file path and returns a SHA hash of its string"""
>     fs=open(pth,'r').read()

Make that 'rb' instead of 'r', and your problem will go away.
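
For instance, a fixed-up reader might look like this (just a sketch,
not your full function; hashlib.sha1 is the modern spelling of the
sha module, and reading in chunks keeps big MP3s out of memory):

    import hashlib

    def hashit(pth):
        """Return the SHA-1 hex digest of the file's bytes."""
        h = hashlib.sha1()
        with open(pth, 'rb') as f:   # 'rb': hash the bytes on disk
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        return h.hexdigest()

With that, two files get the same digest only if their bytes agree
(to any practical certainty), and os.stat() will stop appearing to
contradict the hashes.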

Do note that there are faster ways to go about this.  For example, if
a particular file has a unique size among the files you're looking at,
there's no reason to even open that file (let alone compute its hash).
 Group the files by size, and within a size group you can find
duplicates by, e.g., first comparing their initial 16 bytes, then the
next 32, then the next 64 ... etc.  If duplicates are uncommon, this
can be a huge savings.  For example, here's output from an old
dup-finding Python program of mine run over a directory tree which
happens to contain no duplicates:

Files
-----
Total               10,718
Unique              10,718
Duplicate                0
# w/ unique size    10,053
# w/ unique prefix     665

Bytes
-----
Total      1,401,668,015
Unique     1,401,668,015
Duplicate              0
Read              76,688
Excess            76,688

The last two lines mean it actually read a total of only about 77,000
file bytes on its way to proving there were no duplicates among about
11,000 files spanning about 1.4 gigabytes.
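
If you want to try that strategy, here's a minimal sketch of the idea
(not the program that produced the numbers above; the chunk sizes and
the function name are invented, and a real program would avoid
reopening each file every round):

    import os
    from collections import defaultdict

    def find_duplicates(paths):
        # Pass 1: group by size; a file whose size is unique is
        # never opened at all.
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.stat(p).st_size].append(p)

        dups = []
        for size, group in by_size.items():
            if len(group) < 2:
                continue
            # Pass 2: repeatedly split the group on the next slice of
            # content, doubling the slice each round (16, 32, 64, ...).
            chunk, offset, groups = 16, 0, [group]
            while groups and offset < size:
                survivors = []
                for g in groups:
                    by_piece = defaultdict(list)
                    for p in g:
                        with open(p, 'rb') as f:
                            f.seek(offset)
                            by_piece[f.read(chunk)].append(p)
                    # A slice nobody else shares proves the file unique.
                    survivors.extend(s for s in by_piece.values()
                                     if len(s) > 1)
                groups = survivors
                offset += chunk
                chunk *= 2
            # Whatever survived matched some other file byte for byte.
            dups.extend(groups)
        return dups

Each entry in the result is a list of paths whose contents are
identical, so you can keep one and delete the rest.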

No hashes are computed by that program.  I'd use a hash if, instead,
I were going to save a persistent (across runs) set of known hashes, so
that answering "here's a new file -- same as one I already have?"
could be done by computing its hash.  While the program above
generally reads "amazingly" few file bytes, it hammers the OS file
directory services, and it would go faster to compute a hash of a new
file than to run the whole analysis again.
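
A persistent store like that can be as simple as a text file of hex
digests.  A sketch, with an invented file name, reusing the chunked
hashit() from earlier in this message:

    import os

    KNOWN_PATH = "known_hashes.txt"   # invented location for the store

    def load_known():
        """Load digests saved by earlier runs, if any."""
        if not os.path.exists(KNOWN_PATH):
            return set()
        with open(KNOWN_PATH) as f:
            return set(line.strip() for line in f)

    def seen_before(pth, known):
        """Is this new file identical to one we already have?"""
        digest = hashit(pth)          # the 'rb' + hashlib sketch above
        if digest in known:
            return True
        known.add(digest)
        with open(KNOWN_PATH, "a") as f:   # remember it for next time
            f.write(digest + "\n")
        return False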


