Python SHA-1 as a method for unique file identification ? [help!]

EP eric.pederson at gmail.com
Wed Jun 21 03:09:39 EDT 2006


This inquiry may either turn out to be about the suitability of the
SHA-1 (160 bit digest) for file identification, the sha function in
Python ... or about some error in my script.  Any insight appreciated
in advance.

I am trying to reduce duplicate files in storage at home - I have a
large number of files (e.g. MP3s) which have been stored on disk multiple
times under different names or on different paths.  The applications
that use them search down from the top path to find the files, so
I do not need to worry about keeping track of paths.

All seemed to be working until I examined my log files and found that
files with the same SHA digest had different sizes according to
os.stat(fpath).st_size.  This is on Windows XP.

    -  Am I expecting too much of SHA-1?
    -  Is it that the os.stat data on Windows cannot be trusted?
    -  Or perhaps there is a silly error in my code I should have seen?
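
To separate the second question from the other two, a check along these
lines might help.  It is only a sketch: read_lengths is a name made up
here and is not part of the script below.  It compares the size reported
by os.stat with the number of bytes actually read in binary mode.  If
the two agree, os.stat is probably fine and the problem is more likely
in how the file is read for hashing.

```python
import os

def read_lengths(path):
    """Return (size reported by os.stat, bytes read in binary mode)."""
    size_on_disk = os.stat(path).st_size
    f = open(path, "rb")  # binary mode: raw bytes, no newline translation
    try:
        data = f.read()
    finally:
        f.close()
    return size_on_disk, len(data)
```

Running this over a few of the files listed in the log extract and
comparing the two numbers would be a quick first test.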

Thanks

- Eric

-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -

Log file extract:

Dup:  no    Path:  F:\music\mp3s\01125.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  01125.mp3    Size:  63006
Dup:  YES    Path:  F:\music\mp3s\0791.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  0791.mp3    Size:  50068
Dup:  YES    Path:  F:\music\mp3s\12136.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  12136.mp3    Size:  51827
Dup:  YES    Path:  F:\music\mp3s\11137.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  11137.mp3    Size:  56417
Dup:  YES    Path:  F:\music\mp3s\0991.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  0991.mp3    Size:  59043
Dup:  YES    Path:  F:\music\mp3s\0591.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  0591.mp3    Size:  59162
Dup:  YES    Path:  F:\music\mp3s\10140.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  10140.mp3    Size:  59545
Dup:  YES    Path:  F:\music\mp3s\0491.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  0491.mp3    Size:  63101
Dup:  YES    Path:  F:\music\mp3s\0392.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  0392.mp3    Size:  63252
Dup:  YES    Path:  F:\music\mp3s\0891.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  0891.mp3    Size:  65808
Dup:  YES    Path:  F:\music\mp3s\0691.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  0691.mp3    Size:  67050
Dup:  YES    Path:  F:\music\mp3s\0294.mp3    Hash:  00b3acb529aae11df186ced8424cb189f062fa48    Name:  0294.mp3    Size:  67710


Code:

# Dedup_inplace.py
# vers .02
# Python 2.4.1

# Create a dictionary consisting of hash:path
# Look for 2nd same hash and delete path

testpath=r"F:\music\mp3s"
logpath=r"C:\testlog6.txt"

import os, sha

def hashit(pth):
    """Takes a file path and returns a SHA hash of its string"""
    fs=open(pth,'r').read()
    sh=sha.new(fs).hexdigest()
    return sh

def logData(d={}, logfile="c://filename999.txt", separator="\n"):
    """Takes a dictionary of values and writes them to the provided file path"""
    logstring=separator.join([str(key)+":  "+d[key] for key in d.keys()])+"\n"
    f=open(logfile,'a')
    f.write(logstring)
    f.close()
    return

def walker(topPath):
    fDict={}
    logDict={}
    limit=1000
    freed_space=0
    items=0    # defined up front so the outer limit check works even if the first directory has no files
    for root, dirs, files in os.walk(topPath):
        for name in files:
            fpath=os.path.join(root,name)
            fsize=os.stat(fpath).st_size
            fkey=hashit(fpath)
            logDict["Name"]=name
            logDict["Path"]=fpath
            logDict["Hash"]=fkey
            logDict["Size"]=str(fsize)
            if fkey not in fDict.keys():
                fDict[fkey]=fpath
                logDict["Dup"]="no"
            else:
                #os.remove(fpath)      --uncomment only when script proven
                logDict["Dup"]="YES"
                freed_space+=fsize
            logData(logDict, logpath, "\t")
            items=len(fDict.keys())
            print "Dict entry:  ",items,
            print "Cum freed space:  ",freed_space
            if items > limit:
                break
        if items > limit:
            break

def emptyNests(topPath):
    """Walks downward from the given path and deletes any empty directories"""
    for root, dirs, files in os.walk(topPath):
        for d in dirs:
            dpath=os.path.join(root,d)
            if len(os.listdir(dpath))==0:
                print "deleting:  ", dpath
                os.rmdir(dpath)
               
walker(testpath)
emptyNests(testpath)
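
One possible cause worth ruling out: hashit opens each file with mode
'r'.  On Windows that is text mode, which translates line endings and
can treat a Ctrl-Z (0x1A) byte as end of file, so the digest may cover
only the first part of a binary file such as an MP3.  Files that share
the same leading bytes would then hash the same while differing in
size.  Below is a minimal sketch of hashing in binary mode, reading in
chunks.  Assumptions: hash_file and chunk_size are illustrative names,
and hashlib (Python 2.5 and later) is swapped in for the sha module
used above.

```python
import hashlib

def hash_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file's raw bytes.

    hash_file and chunk_size are illustrative names, not part of the
    original script.
    """
    digest = hashlib.sha1()
    f = open(path, "rb")  # binary mode: no newline translation, no early EOF at 0x1A
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            digest.update(chunk)
    finally:
        f.close()
    return digest.hexdigest()
```

Reading in chunks also avoids holding a whole file in memory at once.
If walker called something like this instead of hashit, files of
different sizes should no longer share a digest unless a real SHA-1
collision occurred.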
