[Tutor] First program after PyCamp

bjames at Jamesgang.dyndns.org bjames at Jamesgang.dyndns.org
Mon Jun 10 22:03:59 CEST 2013


Hello, I just took a three-day PyCamp and am now working on my first program from
start to finish, and I've run into a problem.

I'm not sure how things are usually done on this list, so I'll first
describe what my program is supposed to do and where I'm having
problems, then post the actual code.  Please don't simply come back with a
working version of the code, since I want to understand what each step does.
Instead, explain where I went wrong and why the new version is better, or
better yet, point me at where I should look so I can fix it without it being
fixed for me.

The program is supposed to be given a directory, look through that
directory and all of its sub-directories, find duplicate files, and write the
list of duplicates to a text file.  It does this by generating an
MD5 hash for each file and then comparing that hash against the hashes
already generated for earlier files.
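
For reference, this is the kind of per-file hashing I mean.  It's a rough
sketch rather than my actual code, and md5_of_file is just an illustrative
name, not something in my program:

import hashlib

def md5_of_file(fullname):
    # read the file in binary mode and feed MD5 in small chunks, so a
    # large file never has to be held in memory all at once
    md5 = hashlib.md5()
    with open(fullname, 'rb') as f:
        while True:
            chunk = f.read(4096)
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()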

My problem is that the output text file seems to contain EVERY file
the program went through rather than just the duplicates.  I've tried
to mentally step through the code to figure out where and why it lists
all of them instead of only the duplicates, and I can't spot it.
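
To make what I expect concrete, here is a toy illustration (the hashes and
paths below are made up, not my real data): the dictionary should map each
hash to a list of paths, and only entries whose list holds more than one
path should end up in the output.

# hypothetical example of the structure I expect, with invented paths
hashlist = {
    'abc123': [r'c:\Python Test\report.txt', r'c:\Python Test\old\report.txt'],  # duplicates
    'def456': [r'c:\Python Test\unique.txt'],                                    # unique file
}
dups = [paths for paths in hashlist.values() if len(paths) > 1]
# expected: [[r'c:\Python Test\report.txt', r'c:\Python Test\old\report.txt']]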

Here's the code:
---begin code paste---
import os, hashlib
#rootdir = 'c:\Python Test'
hashlist = {}  # content signature -> list of filenames
dups = []

def get_hash(rootdir):
#   """goes through directory tree, compares md5 hash of all files,
#   combines files with same hash value into list in hashmap directory"""
    for path, dirs, files in os.walk(rootdir):
        # this section walks the given directory and all subdirectories/files
        # below it, reading them in as part of a loop
        for filename in files:
            # steps through each file and starts the process of getting the
            # MD5 hash for the file, comparing that hash to the known hashes
            # that have already been calculated, and either merges it with the
            # known hash (which indicates a duplicate) or adds it so that it
            # can be compared against future files
            fullname = os.path.join(path, filename)
            with open(fullname, 'rb') as f:  # binary mode so the hash sees the raw bytes
                #does the actual hashing
                md5 = hashlib.md5()
                while True:
                    d = f.read(4096)
                    if not d:
                        break
                    md5.update(d)
                h = md5.hexdigest()
                filelist = hashlist.setdefault(h, [])
                filelist.append(fullname)

    for x in hashlist:
        currenthash = hashlist[x]
        # goes through each hash and, if it has more than one file listed
        # with it, considers it a duplicate and adds it to the output list
        if len(currenthash) > 1:
            dups.append(currenthash)
    output = open('duplicates.txt','w')
    output.write(str(dups))
    output.close()

---end code paste---
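
In case it matters, I'm calling it roughly like this, using the test
directory from the commented-out rootdir line:

get_hash(r'c:\Python Test')   # duplicates.txt ends up in the current working directory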

Thank you for taking the time to help me

Bryan


