[perl-python] a program to delete duplicate files
Terry Hancock
hancock at anansispaceworks.com
Wed Mar 9 17:13:20 EST 2005
On Wednesday 09 March 2005 06:56 am, Xah Lee wrote:
> here's a large exercise that uses what we built before.
>
> suppose you have tens of thousands of files in various directories.
> Some of these files are identical, but you don't know which ones are
> identical with which. Write a program that prints out which files are
> redundant copies.
For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use. I don't
think it exactly matches the spec, but it addresses the problem. I wrote
this to clean up a large tree of image files once. The exact behavior
described requires the '--exec="ls %s"' option as mentioned in the help.
#!/usr/bin/env python
# (C) 2003 Anansi Spaceworks
#---------------------------------------------------------------------------
# find_duplicates
"""
Utility to find duplicate files in a directory tree by
comparing their checksums.
"""
#---------------------------------------------------------------------------
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#---------------------------------------------------------------------------
import os, sys, md5, getopt
def file_walker(tbl, srcpath, files):
    """
    Visit a path and collect data (including checksum) for files in it.
    """
    for fname in files:
        filepath = os.path.join(srcpath, fname)
        if os.path.isfile(filepath):
            # Read in binary mode so checksums are stable across platforms.
            chksum = md5.new(open(filepath, 'rb').read()).digest()
            if not tbl.has_key(chksum): tbl[chksum] = []
            tbl[chksum].append(filepath)
def find_duplicates(treeroot, tbl=None):
    """
    Find duplicate files in directory.
    """
    dup = {}
    if tbl is None: tbl = {}
    os.path.walk(treeroot, file_walker, tbl)
    for k, v in tbl.items():
        if len(v) > 1:
            dup[k] = v
    return dup
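One refinement worth noting before moving on: find_duplicates above checksums every file it visits, which can be slow on a large tree. Since only files of the same size can possibly be identical, grouping by size first and hashing only the groups with more than one member avoids most of that work. This is not part of the original script; it is a sketch in modern Python, with a helper name (group_by_size) of my own invention:

```python
import os
from collections import defaultdict

def group_by_size(filepaths):
    """Group files by byte size; only same-size files can be duplicates."""
    groups = defaultdict(list)
    for path in filepaths:
        if os.path.isfile(path):
            groups[os.path.getsize(path)].append(path)
    # Sizes seen only once cannot have a duplicate, so drop them unhashed.
    return dict((size, paths) for size, paths in groups.items()
                if len(paths) > 1)
```

Only the surviving groups would then be fed to the checksum step.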
usage = """
USAGE: find_duplicates <options> [<path ...]
Find duplicate files (by matching md5 checksums) in a
collection of paths (defaults to the current directory).
Note that the order of the paths searched will be retained
in the resulting duplicate file lists. This can be used
with --exec and --index to automate handling.
Options:
-h, -H, --help
Print this help.
-q, --quiet
Don't print normal report.
-x, --exec=<command string>
Python-formatted command string to act on the indexed
duplicate in each duplicate group found. E.g. try
--exec="ls %s"
-n, --index=<index into duplicates>
Which in a series of duplicates to use. Begins with '1'.
Default is '1' (i.e. the first file listed).
Example:
You've copied many files from path ./A into path ./B. You want
to delete all the ones you've processed already, but not
delete anything else:
% find_duplicates -q --exec="rm %s" --index=1 ./A ./B
"""
def main():
    action = None
    quiet = 0
    index = 1
    dup = {}
    opts, args = getopt.getopt(sys.argv[1:], 'qhHn:x:',
                               ['quiet', 'help', 'exec=', 'index='])
    for opt, val in opts:
        if opt in ('-h', '-H', '--help'):
            print usage
            sys.exit()
        elif opt in ('-x', '--exec'):
            action = str(val)
        elif opt in ('-n', '--index'):
            index = int(val)
        elif opt in ('-q', '--quiet'):
            quiet = 1
    if len(args) == 0:
        dup = find_duplicates('.')
    else:
        tbl = {}
        for arg in args:
            dup = find_duplicates(arg, tbl=tbl)
    for k, v in dup.items():
        if not quiet:
            print "Duplicates:"
            for f in v: print "\t%s" % f
        if action:
            os.system(action % v[index-1])

if __name__ == '__main__':
    main()
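The script above is written for the Python of its day: the md5 module, os.path.walk, dict.has_key, and the print statement are all gone in Python 3. As a rough sketch only (function names here are mine, not from the original), the core duplicate-finding logic might translate to modern Python like this, using hashlib and os.walk, with files read in chunks rather than all at once:

```python
import hashlib
import os
from collections import defaultdict

def checksum(filepath, chunk_size=65536):
    """MD5 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(treeroot):
    """Return {checksum: [paths]} for checksums shared by 2+ files."""
    tbl = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(treeroot):
        for name in filenames:
            filepath = os.path.join(dirpath, name)
            tbl[checksum(filepath)].append(filepath)
    return {k: v for k, v in tbl.items() if len(v) > 1}
```

The option handling and --exec/--index machinery would carry over largely unchanged (argparse would replace getopt in new code).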
--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks http://www.anansispaceworks.com