Howto find same files?

Mike Fletcher mfletch at tpresence.com
Sun Oct 29 00:51:46 EDT 2000


Here's a possibility:

	use os.path.walk to process each file in the tree of directories, with
	a dictionary shared as the argument (see the os.path.walk docs)
		for each file:
			create a 5- or 6-element tuple with the "identifying" info
			(see the os, stat, and possibly sha/md5 modules)
			if the tuple is already a key in the dictionary:
				delete the current file and print a message to the
				console saying that a duplicate was found, giving
				the name of the current file and the value associated
				with the key in the dictionary (see the next step for
				what that will be)
			else:
				add a tuple:full_filename entry to the dictionary
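
The outline above might look roughly like this in present-day Python. This is only a sketch under a few assumptions: it uses os.walk and hashlib where the outline names os.path.walk and the sha/md5 modules (those were removed in Python 3), it simplifies the "identifying" tuple to a (size, SHA-256 digest) pair, and it reports duplicates instead of deleting them. The names file_digest and find_duplicates are made up for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Walk root and return a list of (duplicate_path, first_seen_path) pairs."""
    seen = {}        # identifying tuple -> full filename of the first file seen
    duplicates = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order repeatable
        for name in sorted(filenames):
            full_name = os.path.join(dirpath, name)
            if os.path.islink(full_name) or not os.path.isfile(full_name):
                continue  # skip symlinks and special files
            key = (os.path.getsize(full_name), file_digest(full_name))
            if key in seen:
                # Same size and same digest as an earlier file: report it.
                duplicates.append((full_name, seen[key]))
            else:
                seen[key] = full_name
    return duplicates
```

Calling find_duplicates("/data") returns the pairs to print. Once the output has been checked by eye, an os.remove(duplicate) call could be added for each pair to do the actual deletion.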

Now, of course, with hundreds of thousands of files, that may take a while
to run, but if it turns out to work for you, then hey, you're done and can
go on to the next project.
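
If the run does turn out to be too slow, one possible shortcut is to group the files by size first and only read the contents of files that share a size with at least one other file, since a file with a unique size cannot have a duplicate. The following is a sketch of that idea, not a tested tool: sha256_of and duplicate_groups are hypothetical names, and hashlib's SHA-256 stands in for whatever checksum is preferred.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(root):
    """Return sorted lists of paths under root whose contents are identical."""
    # Pass 1: bucket every regular file by its size (cheap, no file reads).
    by_size = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full_name = os.path.join(dirpath, name)
            if os.path.isfile(full_name) and not os.path.islink(full_name):
                by_size[os.path.getsize(full_name)].append(full_name)

    # Pass 2: hash only the files whose size occurs more than once.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)
```

Each returned group lists files with the same size and digest; keeping the first path in a group and removing the rest would be one way to clean up.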

Enjoy yourself,
Mike


-----Original Message-----
From: gregoire.favre at ima.unil.ch [mailto:gregoire.favre at ima.unil.ch]
Sent: Saturday, October 28, 2000 5:16 PM
To: python-list at python.org
Subject: Howto find same files?


Hello,

Two friends told me that I should turn to Python to solve my problem.
I have fetched quite a lot of files and put them in /data: lots of
patchxxx.{gz,bz2} files for lots of things, lots of MIDI files grabbed
from newsgroups using newsfetch, some mp3s, and so on. They are spread
over several directories, and there are far too many to sort by hand
(to give an idea, find /data|wc -l gives me 128291).

What I want to do is find the files that are the same. A good start
would be the files that have the same name and the same size. Better
would be to find files that have the same size (I have, for example, a
lot of files named 1.mid), then do a kind of diff between them and, if
they are the same, rm the copies.

I have read half of the Python tutorials, but I don't know how to
begin...

Would it be a good idea to create a file that contains the path,
filename, size, and md5sum of each file, and then work from that?

Does someone have another idea, or has someone already programmed this?

Thank you very much,

	Greg

