[perl-python] a program to delete duplicate files

Claudio Grondi claudio.grondi at freenet.de
Sun Mar 20 12:10:38 EST 2005


>> I'll post my version in a few days.
Have I missed something?
Where can I see your version?

Claudio


"Xah Lee" <xah at xahlee.org> wrote in message
news:1110372973.657649.212920 at l41g2000cwc.googlegroups.com...
> Here's a large exercise that uses what we built before.
>
> Suppose you have tens of thousands of files in various directories.
> Some of these files are identical, but you don't know which ones are
> identical to which. Write a program that prints out which files are
> redundant copies.
>
> Here's the spec.
> --------------------------
> The program is to be used on the command line. Its arguments are one or
> more full paths of directories.
>
> perl del_dup.pl dir1
>
> prints the full paths of all files in dir1 that are duplicates
> (including files in subdirectories). More specifically, if file A has
> duplicates, A's full path will be printed on a line, immediately
> followed by the full paths of all other files that are copies of A.
> These duplicates' full paths will be prefixed with the string "rm ".
> An empty line follows each group of duplicates.
>
> Here's a sample output.
>
> inPath/a.jpg
> rm inPath/b.jpg
> rm inPath/3/a.jpg
> rm inPath/hh/eu.jpg
>
> inPath/ou.jpg
> rm inPath/23/a.jpg
> rm inPath/hh33/eu.jpg
>
> Order does not matter (i.e. which file in a group is left without
> the "rm " prefix does not matter).
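
The output format described above can be sketched in a few lines of Python. This is only an illustration, not Xah's version: the function name format_groups and the convention that the first path in each list is the one to keep are my own assumptions.

```python
def format_groups(groups):
    """Render duplicate groups in the output format of the spec.

    `groups` is an iterable of lists of paths. By assumption, the first
    path in each list is the copy to keep; the rest get the "rm " prefix.
    """
    lines = []
    for group in groups:
        if len(group) < 2:
            continue  # a file without duplicates is not printed at all
        keep, *redundant = group
        lines.append(keep)
        lines.extend("rm " + path for path in redundant)
        lines.append("")  # an empty line closes each group
    return "\n".join(lines)


sample = [
    ["inPath/a.jpg", "inPath/b.jpg", "inPath/3/a.jpg", "inPath/hh/eu.jpg"],
    ["inPath/ou.jpg", "inPath/23/a.jpg", "inPath/hh33/eu.jpg"],
]
print(format_groups(sample))
```

Fed the paths from the sample output, this should print the same two groups, each followed by an empty line.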
>
> ------------------------
>
> perl del_dup.pl dir1 dir2
>
> will do the same as above, except that duplicates within dir1 or dir2
> themselves are not considered. That is, all files in dir1 are compared
> to all files in dir2 (including subdirectories), and only files in dir2
> will have the "rm " prefix.
>
> One way to understand this is to imagine lots of image files in both
> dirs. One is certain that there are no duplicates within each dir
> itself (imagine that del_dup.pl has already been run on each). Files in
> dir1 have already been categorized into subdirectories by a human, so
> when there are duplicates between dir1 and dir2, one wants the
> version in dir2 to be deleted, leaving the organization in dir1 intact.
>
> perl del_dup.pl dir1 dir2 dir3 ...
>
> does the same as above, except that files in later dirs will get the
> "rm " prefix first. So, if there are these identical files:
>
> dir2/a
> dir2/b
> dir4/c
> dir4/d
>
> then c and d will both have the "rm " prefix for sure (which of the
> two files in dir2 gets "rm " does not matter). Note that although
> files inside dir2 are not compared with each other, duplicates may
> still be found implicitly by indirect comparison, i.e. a==c, b==c,
> therefore a==b, even though a and b are never compared.
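
A minimal sketch of this priority rule, assuming the identical files have already been grouped somehow. The name order_group, the "/" path separator and the prefix test for deciding which directory a file belongs to are my assumptions, not part of the spec.

```python
def order_group(paths, dirs):
    """Order one group of identical files so the copy to keep comes first.

    `dirs` is the list of command-line directories in the order given;
    a file under an earlier directory outranks one under a later one.
    Paths are assumed to use "/" as the separator.
    """
    def rank(path):
        for i, d in enumerate(dirs):
            if path.startswith(d.rstrip("/") + "/"):
                return i
        return len(dirs)  # not under any listed directory: lowest priority

    # sorted() is stable, so files from the same directory keep their
    # input order; which of them ends up first "does not matter".
    return sorted(paths, key=rank)


group = order_group(["dir4/c", "dir2/a", "dir4/d", "dir2/b"], ["dir2", "dir4"])
print(group[0])            # the copy that is kept
for path in group[1:]:
    print("rm " + path)    # every other copy, later dirs last
```

For the example with a, b, c and d, one file from dir2 is kept and the other three, including both files in dir4, get the "rm " prefix.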
>
>
> --------------------------
>
> Write a Perl or Python version of the program.
>
> An absolute requirement in this problem is to minimize the number of
> comparisons made between files. This is part of the spec.
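
One common way to keep the number of comparisons down, sketched here under my own assumptions and not taken from Xah's version: bucket files by size first, since files of different sizes cannot be identical, and only then read the files that share a size. Instead of comparing files pairwise, this sketch computes a SHA-256 digest per file and treats equal digests as identical content, which is an assumption (a byte-by-byte check with filecmp.cmp(a, b, shallow=False) could confirm each group). The names file_digest and find_duplicates are hypothetical, and the sketch ignores the rule that files within the same directory argument are not compared with each other.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(dirs):
    """Return lists of paths whose contents are assumed identical."""
    # Pass 1: bucket every regular file by size; no file is opened yet.
    by_size = defaultdict(list)
    for d in dirs:
        for root, _subdirs, names in os.walk(d):
            for name in names:
                path = os.path.join(root, name)
                if os.path.isfile(path):
                    by_size[os.path.getsize(path)].append(path)

    # Pass 2: only files that share a size are read and digested.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

Each file with a unique size is never opened, and each remaining file is read once, so no pair of files is ever compared directly.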
>
> Feel free to write it however you want. I'll post my version in a few
> days.
>
> http://www.xahlee.org/perl-python/python.html
>
>  Xah
>  xah at xahlee.org
>  http://xahlee.org/PageTwo_dir/more.html
>




