sorting with expensive compares?

Stuart D. Gathman stuart at bmsi.com
Wed Dec 28 03:07:27 EST 2005


On Sat, 24 Dec 2005 15:47:17 +1100, Steven D'Aprano wrote:

> On Fri, 23 Dec 2005 17:10:22 +0000, Dan Stromberg wrote:
>
>> I'm treating each file as a potentially very large string, and "sorting
>> the strings".
> 
> Which is a very strange thing to do, but I'll assume you have a good
> reason for doing so.

I believe what the original poster wants to do is eliminate duplicate
content from a collection of Ogg (or whatever) files with different names.
E.g., he has a Python script that goes out and collects all the free music
it can find on the web.  The same song may appear on many sites under
different names, and he wants only one copy of a given song.

In any case, as others have pointed out, sorting by MD5 is sufficient
except in two cases: accidental collisions, which are far less probable
than hardware failure, and deliberate collisions.  E.g., the RIAA could
create collision pairs of MP3 files where one member carries a freely
redistributable license, and the other a "copy this and we'll sue your
ass off" license, in an effort to trap the unwary.
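
For what it's worth, here is a rough sketch of the MD5 approach, not
anything the original poster has shown.  It buckets files by digest and
then, optionally, confirms each bucket byte for byte so that a deliberate
collision cannot merge two different files.  The names file_digest and
group_duplicates and the verify flag are made up for illustration, and
hashlib is assumed to be available (older Pythons only have the md5
module).

```python
import filecmp
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_duplicates(paths, verify=True):
    """Group paths whose contents are identical.

    Files are bucketed by digest first.  With verify=True (a hypothetical
    flag), each bucket is then split by a byte-for-byte comparison, so a
    digest collision cannot merge two different files.
    """
    buckets = defaultdict(list)
    for path in paths:
        buckets[file_digest(path)].append(path)

    groups = []
    for candidates in buckets.values():
        if not verify:
            groups.append(candidates)
            continue
        # Split the bucket into classes of truly identical files.
        classes = []
        for path in candidates:
            for cls in classes:
                if filecmp.cmp(cls[0], path, shallow=False):
                    cls.append(path)
                    break
            else:
                classes.append([path])
        groups.extend(classes)
    return groups
```

Each file is read once for the digest, and the expensive full comparison
only runs within a bucket, i.e. between files that already hash the same.
Keeping the first path of each group and dropping the rest leaves one
copy of each song.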

-- 
	      Stuart D. Gathman <stuart at bmsi.com>
Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.



