sorting with expensive compares?

bonono at gmail.com bonono at gmail.com
Fri Dec 23 12:20:55 EST 2005


Dan Stromberg wrote:
> On Thu, 22 Dec 2005 22:06:42 +0000, Dan Stromberg wrote:
>
> >
> > Hi folks.
> >
> > Python appears to have a good sort method, but when sorting array elements
> > that are very large, and hence have very expensive compares, is there some
> > sort of already-available sort function that will merge like elements into
> > a chain, so that they won't have to be recompared as many times?
> >
> > Thanks!
>
> I probably should've been more specific.
>
> I'm wanting to sort a large number of files, like a bunch of output files
> from a large series of rsh or ssh outputs on a large series of distinct
> machines, a music collection in .ogg format (strictly redistributable and
> legally purchased music), a collection of .iso cdrom images (strictly
> redistributable and legally purchased software), and so forth.
>
> I'm not sorting the content of each file individually.
>
> I'm treating each file as a potentially very large string, and "sorting
> the strings".
>
> I've been using the following compare function, which in short checks, in
> order:
>
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file
>
> (If #1 and #2 are identical, then the file must be a hardlink to the other
> file.  Also, files that do not have the same length can never be
> identical.  And of course these items are generated on demand, making it
> frequently possible to avoid doing full-file-length compare on a lot of
> files)
>
> However, my program is still doing more #6 comparisons than seems
> strictly necessary when I could just toss all the filenames describing
> identical files into a list, and avoid re-comparing files with identical
> content over and over - I don't want to compare them to each other again
> and again), when there are a lot of identical files in the input list.
>
Why would #5 not enough as an indicator that the files are indentical ?




More information about the Python-list mailing list