sorting with expensive compares?

Alex Martelli aleax at mail.comcast.net
Fri Dec 23 13:04:36 EST 2005


Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:
   ...
> I'm wanting to sort a large number of files, like a bunch of output files
> from a large series of rsh or ssh outputs on a large series of distinct
> machines, a music collection in .ogg format (strictly redistributable and
> legally purchased music), a collection of .iso cdrom images (strictly
> redistributable and legally purchased software), and so forth.
> 
> I'm not sorting the content of each file individually.
> 
> I'm treating each file as a potentially very large string, and "sorting
> the strings".

OK, very clear.

> I've been using the following compare function, which in short checks, in
> order:
> 
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file

Makes sense, including the possibility of computing some items on
demand.  However, using a key-callable rather than a cmp-callable may
still make sense -- you just need a callable that extracts the
attributes on demand (and caches them thereafter... assuming you have
enough memory to keep all the caches, I guess [6] might be a problem).
I do see that key-callables won't necessarily work easily here, but
since the key-callable's __cmp__ will in turn be called, that might
still work better... but, on reflection, it's probably just a minor
gain.
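The on-demand, cached-attribute key-callable might be sketched roughly as follows (the `FileKey` class and its helper names are mine, not from the original exchange; only size, head, and md5 are shown, and Python's sort only needs `__lt__`):

```python
import hashlib
import os

class FileKey:
    """Wraps a path; expensive attributes are computed once, on demand."""
    def __init__(self, path):
        self.path = path
        self._cache = {}

    def _get(self, name, compute):
        # Compute the attribute the first time it is asked for, then cache it.
        if name not in self._cache:
            self._cache[name] = compute()
        return self._cache[name]

    @property
    def size(self):
        return self._get('size', lambda: os.path.getsize(self.path))

    @property
    def head(self):
        def read_head():
            with open(self.path, 'rb') as f:
                return f.read(4096)
        return self._get('head', read_head)

    @property
    def md5(self):
        def hash_file():
            h = hashlib.md5()
            with open(self.path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            return h.hexdigest()
        return self._get('md5', hash_file)

    def __lt__(self, other):
        # Cheap attributes first; expensive ones only when needed.
        if self.size != other.size:
            return self.size < other.size
        if self.head != other.head:
            return self.head < other.head
        return self.md5 < other.md5
```

With this, `sorted(paths, key=FileKey)` never hashes a file whose size or
first block already decides the comparison -- but, as noted, every key
object does hold its caches in memory for the duration of the sort.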

> However, my program is still doing more #6 comparisons than seems
> strictly necessary when I could just toss all the filenames describing
> identical files into a list, and avoid re-comparing files with identical
> content over and over (I don't want to compare them to each other again
> and again), when there are a lot of identical files in the input list.

The comparison should never get called twice on the same two objects,
anyway.  So, I'm not sure what you think you could gain by optimizing
future comparisons of two objects once you've ascertained they are in
fact equal.  Still, assuming for example that self.name is a unique
identifier (i.e. the so-called name is in fact a complete path), the
optimization (memoization) is quite easy to perform.  Rename what you
currently have as __cmp__ to another name, say _real_cmp, add a
_compared dict to the class, and code the following __cmp__ ...:

_compared = {}
def __cmp__(self, other):
    try: return -self._compared[other.name, self.name]
    except KeyError: pass
    key = self.name, other.name
    if key in self._compared: return self._compared[key]
    result = self._compared[key] = self._real_cmp(other)
    return result
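(For what it's worth, the same memoizing wrapper can be exercised in today's Python, where __cmp__ is gone, via functools.cmp_to_key; FileProxy below is a hypothetical stand-in for the real file objects, with a string playing the role of the expensive-to-compare contents:)

```python
import functools

class FileProxy:
    """Hypothetical stand-in: 'contents' is a string, and _real_cmp
    plays the role of the expensive multi-stage comparison."""
    _compared = {}  # shared memo: (name_a, name_b) -> comparison result

    def __init__(self, name, contents):
        self.name = name          # assumed unique (a complete path)
        self.contents = contents

    def _real_cmp(self, other):
        # The expensive comparison would go here; plain ordering suffices.
        return (self.contents > other.contents) - (self.contents < other.contents)

    def cmp(self, other):
        # The memoizing wrapper, as an ordinary method since Python 3
        # never calls __cmp__.
        try:
            return -self._compared[other.name, self.name]
        except KeyError:
            pass
        key = self.name, other.name
        if key in self._compared:
            return self._compared[key]
        result = self._compared[key] = self._real_cmp(other)
        return result

files = [FileProxy('a', 'zz'), FileProxy('b', 'aa'), FileProxy('c', 'mm')]
ordered = sorted(files, key=functools.cmp_to_key(FileProxy.cmp))
```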

I would not expect this optimization to matter at all, because no key
should ever be already present in the self._compared dictionary (the
same two files should never, ever get compared twice anyway).

However, it might be possible to extend this idea by using the
properties you know an ordering should have -- if A and B have never
been compared, but both have been compared with C, in some cases you
don't need to compare A and B but may exploit already-known results.
For example, if A==C and B==C, you already know that A==B without any
actual comparison; if A<C and B==C, you already know that A<B; and so
on.  Of course in some cases you still must work, e.g. if you know A<C
and B<C that doesn't help.  I'm not sure if this would help at all,
either: it's quite possible that the sort algorithm already exploits
these possibilities internally.  But, perhaps, it may be worth a try.
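One concrete way to exploit the A==C, B==C case is to collapse files into equivalence classes before sorting, so members of a class are never compared to each other at all.  The grouping-by-hash sketch below is my own (it treats hash-equal as content-equal, and it reads each file once up front, which is exactly the cost Dan's staged scheme tries to defer -- so it's an illustration of the idea, not a drop-in replacement):

```python
import hashlib
from collections import defaultdict

def _read(path):
    with open(path, 'rb') as f:
        return f.read()

def sort_collapsing_duplicates(paths, read=_read):
    # Files whose contents hash identically form one equivalence class:
    # once A==C and B==C are known, A==B needs no further comparison.
    groups = defaultdict(list)
    for p in paths:
        groups[hashlib.md5(read(p)).hexdigest()].append(p)
    # Sort one representative per class by content, then expand the classes.
    ordered = sorted(groups.values(), key=lambda g: read(g[0]))
    return [p for group in ordered for p in group]
```

With many identical files in the input, each class is sorted once,
however many paths it contains.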


Alex


