sorting with expensive compares?

Steven D'Aprano steve at REMOVETHIScyber.com.au
Fri Dec 23 23:47:17 EST 2005


On Fri, 23 Dec 2005 17:10:22 +0000, Dan Stromberg wrote:

> I probably should've been more specific.
> 
> I'm wanting to sort a large number of files, like a bunch of output files
> from a large series of rsh or ssh outputs on a large series of distinct
> machines, a music collection in .ogg format (strictly redistributable and
> legally purchased music), a collection of .iso cdrom images (strictly
> redistributable and legally purchased software), and so forth.
> 
> I'm not sorting the content of each file individually.
> 
> I'm treating each file as a potentially very large string, and "sorting
> the strings".

Which is a very strange thing to do, but I'll assume you have a good
reason for doing so.

> I've been using the following compare function, 

My first suggestion is that you invest a bit of time in building a
small-scale set of data that you can test your sorting on, if you haven't
already done so. On multi-megabyte files, taking the MD5 hash or comparing
the content byte by byte is quite slow. So I would spend a few minutes
populating a directory with a dozen or so fake iso and ogg files, only a
few bytes in length each, and run my tests on them.
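
For example, something like this (an untested sketch; the directory and
file names are just for illustration) will populate a test directory
with tiny files:

import os

os.mkdir('testdata')
for i in range(12):
    f = open(os.path.join('testdata', 'fake%02d.iso' % i), 'wb')
    f.write('header' + chr(65 + i % 3) * 10)  # a few bytes each
    f.close()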

> which in short checks, in order:
> 
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file
> 
> (If #1 and #2 are identical, then the file must be a hardlink to the
> other file.  Also, files that do not have the same length can never be
> identical.  And of course these items are generated on demand, making it
> frequently possible to avoid doing full-file-length compare on a lot of
> files)

My second suggestion is to encapsulate vigorously, at least for testing.
At the moment, you have a single __cmp__ function that does everything
in-line. I would do something like this:

def __cmp__(self, other):
    result = compare_by_device_number_and_inode(self, other)
    if result is None:
        result = compare_by_file_length(self, other)
        if result is None:
            ... 
    return result

You can always unroll the function calls later, but for now it helps
with debugging: you only need to think about one type of comparison at
a time, and it is easier to see what each function needs to do.
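
If it helps, here is how the whole chain might look once it is filled
in, assuming each helper returns None to mean "no decision" (the last
three helper names are my own invention):

def __cmp__(self, other):
    comparisons = (compare_by_device_number_and_inode,
                   compare_by_file_length,
                   compare_by_header,
                   compare_by_hash,
                   compare_by_content)
    for compare in comparisons:
        result = compare(self, other)
        if result is not None:
            return result
    return 0  # every test was inconclusive: treat the files as equal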


> However, my program is still doing more #6 comparisons than seems
> strictly necessary when I could just toss all the filenames describing
> identical files into a list, and avoid re-comparing files with identical
> content over and over (I don't want to compare them to each other again
> and again), when there are a lot of identical files in the input list.

As others have pointed out, Python's sort never compares the same pair
of objects more than once.
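
If you want to convince yourself, count the calls with a toy class (a
quick, untested demonstration):

class Noisy:
    count = 0
    def __init__(self, n):
        self.n = n
    def __cmp__(self, other):
        Noisy.count += 1
        return cmp(self.n, other.n)

L = [Noisy(n) for n in (3, 1, 4, 1, 5, 9, 2, 6)]
L.sort()
print Noisy.count  # well under the 28 distinct pairs

(But note that two *distinct* objects with identical contents still
have to be compared afresh; sort has no way of knowing they are equal
until it compares them.)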

Some more comments follow interleaved with your code:


> def __cmp__(self,other):
> #		sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))

Are you sure you don't want to just sort by file name? That would be much
simpler *wink*

> 		if verbosity >= 1:
> 			sys.stderr.write('Comparing file_class objects %s and %s\n' %
> 			(self.name, other.name))
> 		if self.device == -1:
> 			if verbosity:
> 				sys.stderr.write('Getting stat data for file %s\n' % self.name)
> 			result = os.stat(self.name)
> 			self.device = result[stat.ST_DEV]
> 			self.inode = result[stat.ST_INO]
> 			self.size = result[stat.ST_SIZE]

Since you need to compare stat data for all these objects, why not
simply pre-fetch them when you create the object? That way you can
simplify your __cmp__ function significantly.
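
Something along these lines, say (my sketch; it assumes the os and
stat modules are already imported, as in your code):

def __init__(self, name):
    self.name = name
    result = os.stat(name)
    self.device = result[stat.ST_DEV]
    self.inode = result[stat.ST_INO]
    self.size = result[stat.ST_SIZE]
    self.prefix = -1  # the expensive fields can stay lazy
    self.hash = ''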

[snip]

> 		if self.device == other.device and self.inode == other.inode:
> 			# same device, same inode, must be identical files return 0

At first I thought that my news client had mangled your function, but
going back to the original post, either *your* client has mangled it, or
I've found a bug in your code. "return 0" is commented out.
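
Presumably that should read:

if self.device == other.device and self.inode == other.inode:
    # same device, same inode, must be identical files
    return 0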

Is it possible that all the excess file comparisons are being done only
for hard links?

[snip]


> 			if verbosity:
> 				sys.stderr.write('Getting prefix for file %s\n' % self.name)
> 			file = open(self.name, 'r')
> 			self.prefix = file.read(self.max_prefix_length)
> 			file.close()

"prefix" is probably not the best name. Maybe "header" (as in "file
header")?

If the prefix/header is short enough, don't worry about fetching on
demand, at least not while you are still debugging. Just pre-fetch it, get
your code working, and then convert back to fetching the header on demand.
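
That is, in __init__, something like this (note 'rb' rather than 'r':
you are comparing binary files, and on Windows text mode will corrupt
the data you read):

f = open(self.name, 'rb')
self.header = f.read(self.max_prefix_length)
f.close()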


> 		if other.prefix == -1:
> 			if verbosity:
> 				sys.stderr.write('Getting prefix for file %s\n' % other.name)
> 			file = open(other.name, 'r')
> 			other.prefix = file.read(self.max_prefix_length)
> 			file.close()
> 		if self.prefix < other.prefix:
> 			return -1
> 		elif self.prefix > other.prefix:
> 			return 1
> 		# if we've reached this point, we know that:
> 		# 1) The files are not hardlinks of each other
> 		# 2) They have identical sizes
> 		# 3) Their prefixes are identical
> 		# 4) We haven't yet tried doing a cryptographic hash, so
> 		#    compute them if needed, and compare them
> 		if self.hash == '':
> 			self.gen_hash()
> 		if other.hash == '':
> 			other.gen_hash()

Without knowing what gen_hash() does, it is hard to say whether the bug
might lie here. If you are using the tried-and-tested MD5 module, that
should be good, but if you have tried to roll your own, I would seriously
consider the possibility that you have a bug in the gen_hash() method.
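
For reference, a gen_hash() built on the standard md5 module might look
like this (my sketch; it reads in chunks so that you don't slurp an
entire iso image into memory):

import md5

def gen_hash(self):
    m = md5.new()
    f = open(self.name, 'rb')
    while 1:
        buf = f.read(self.bufsize)
        if not buf:
            break
        m.update(buf)
    f.close()
    self.hash = m.hexdigest()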


> 		if self.hash < other.hash:
> 			return -1
> 		elif self.hash > other.hash:
> 			return 1
> 		# if we've reached this point, we know that:
> 		# 1) The files are not hardlinks of each other
> 		# 2) They have identical sizes
> 		# 3) Their prefixes are identical
> 		# 4) The cryptographic hashes are identical
> 		# 5) All that remains is a "long form" comparison, so bite
> 		#    the bullet and do the I/O

Or just accept that the probability of a collision is so small that you
aren't ever going to find one.

According to this website:

http://www.iusmentis.com/technology/encryption/pgp/pgpattackfaq/hash/

the chance of any two distinct files sharing an MD5 hash is about one
in 2**128, or one in three hundred trillion trillion trillion. Even
allowing for the birthday paradox, you would need (very roughly) 2**64,
or about twenty billion billion, files before expecting to find a
single collision among them.

Somehow, I don't think you've got that many legal ogg files *grin*.

If you are finding even a single MD5 collision in a trillion comparisons,
it would be a miracle, or more likely a bug.
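
The back-of-the-envelope arithmetic, if you want to check it:

import math

# birthday bound for a 128-bit hash: you need roughly
# sqrt(2 * 2**128 * ln 2) files for a 50% chance of any collision
print math.sqrt(2 * 2.0**128 * math.log(2))  # about 2.2e19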


> 		if verbosity:
> 			sys.stderr.write('Doing byte for byte comparison of %s and %s\n' %
> 			(self.name, other.name))
> 		self_fp = open(self.name, 'r')
> 		other_fp = open(other.name, 'r')
> 		while 1:
> 			self_buf = self_fp.read(self.bufsize)
> 			other_buf = other_fp.read(self.bufsize)
> 			if self_buf < other_buf:
> 				self_fp.close()
> 				other_fp.close()
> 				return -1
> 			elif self_buf > other_buf:
> 				self_fp.close()
> 				other_fp.close()
> 				return 1
> 			if not self_buf and not other_buf:
> 				self_fp.close()
> 				other_fp.close()
> 				return 0
> 			if not self_buf:
> 				self_fp.close()
> 				other_fp.close()
> 				return -1
> 			if not other_buf:
> 				self_fp.close()
> 				other_fp.close()
> 				return 1

But you've stated that an invariant of the problem is that if you get
to the point of comparing file contents, the files must be the same
size. That means it is impossible for one buffer to be empty when the
other is not. If you find that condition, you should raise an error --
it means the file changed size in the time between checking that the
lengths are equal and reading the file contents into a buffer.
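
In other words, inside the read loop, something like:

self_buf = self_fp.read(self.bufsize)
other_buf = other_fp.read(self.bufsize)
if len(self_buf) != len(other_buf):
    raise RuntimeError('%s or %s changed size during comparison'
                       % (self.name, other.name))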



-- 
Steven.



