sorting with expensive compares?

Dan Stromberg strombrg at dcs.nac.uci.edu
Fri Dec 23 12:10:22 EST 2005


On Thu, 22 Dec 2005 22:06:42 +0000, Dan Stromberg wrote:

> 
> Hi folks.
> 
> Python appears to have a good sort method, but when sorting array elements
> that are very large, and hence have very expensive compares, is there some
> sort of already-available sort function that will merge like elements into
> a chain, so that they won't have to be recompared as many times?
> 
> Thanks!

I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".

I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file.  Also, files that do not have the same length can never be
identical.  And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over - I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.

Thanks!

def __cmp__(self,other):
#		sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))
		if verbosity >= 1:
			sys.stderr.write('Comparing file_class objects %s and %s\n' %
			(self.name, other.name))
		if self.device == -1:
			if verbosity:
				sys.stderr.write('Getting stat data for file %s\n' % self.name)
			result = os.stat(self.name)
			self.device = result[stat.ST_DEV]
			self.inode = result[stat.ST_INO]
			self.size = result[stat.ST_SIZE]
		if other.device == -1:
			if verbosity:
				sys.stderr.write('Getting stat data for file %s\n' % other.name)
			result = os.stat(other.name)
			other.device = result[stat.ST_DEV]
			other.inode = result[stat.ST_INO]
			other.size = result[stat.ST_SIZE]
		if self.device == other.device and self.inode == other.inode:
			# same device, same inode, must be identical files return 0
		if self.length < other.length:
			return -1
		elif self.length > other.length:
			return 1
		# if we've reached this point, the files are not hardlinks, and their
		lengths are identical # so slurp in the prefixes if needed, then compare
		them if self.prefix == -1:
			if verbosity:
				sys.stderr.write('Getting prefix for file %s\n' % self.name)
			file = open(self.name, 'r')
			self.prefix = file.read(self.max_prefix_length) file.close()
		if other.prefix == -1:
			if verbosity:
				sys.stderr.write('Getting prefix for file %s\n' % other.name)
			file = open(other.name, 'r')
			other.prefix = file.read(self.max_prefix_length) file.close()
		if self.prefix < other.prefix:
			return -1
		elif self.prefix > other.prefix:
			return 1
		# if we've reached this point, we know that: # 1) The files are not
		hardlinks of each other # 2) They have identical sizes # 3) Their
		prefixes are identical
		# 4) We haven't yet tried doing a cryptographic hash, so compute them if
		needed, and compare them if self.hash == '':
			self.gen_hash()
		if other.hash == '':
			other.gen_hash()
		if self.hash < other.hash:
			return -1
		elif self.hash > other.hash:
			return 1
		# if we've reached this point, we know that: # 1) The files are not
		hardlinks of each other # 2) They have identical sizes # 3) Their
		prefixes are identical
		# 4) The cryptographic hashes are identical # 5) All that remains is a
		"long form" comparison, so bite the bullet and do the I/O if verbosity:
			sys.stderr.write('Doing byte for byte comparison of %s and %s\n' %
			(self.name, other.name))
		self_fp = open(self.name, 'r')
		other_fp = open(other.name, 'r')
		while 1:
			self_buf = self_fp.read(self.bufsize) other_buf =
			other_fp.read(self.bufsize) if self_buf < other_buf:
				self_fp.close()
				other_fp.close()
				return -1
			elif self_buf > other_buf:
				self_fp.close()
				other_fp.close()
				return 1
			if not self_buf and not other_buf:
				self_fp.close()
				other_fp.close()
				return 0
			if not self_buf:
				self_fp.close()
				other_fp.close()
				return -1
			if not other_buf:
				self_fp.close()
				other_fp.close()
				return 1





More information about the Python-list mailing list