Guido rethinking removal of cmp from sort method

Dan Stromberg drsalists at gmail.com
Sat Mar 26 16:29:35 EDT 2011


(Trying again to get through to python-list)

On Sat, Mar 26, 2011 at 2:10 AM, Steven D'Aprano <steve at pearwood.info>wrote:

>
> Dan Stromberg wrote:
>
>  There's also the issue of a lazy comparison function, that I don't seem to
>> have gotten a response to - if you have a Very expensive comparison
>> operation involving megabytes or even gigabytes of data, and want to
>> perform
>> just enough of it to get the job done, how does one do this with a key
>> comparator and no cmp comparator?
>>
>
>
> Okay, that's a reasonable use-case: lazy comparisons. Thank you.
>
And thank you.


> [...]
>
>  (The version at the following URL might be a little newer)
>>
>> http://stromberg.dnsalias.org/cgi-bin/viewvc.cgi/equivalence-classes/trunk/equivs2-python/?root=svn
>>
>
>
> By the way, you have a logic error in that code. You state:
>
>
> # if we were asked not to do full comparisons, just trust the result so far
> - don't do the ultra-expensive
> # "compare every single byte" procedure
>
> But that is *cheaper* that a cryptographic hash. The crypto hash still has
> to look at every byte, but rather than just compare two bytes, it slices and
> dices them using a very expensive cryptographically secure hash function.
> Nor can you bail out early if you find mismatched bytes. The hash functions
> have to read every single byte before returning.
>

It might be the surrounding context that makes the difference.

For comparing two files, byte for byte is always cheaper, or at worst the
same expense when the files are equal.

For comparing 1,000,000 files, each to all the other 999,999, the
cryptographic hash is probably cheaper in most cases. Even when the files
are all local.


> BUFFERSIZE = 256*1024  # 256K
>
> def file_compare(a, b, buffersize=None):
>    """file_compare(a, b [, buffersize]) -> int status code
>
>    Compare files with names a and b, and return an integer status code:
>      0   the files are different
>      1   the filenames refer to the same physical location in the file
>          system, e.g. a='./file.txt' and b='./dir/../file.txt'
>      2   the filenames refer to a single hardlinked file
>      3   the filenames refer to distinct files with identical contents
>
>    """
>    a = os.path.normcase( os.path.normpath(a) )
>    b = os.path.normcase( os.path.normpath(b) )
>    if a == b:  # Same file name after normalisation.
>        assert os.path.samefile(a, b)
>        return 1
>    elif ishardlink(a, b):
>        assert os.path.samefile(a, b)
>        return 2
>    assert not os.path.samefile(a, b)
>    if file_equals(a, b, buffersize):
>        return 3
>    return 0
>
>
> def ishardlink(a, b):
>    """ishardlink(a, b) -> True|False
>
>    Return True if filenames a and b are the same hardlinked file,
>    otherwise False.
>    """
>    s1 = os.stat(a)
>    s2 = os.stat(b)
>    return (s1.st_ino, s1.st_dev) == (s2.st_ino, s2.st_dev)
>
>
> def file_equals(a, b, buffersize=None):
>    """file_equals(a, b[, buffersize]) -> True|False
>
>    Return True if files with names a and b have the same content,
>    otherwise return False.
>
>    Files a and b are equal if every byte in the files match.
>    """
>    if os.path.getsize(a) != os.path.getsize(b):
>        return False
>    if buffersize is None:
>        buffersize = BUFFERSIZE
>    fa = open(a, 'rb')
>    fb = open(b, 'rb')
>    ba = bb = ''
>    while 1:
>        ba = fa.read(buffersize)
>        bb = fb.read(buffersize)
>        if ba != bb:
>            return False
>        elif ba == bb == '':
>            break
>    return True
>
> Thanks for showing me this.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20110326/951b29ca/attachment-0001.html>


More information about the Python-list mailing list