on sorting things

Peter Otten __peter__ at web.de
Fri Dec 20 13:01:34 EST 2019


Eli the Bearded wrote:

> In comp.lang.python, Peter Otten  <__peter__ at web.de> wrote:
>> Eli the Bearded wrote:
>>> But what caught my eye most, as someone relatively new to Python but
>>> with long experience in C in Perl, is sorting doesn't take a
> 
> s/C in /C and/
> 
> Ugh.
> 
>>> *comparison* function, it takes a *key generator* function, and that
>>> function is supposed to transform the thing into something that the
>>> native comparison knows how to compare.
>>> 
>>> This seems a strange choice, and I'm wondering if someone can explain
>>> the benefits of doing it that way to me.
>> 
>> Python 2 started with a comparison function and then grew a key function.
>> With a key function you still have to compare items, you are just
>> breaking the comparison into two steps:
> 
> [snip]
> 
> Thanks for that good explanation. The benchmark comparison makes it
> very thorough.
> 
> In my mind I gravitate towards the complicated sorts of sort that can be
> quickly compared for some sorts of keys and not as quickly for others.
> 
> Consider a sort that first compares file size and if the same number of
> bytes, then compares file checksum. Any decently scaled real world
> implementation would memoize the checksum for speed, but only work it out
> for files that do not have a unique file size. The key method requires
> it worked out in advance for everything.

Oscar already mentioned the functools.cmp_to_key adapter, which makes this 
a non-issue:

from functools import cmp_to_key

def mycmp(a, b): ...

files.sort(key=cmp_to_key(mycmp))
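For a concrete (made-up) illustration of the same pattern, here is a comparison function that orders strings by length first, then alphabetically, plugged in unchanged via cmp_to_key:

```python
from functools import cmp_to_key

def mycmp(a, b):
    # Order by length first, then lexicographically on ties.
    if len(a) != len(b):
        return len(a) - len(b)
    return (a > b) - (a < b)

words = ["pear", "fig", "apple", "kiwi"]
words.sort(key=cmp_to_key(mycmp))
# words is now ['fig', 'kiwi', 'pear', 'apple']
```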


Applied to your example, with memoization:

# untested
import functools
from os.path import getsize

def cmp(a, b):
    # Python 3 has no built-in cmp(); this is the usual replacement.
    return (a > b) - (a < b)

def make_file_key():
    # Memoize so size and checksum are computed at most once per file,
    # and the checksum only for files whose sizes actually collide.
    size = functools.lru_cache(None)(getsize)
    checksum = functools.lru_cache(None)(getchecksum)  # getchecksum: your helper

    @functools.cmp_to_key
    def file_key(a, b):
        # Compare by size first; fall back to checksum only on a tie.
        return cmp(size(a), size(b)) or cmp(checksum(a), checksum(b))
    return file_key

files.sort(key=make_file_key())

> But I see the key method handles the memoization under the hood for you,
> so those simpler, more common sorts of sort get an easy to see benefit.
> 
> Elijah
> ------
> even memoizing the stat() calls would help for large lists

PS: If you are sorting files by size and checksum as part of a deduplication 
effort, consider using dicts instead:

from collections import defaultdict

def grouped(items, key):
    result = defaultdict(list)
    for item in items:
        result[key(item)].append(item)
    return result

for same_size in grouped(files, key=getsize).values():
    if len(same_size) > 1:
        for same_checksum in grouped(same_size, key=getchecksum).values():
            if len(same_checksum) > 1:
                print(same_checksum)
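A toy run of the grouped() helper (repeated here so the snippet is self-contained, with string length standing in for getsize) shows the shape of the result:

```python
from collections import defaultdict

def grouped(items, key):
    result = defaultdict(list)
    for item in items:
        result[key(item)].append(item)
    return result

# Items with equal keys end up in the same list, in input order.
groups = grouped(["aa", "b", "cc", "d"], key=len)
# groups == {2: ['aa', 'cc'], 1: ['b', 'd']}
```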



