on sorting things

Thu Dec 19 03:46:51 EST 2019

Eli the Bearded wrote:

> I recently saw a link to an old post on a blog and then started looking
> at the newer posts. This one:
> 
> https://leancrew.com/all-this/2019/11/the-key-to-sorting-in-python/
> 
> discusses ways to deal with useful sorting of movie / television show
> titles. Some initial words should be re-ordered for sorting purposes
> (_Adventures of Huckleberry Finn, The_), roman numbers should sort like
> regular numbers (_Rocky V_ comes before _Rocky IX_), and something needs
> to be done about sorting accented vowels.
> 
> But what caught my eye most, as someone relatively new to Python but
> with long experience in C in Perl, is sorting doesn't take a
> *comparison* function, it takes a *key generator* function, and that
> function is supposed to transform the thing into something that the
> native comparison knows how to compare.
> 
> This seems a strange choice, and I'm wondering if someone can explain
> the benefits of doing it that way to me.

Python 2 started with a comparison function and then grew a key function.
With a key function you still have to compare items, you are just breaking 
the comparison into two steps:

- Calculate the key. This can be slow as it has to be done once per item.
- Compare the items. The key is usually a string or tuple, and for those
  the comparison runs C code which is a bit faster. The gain is even greater
  as the comparison is performed much more often.

The disadvantages of the key-based approach are that some comparison 
functions cannot naturally be rewritten in this way and that memory usage is 
a bit higher. Both are typically "expert" problems, so the developers 
decided to "nudge" users to consider keys first.

Now take a look at this example written in Python 2:

import random
import time
import contextlib

@contextlib.contextmanager
def measure(message):
    start = time.time()
    yield
    span = time.time() - start
    print message.format(span)

with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()
random.shuffle(words)

def key_lower(a):
    global key_calls
    key_calls += 1
    return a.lower()

def cmp_lower(a, b):
    global cmp_calls
    cmp_calls += 1
    return cmp(a.lower(), b.lower())

key_calls = 0
cmp_calls = 0

with measure("cmp-based sort took {} secs"):
    words[:].sort(cmp_lower)

with measure("key-bases sort took {} secs"):
    words[:].sort(key=key_lower)

print "cmp called", cmp_calls, "times"
print "key called", key_calls, "times"

$ python2 sort_demo.py 
cmp-based sort took 1.50221800804 secs
key-bases sort took 0.293763160706 secs
cmp called 1516111 times
key called 99171 times