[Python-Dev] Drastically improving list.sort() for lists of strings/ints

Tim Peters tim.peters at gmail.com
Tue Sep 13 14:08:35 EDT 2016


[Terry Reedy <tjreedy at udel.edu>]
>> Tim Peters investigated and empirically determined that an
>> O(n*n) binary insort, as he optimized it on real machines, is faster
>> than O(n*logn) sorting for up to around 64 items.

[Nikolaus Rath <Nikolaus at rath.org>]
> Out of curiosity: is this test repeated periodically on different
> architectures? Or could it be that it only ever was true 10 years ago on
> Tim's Power Mac G5 (or whatever he used)?

It has little to do with architecture, but much to do with the
relative cost of comparisons versus pointer-copying.  Near the end of

https://github.com/python/cpython/blob/master/Objects/listsort.txt

"""
BINSORT
A "binary insertion sort" is just like a textbook insertion sort, but instead
of locating the correct position of the next item via linear (one at a time)
search, an equivalent to Python's bisect.bisect_right is used to find the
correct position in logarithmic time.  Most texts don't mention this
variation, and those that do usually say it's not worth the bother:  insertion
sort remains quadratic (expected and worst cases) either way.  Speeding the
search doesn't reduce the quadratic data movement costs.

But in CPython's case, comparisons are extraordinarily expensive compared to
moving data, and the details matter.  Moving objects is just copying
pointers.  Comparisons can be arbitrarily expensive (can invoke arbitrary
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
decisions are made at runtime:  what's the type of the left comparand?  the
type of the right?  do they need to be coerced to a common type?  where's the
code to compare these types?  And so on.  Even the simplest Python comparison
triggers a large pile of C-level pointer dereferences, conditionals, and
function calls.

So cutting the number of compares is almost always measurably helpful in
CPython, and the savings swamp the quadratic-time data movement costs for
reasonable minrun values.
"""

Binsort does a close to optimal number of comparisons on randomly
ordered data, and that's the point.  Also, in the context of the
overall sorting algorithm, binsort is used to _extend_ the length of a
naturally occurring "too short" run.  There's no need to sort the
whole thing from scratch, because we already know the prefix is
sorted.  That makes binsort a more-or-less obvious choice: it takes
full advantage of knowing that the prefix is already ordered.
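
That run-extension step looks roughly like this in pure Python
(hypothetical names; the real count_run/binarysort code in
listobject.c also handles descending runs by reversing them first):

    import bisect

    def extend_short_run(a, lo, hi, minrun):
        # Assumes lo < hi.  Find the end of the natural ascending run
        # starting at lo.
        run_end = lo + 1
        while run_end < hi and a[run_end - 1] <= a[run_end]:
            run_end += 1
        # If the run is shorter than minrun, binary-insert just enough
        # extra elements, exploiting the already-sorted prefix
        # a[lo:run_end].
        force = min(lo + minrun, hi)
        for i in range(run_end, force):
            x = a[i]
            pos = bisect.bisect_right(a, x, lo, i)
            a[pos + 1:i + 1] = a[pos:i]
            a[pos] = x
        return max(run_end, force)  # index just past the sorted run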

As that doc also says:

"""
When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.
"""

So it settled on forcing minrun into the range 32 <= minrun <= 64 (the
precise value depends on the number of elements in the entire array,
for reasons also explained in that doc).  That's far from either end
where the value clearly mattered.
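
The rule itself is tiny; in effect it takes the six most significant
bits of N and adds 1 if any of the remaining bits are set.  A rough
Python equivalent of the C helper (merge_compute_minrun):

    def merge_compute_minrun(n):
        # Take the 6 most significant bits of n, and add 1 if any of
        # the remaining bits are set.  For n < 64 this is just n, so
        # the whole array is handled by one binary insertion sort.
        r = 0
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r

For example, that gives 32 for n = 2048 but 33 for n = 2049, keeping
n/minrun at, or just under, a power of 2, which that doc explains is
what the merging phase likes best.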

If the full path through Python's expensive
PyObject_RichCompareBool(X, Y, Py_LT) has gotten significantly faster,
a smaller minrun range may make more sense now; or if it's gotten
significantly slower, a larger minrun range.

But, no, I don't believe anyone retests it.

IIRC, when the algorithm was adopted in Java, they found a minrun
range of 16 through 32 worked marginally better for them, because
_their_ spelling of PyObject_RichCompareBool (for Java object
comparison methods) is faster than CPython's.

