[Numpy-discussion] Sorting refactor

Fri Jan 16 06:33:32 EST 2015

2015-01-16 11:55 GMT+01:00  <numpy-discussion-request at scipy.org>:
> Message: 2
> Date: Thu, 15 Jan 2015 21:24:00 -0800
> From: Jaime Fern?ndez del R?o <jaime.frio at gmail.com>
> Subject: [Numpy-discussion] Sorting refactor
> To: Discussion of Numerical Python <numpy-discussion at scipy.org>
> Message-ID:
>         <CAPOWHWkF6RnWcrGmcwsmq_LO3hShjgBVLsrN19z-MDPe25E2Aw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> This changes will make it easier for me to add a Timsort generic type
> function to numpy's arsenal of sorting routines. And I think they have
> value by cleaning the source code on their own.

Yes, they do. I've been looking at the sorting functions as well and
I've found the following:

* The code is generally hard to read because it prefers pointer over
indices. I'm wondering if it would get slower using indices. The
closer these algorithms are to the textbook, the easier to insert
fancy optimizations.

* The heap sort exploits undefined behavior by using a pointer that
points before the start of the array. However, rewriting it to always
point within the array made it slower. I haven't tried rewriting it
using indices.

* Quicksort has a quadratic time worst case. I think it should be
turned into an introsort [1] for O(n log n) worst case; we have the
heapsort needed to do that.

* Quicksort is robust to repeated elements, but doesn't exploit them.
It can be made to run in linear time if the input array has only O(1)
distinct elements [2]. This may come at the expense of some
performance on arrays with no repeated elements.

* Using optimal sorting networks instead of insertion sort as the base
case can speed up quicksort on float arrays by 5-10%, but only if NaNs
are moved out of the way first so that comparisons become cheaper [3].
This has consequences for the selection algorithms that I haven't
fully worked out yet.

* Using Cilk Plus to parallelize merge sort can make it significantly
faster than quicksort at the expense of only a few lines of code, but
I haven't checked whether Cilk Plus plays nicely with multiprocessing
and other parallelism options (remember the trouble with OpenMP-ified
OpenBLAS?).

This isn't really an answer to your questions, more like a brain dump
from someone who's stared at the same code for a while and did some
experiments. I'm not saying we should implement all of this, but keep
in mind that there are some interesting options besides implementing
timsort.

[1] https://en.wikipedia.org/wiki/Introsort
[2] http://www.sorting-algorithms.com/static/QuicksortIsOptimal.pdf
[3] https://github.com/larsmans/numpy/tree/sorting-nets