[Numpy-discussion] A faster median (Wirth's method)

Sturla Molden sturla at molden.no
Tue Sep 1 00:06:50 EDT 2009


We recently has a discussion regarding an optimization of NumPy's median 
to average O(n) complexity. After some searching, I found out there is a 
selection algorithm competitive in speed with Hoare's quick select. It 
has the advantage of being a lot simpler to implement. In plain Python:

import numpy as np

def wirthselect(array, k):
   
    """ Niklaus Wirth's selection algortithm """

    a = np.ascontiguousarray(array)
    if (a is array): a = a.copy()

    l = 0
    m = a.shape[0] - 1
    while l < m:
        x = a[k]
        i = l
        j = m
        while 1:
            while a[i] < x: i += 1
            while x < a[j]: j -= 1
            if i <= j:
                tmp = a[i]
                a[i] = a[j]
                a[j] = tmp
                i += 1
                j -= 1
            if i > j: break
        if j < k: l = i
        if k < i: m = j

    return a


Now, the median can be obtained in average O(n) time as:


def median(x):

    """ median in average O(n) time """

    n = x.shape[0]
    k = n >> 1
    s = wirthselect(x, k)
    if n & 1:
        return s[k]
    else:
        return 0.5*(s[k]+s[:k].max())


The beauty of this is that Wirth select is extremely easy to migrate to 
Cython:


import numpy
ctypedef numpy.double_t T # or whatever

def wirthselect(numpy.ndarray[T, ndim=1] array, int k):
   
    cdef int i, j, l, m
    cdef T x, tmp
    cdef T *a

    _array = np.ascontiguousarray(array)
    if (_array is array): _array = _array.copy()
    a = <T *> _array.data

    l = 0
    m = <int> a.shape[0] - 1
    with nogil:
        while l < m:
            x = a[k]
            i = l
            j = m
            while 1:
                while a[i] < x: i += 1
                while x < a[j]: j -= 1
                if i <= j:
                    tmp = a[i]
                    a[i] = a[j]
                    a[j] = tmp
                    i += 1
                    j -= 1
                if i > j: break
            if j < k: l = i
            if k < i: m = j

    return _array


For example, we could have a small script that generates withselect for 
all NumPy dtypes (T as template), and use a dict as jump table.

Chad, you can continue to write quick select using NumPy's C quick sort 
in numpy/core/src/_sortmodule.c.src.  When you are done, it might be 
about 10% faster than this. :-)


Reference:
http://ndevilla.free.fr/median/median.pdf



Best regards,
Sturla Molden










More information about the NumPy-Discussion mailing list