[SciPy-User] Moving median code

Sun May 1 19:38:39 EDT 2011

On 05/01/2011 02:17 PM, Wes McKinney wrote:
> On Sun, May 1, 2011 at 3:11 PM, Wes McKinney<wesmckinn at gmail.com>  wrote:
>> On Sun, May 1, 2011 at 3:08 PM, Keith Goodman<kwgoodman at gmail.com>  wrote:
>>> On Mon, Apr 25, 2011 at 8:31 AM, J. David Lee<johnl at cs.wisc.edu>  wrote:
>>>> In working on a project recently, I wrote a moving median code that is
>>>> about 10x faster than scipy.ndimage.median_filter. It utilizes a linked
>>>> list structure to store values and track the median value. If anyone is
>>>> interested, I've posted the code here (about 150 lines):
>>>>
>>>> http://pages.cs.wisc.edu/~johnl/median_code/median_code.c
>>> I wrapped your moving window median code in cython. Here is how the
>>> timing scales with window size:
>>>
>>>>> a = np.random.rand(1e6)
>>>>> timeit move_median(a, window=10)
>>> 10 loops, best of 3: 44.4 ms per loop
>>>>> timeit move_median(a, window=100)
>>> 10 loops, best of 3: 179 ms per loop
>>>>> timeit move_median(a, window=1000)
>>> 1 loops, best of 3: 1.96 s per loop
>>>>> timeit move_median(a, window=10000)
>>> 1 loops, best of 3: 48.3 s per loop
>>>
>>> It would be interesting to compare the timings with a double heap
>>> algorithm [1] and with a C version of Wes's skiplist. I'll try to wrap
>>> the double heap next when I get a chance. BTW, what license are you
>>> using for your code?
>>>
>>> Here's the code I used. It's the first wrap I've done so suggestions
>>> welcomed. Now that I sort of know how, I'm going to wrap everything I
>>> see!
>>>
>>> import numpy as np
>>> cimport numpy as np
>>> import cython
>>>
>>> cdef extern from "cmove_median.c":
>>>     struct mm_node
>>>     struct mm_list:
>>>         np.npy_int64 len
>>>         mm_node *head
>>>         mm_node *tail
>>>         mm_node *min_node
>>>         mm_node *med_node
>>>     void mm_init_median(mm_list *mm)
>>>     void mm_insert_init(mm_list *mm, np.npy_float64 val)
>>>     void mm_update(mm_list *mm, np.npy_float64 val)
>>>     np.npy_float64 mm_get_median(mm_list *mm)
>>>     void mm_free(mm_list *mm)
>>>     np.npy_float64 mm_get_median(mm_list *mm)
>>>     mm_list mm_new(np.npy_int64 len)
>>>
>>> def move_median(np.ndarray[np.float64_t, ndim=1] a, int window):
>>>     cdef int n = a.size, i
>>>     if window>  n:
>>>         raise ValueError("`window` must be less than a.size.")
>>>     if window<  2:
>>>         raise ValueError("I get a segfault when `window` is 1.")
>>>     cdef np.ndarray[np.float64_t, ndim=1] y = np.empty(n)
>>>     cdef mm_list mm = mm_new(window)
>>>     for i in range(window):
>>>         mm_insert_init(cython.address(mm), a[i])
>>>         y[i] = np.nan
>>>     mm_init_median(cython.address(mm))
>>>     y[window-1] = mm_get_median(cython.address(mm))
>>>     for i in range(window, n):
>>>         mm_update(cython.address(mm), a[i])
>>>         y[i] = mm_get_median(cython.address(mm))
>>>     mm_free(cython.address(mm))
>>>     return y
>>>
>>> [1] http://home.tiac.net/~cri_a/source_code/index.html#winmedian_pkg
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>> assuming a compatible license do you want to make a little github repo
>> like you mentioned? I have been interested in this problem for a while
>>
> you can see that the skiplist method is very much O(N log W). a lot of
> python overhead though, obviously
>
> In [16]: timeit rolling_median(arr, 10)
> 1 loops, best of 3: 2.95 s per loop
>
> In [17]: timeit rolling_median(arr, 100)
> 1 loops, best of 3: 4.06 s per loop
>
> In [18]: timeit rolling_median(arr, 1000)
> 1 loops, best of 3: 5.56 s per loop
>
> In [19]: timeit rolling_median(arr, 10000)
> 1 loops, best of 3: 7.78 s per loop
>
> probably the most optimal solution would be to have 2 algorithms: one
> for small window sizes (where linear updates are OK) and then switch
> over when your big-O constants "cross over" for large enough window
> size
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
I don't know what the end goal of this is, but if we're looking at scipy 
itself, I think it would be nice to have a single function call that 
would select the best method based on the window size, but it would also 
be nice to have the option to call a specific implementation if desired, 
because the speed of the median will depend on both the implementation 
and the data in question.

David