[SciPy-User] Moving median code
J. David Lee
johnl at cs.wisc.edu
Sun May 1 19:38:39 EDT 2011
On 05/01/2011 02:17 PM, Wes McKinney wrote:
> On Sun, May 1, 2011 at 3:11 PM, Wes McKinney<wesmckinn at gmail.com> wrote:
>> On Sun, May 1, 2011 at 3:08 PM, Keith Goodman<kwgoodman at gmail.com> wrote:
>>> On Mon, Apr 25, 2011 at 8:31 AM, J. David Lee<johnl at cs.wisc.edu> wrote:
>>>> In working on a project recently, I wrote a moving median code that is
>>>> about 10x faster than scipy.ndimage.median_filter. It utilizes a linked
>>>> list structure to store values and track the median value. If anyone is
>>>> interested, I've posted the code here (about 150 lines):
>>>>
>>>> http://pages.cs.wisc.edu/~johnl/median_code/median_code.c
>>> I wrapped your moving window median code in cython. Here is how the
>>> timing scales with window size:
>>>
>>>>> a = np.random.rand(1e6)
>>>>> timeit move_median(a, window=10)
>>> 10 loops, best of 3: 44.4 ms per loop
>>>>> timeit move_median(a, window=100)
>>> 10 loops, best of 3: 179 ms per loop
>>>>> timeit move_median(a, window=1000)
>>> 1 loops, best of 3: 1.96 s per loop
>>>>> timeit move_median(a, window=10000)
>>> 1 loops, best of 3: 48.3 s per loop
>>>
>>> It would be interesting to compare the timings with a double heap
>>> algorithm [1] and with a C version of Wes's skiplist. I'll try to wrap
>>> the double heap next when I get a chance. BTW, what license are you
>>> using for your code?
>>>
>>> Here's the code I used. It's the first wrap I've done so suggestions
>>> welcomed. Now that I sort of know how, I'm going to wrap everything I
>>> see!
>>>
>>> import numpy as np
>>> cimport numpy as np
>>> import cython
>>>
>>> cdef extern from "cmove_median.c":
>>> struct mm_node
>>> struct mm_list:
>>> np.npy_int64 len
>>> mm_node *head
>>> mm_node *tail
>>> mm_node *min_node
>>> mm_node *med_node
>>> void mm_init_median(mm_list *mm)
>>> void mm_insert_init(mm_list *mm, np.npy_float64 val)
>>> void mm_update(mm_list *mm, np.npy_float64 val)
>>> np.npy_float64 mm_get_median(mm_list *mm)
>>> void mm_free(mm_list *mm)
>>> np.npy_float64 mm_get_median(mm_list *mm)
>>> mm_list mm_new(np.npy_int64 len)
>>>
>>> def move_median(np.ndarray[np.float64_t, ndim=1] a, int window):
>>> cdef int n = a.size, i
>>> if window> n:
>>> raise ValueError("`window` must be less than a.size.")
>>> if window< 2:
>>> raise ValueError("I get a segfault when `window` is 1.")
>>> cdef np.ndarray[np.float64_t, ndim=1] y = np.empty(n)
>>> cdef mm_list mm = mm_new(window)
>>> for i in range(window):
>>> mm_insert_init(cython.address(mm), a[i])
>>> y[i] = np.nan
>>> mm_init_median(cython.address(mm))
>>> y[window-1] = mm_get_median(cython.address(mm))
>>> for i in range(window, n):
>>> mm_update(cython.address(mm), a[i])
>>> y[i] = mm_get_median(cython.address(mm))
>>> mm_free(cython.address(mm))
>>> return y
>>>
>>> [1] http://home.tiac.net/~cri_a/source_code/index.html#winmedian_pkg
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>> assuming a compatible license do you want to make a little github repo
>> like you mentioned? I have been interested in this problem for a while
>>
> you can see that the skiplist method is very much O(N log W). a lot of
> python overhead though, obviously
>
> In [16]: timeit rolling_median(arr, 10)
> 1 loops, best of 3: 2.95 s per loop
>
> In [17]: timeit rolling_median(arr, 100)
> 1 loops, best of 3: 4.06 s per loop
>
> In [18]: timeit rolling_median(arr, 1000)
> 1 loops, best of 3: 5.56 s per loop
>
> In [19]: timeit rolling_median(arr, 10000)
> 1 loops, best of 3: 7.78 s per loop
>
> probably the most optimal solution would be to have 2 algorithms: one
> for small window sizes (where linear updates are OK) and then switch
> over when your big-O constants "cross over" for large enough window
> size
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
I don't know what the end goal of this is, but if we're looking at scipy
itself, I think it would be nice to have a single function call that
would select the best method based on the window size, but it would also
be nice to have the option to call a specific implementation if desired,
because the speed of the median will depend on both the implementation
and the data in question.
David
More information about the SciPy-User
mailing list