[Numpy-discussion] Does np.std() make two passes through the data?

Mon Nov 22 12:03:30 EST 2010

On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgoodman at gmail.com> wrote:
>
>> But this sample gives a difference:
>>
>>>> a = np.random.rand(100)
>>>> a.var()
>>   0.080232196646619805
>>>> var(a)
>>   0.080232196646619791
>>
>> As you know, I'm trying to make a drop-in replacement for
>> scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
>> part. Either that, or suck it up and store the damn mean.
>
> The difference is less than eps. Quite possibly, the one-pass version
> is even closer to the true value than the two-pass version.

I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:

>> a = np.random.rand(1000000)

# numpy; doesn't take care of NaNs
>> a.std()
   0.28852169850186793

# cython of np.sqrt(((arr - arr.mean())**2).mean())
>> nanstd_twopass(a, ddof=0)
   0.28852169850186798

# cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
>> nanstd_simple(a, ddof=0)
   0.28852169850187437

# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
>> nanstd_online(a, ddof=0)
   0.28852169850187243

# My target, scipy version
>> from scipy.stats import nanstd
>> nanstd(a, bias=True)
   0.28852169850186798

Timing:

>> timeit nanstd(a, bias=True)
10 loops, best of 3: 27.8 ms per loop
>> timeit a.std()
100 loops, best of 3: 11.5 ms per loop
>> timeit nanstd_twopass(a, ddof=0)
100 loops, best of 3: 3.24 ms per loop
>> timeit nanstd_simple(a, ddof=0)
1000 loops, best of 3: 1.6 ms per loop
>> timeit nanstd_online(a, ddof=0)
100 loops, best of 3: 10.8 ms per loop

nanstd_simple is the fastest but I assume the algorithm is no good for
general use?

I think I'll go with nanstd_twopass. It will most closely match
numpy/scipy, is more robust than nanstd_simple, and is the second
fastest.

Here's the code. Improvements welcomed.

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            a2sum += ai * ai
            count += 1
    if count > 0:
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], n = 0
    cdef np.float64_t mean = 0, M2 = 0, delta, x
    for i in range(a0):
        x = a[i]
        if x == x:
            n += 1
            delta = x - mean
            mean = mean + delta / n
            M2 = M2 + delta * (x - mean)
    if n > 0:
        return np.float64(sqrt(M2 / (n - ddof)))
    else:
        return np.float64(NAN)

@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da
                a2sum += (da * da)
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)