[Numpy-discussion] Does np.std() make two passes through the data?
Keith Goodman
kwgoodman at gmail.com
Mon Nov 22 12:03:30 EST 2010
On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgoodman at gmail.com> wrote:
>
>> But this sample gives a difference:
>>
>>>> a = np.random.rand(100)
>>>> a.var()
>> 0.080232196646619805
>>>> var(a)
>> 0.080232196646619791
>>
>> As you know, I'm trying to make a drop-in replacement for
>> scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
>> part. Either that, or suck it up and store the damn mean.
>
> The difference is less than eps. Quite possibly, the one-pass version
> is even closer to the true value than the two-pass version.
I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:
>> a = np.random.rand(1000000)
# numpy; doesn't take care of NaNs
>> a.std()
0.28852169850186793
# cython of np.sqrt(((arr - arr.mean())**2).mean())
>> nanstd_twopass(a, ddof=0)
0.28852169850186798
# cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
>> nanstd_simple(a, ddof=0)
0.28852169850187437
# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
>> nanstd_online(a, ddof=0)
0.28852169850187243
# My target, scipy version
>> from scipy.stats import nanstd
>> nanstd(a, bias=True)
0.28852169850186798
Timing:
>> timeit nanstd(a, bias=True)
10 loops, best of 3: 27.8 ms per loop
>> timeit a.std()
100 loops, best of 3: 11.5 ms per loop
>> timeit nanstd_twopass(a, ddof=0)
100 loops, best of 3: 3.24 ms per loop
>> timeit nanstd_simple(a, ddof=0)
1000 loops, best of 3: 1.6 ms per loop
>> timeit nanstd_online(a, ddof=0)
100 loops, best of 3: 10.8 ms per loop
nanstd_simple is the fastest but I assume the algorithm is no good for
general use?
I think I'll go with nanstd_twopass. It will most closely match
numpy/scipy, is more robust than nanstd_simple, and is the second
fastest.
Here's the code. Improvements welcomed.
@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
"nanstd of 1d numpy array with dtype=np.float64 along axis=0."
cdef Py_ssize_t i
cdef int a0 = a.shape[0], count = 0
cdef np.float64_t asum = 0, a2sum=0, ai
for i in range(a0):
ai = a[i]
if ai == ai:
asum += ai
a2sum += ai * ai
count += 1
if count > 0:
asum = asum * asum
return sqrt((a2sum - asum / count) / (count - ddof))
else:
return np.float64(NAN)
@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
"nanstd of 1d numpy array with dtype=np.float64 along axis=0."
cdef Py_ssize_t i
cdef int a0 = a.shape[0], n = 0
cdef np.float64_t mean = 0, M2 = 0, delta, x
for i in range(a0):
x = a[i]
if x == x:
n += 1
delta = x - mean
mean = mean + delta / n
M2 = M2 + delta * (x - mean)
if n > 0:
return np.float64(sqrt(M2 / (n - ddof)))
else:
return np.float64(NAN)
@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
"nanstd of 1d numpy array with dtype=np.float64 along axis=0."
cdef Py_ssize_t i
cdef int a0 = a.shape[0], count = 0
cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
for i in range(a0):
ai = a[i]
if ai == ai:
asum += ai
count += 1
if count > 0:
amean = asum / count
asum = 0
for i in range(a0):
ai = a[i]
if ai == ai:
da = ai - amean
asum += da
a2sum += (da * da)
asum = asum * asum
return sqrt((a2sum - asum / count) / (count - ddof))
else:
return np.float64(NAN)
More information about the NumPy-Discussion
mailing list