[Numpy-discussion] Does np.std() make two passes through the data?
Benjamin Root
ben.root at ou.edu
Mon Nov 22 12:07:56 EST 2010
On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
> On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.kern at gmail.com> wrote:
> > On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgoodman at gmail.com> wrote:
> >
> >> But this sample gives a difference:
> >>
> >>>> a = np.random.rand(100)
> >>>> a.var()
> >> 0.080232196646619805
> >>>> var(a)
> >> 0.080232196646619791
> >>
> >> As you know, I'm trying to make a drop-in replacement for
> >> scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
> >> part. Either that, or suck it up and store the damn mean.
> >
> > The difference is less than eps. Quite possibly, the one-pass version
> > is even closer to the true value than the two-pass version.
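
For reference, the gap between those two values is about one unit in the
last place (ULP) at that magnitude; a quick check in plain NumPy, using
the two numbers quoted above:

import numpy as np

x = 0.080232196646619805   # a.var()
y = 0.080232196646619791   # var(a)
print(abs(x - y))          # ~1.4e-17
print(np.spacing(x))       # one ULP at this magnitude, also ~1.4e-17
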
>
> I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:
>
> >> a = np.random.rand(1000000)
>
> # numpy; doesn't take care of NaNs
> >> a.std()
> 0.28852169850186793
>
> # cython of np.sqrt(((arr - arr.mean())**2).mean())
> >> nanstd_twopass(a, ddof=0)
> 0.28852169850186798
>
> # cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
> >> nanstd_simple(a, ddof=0)
> 0.28852169850187437
>
> # http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
> >> nanstd_online(a, ddof=0)
> 0.28852169850187243
>
> # My target, scipy version
> >> from scipy.stats import nanstd
> >> nanstd(a, bias=True)
> 0.28852169850186798
>
> Timing:
>
> >> timeit nanstd(a, bias=True)
> 10 loops, best of 3: 27.8 ms per loop
> >> timeit a.std()
> 100 loops, best of 3: 11.5 ms per loop
> >> timeit nanstd_twopass(a, ddof=0)
> 100 loops, best of 3: 3.24 ms per loop
> >> timeit nanstd_simple(a, ddof=0)
> 1000 loops, best of 3: 1.6 ms per loop
> >> timeit nanstd_online(a, ddof=0)
> 100 loops, best of 3: 10.8 ms per loop
>
> nanstd_simple is the fastest but I assume the algorithm is no good for
> general use?
>
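The usual failure mode of that one-pass formula is catastrophic
cancellation: when the mean is large relative to the spread,
(arr*arr).mean() and arr.mean()**2 agree in nearly all their digits
and the subtraction destroys the variance. A minimal sketch in plain
NumPy (the underlying formulas, not the Cython versions):

import numpy as np

a = np.random.rand(1000) + 1e8              # large offset, small spread
twopass = ((a - a.mean()) ** 2).mean()      # ~0.083, correct
onepass = (a * a).mean() - a.mean() ** 2    # ULP here is ~2: garbage, often negative
print(twopass, onepass)                     # sqrt of a negative onepass gives nan
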
> I think I'll go with nanstd_twopass. It will most closely match
> numpy/scipy, is more robust than nanstd_simple, and is the second
> fastest.
>
> Here's the code. Improvements welcomed.
>
> # Imports assumed by these snippets:
> import numpy as np
> cimport numpy as np
> cimport cython
> from libc.math cimport sqrt
>
> cdef np.float64_t NAN = <np.float64_t> np.nan
>
> @cython.boundscheck(False)
> @cython.wraparound(False)
> def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
>     "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
>     cdef Py_ssize_t i
>     cdef int a0 = a.shape[0], count = 0
>     cdef np.float64_t asum = 0, a2sum = 0, ai
>     for i in range(a0):  # single pass: accumulate sum and sum of squares
>         ai = a[i]
>         if ai == ai:  # False only for NaN, so NaNs are skipped
>             asum += ai
>             a2sum += ai * ai
>             count += 1
>     if count > 0:
>         asum = asum * asum
>         return sqrt((a2sum - asum / count) / (count - ddof))
>     else:
>         return np.float64(NAN)
>
> @cython.boundscheck(False)
> @cython.wraparound(False)
> def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
>     "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
>     cdef Py_ssize_t i
>     cdef int a0 = a.shape[0], n = 0
>     cdef np.float64_t mean = 0, M2 = 0, delta, x
>     for i in range(a0):  # Welford's one-pass update
>         x = a[i]
>         if x == x:  # False only for NaN
>             n += 1
>             delta = x - mean
>             mean = mean + delta / n
>             M2 = M2 + delta * (x - mean)
>     if n > 0:
>         return np.float64(sqrt(M2 / (n - ddof)))
>     else:
>         return np.float64(NAN)
>
> @cython.boundscheck(False)
> @cython.wraparound(False)
> def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
>     "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
>     cdef Py_ssize_t i
>     cdef int a0 = a.shape[0], count = 0
>     cdef np.float64_t asum = 0, a2sum = 0, amean, ai, da
>     for i in range(a0):  # pass 1: mean of the non-NaN values
>         ai = a[i]
>         if ai == ai:  # False only for NaN
>             asum += ai
>             count += 1
>     if count > 0:
>         amean = asum / count
>         asum = 0
>         for i in range(a0):  # pass 2: deviations from the mean
>             ai = a[i]
>             if ai == ai:
>                 da = ai - amean
>                 asum += da  # residual sum; corrects for rounding in amean
>                 a2sum += (da * da)
>         asum = asum * asum
>         return sqrt((a2sum - asum / count) / (count - ddof))
>     else:
>         return np.float64(NAN)
>
I wonder how the results would change if the size of the array were larger
than the processor cache? I still can't seem to wrap my head around the
idea that a two-pass algorithm would be faster than a single-pass one. Is
this just a big-O thing where sometimes one algorithm will be faster than
the other based on the size of the problem?
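
One way to probe the cache question is to time the two formulas in plain
NumPy at sizes below and well above a typical L3 cache. This is only a
rough sketch (the NumPy one-liners aren't the Cython loops above, so it
only approximates the memory-traffic behavior):

import numpy as np
import timeit

def std_twopass(a):
    m = a.mean()                            # pass 1: mean
    return np.sqrt(((a - m) ** 2).mean())   # pass 2: squared deviations

def std_onepass(a):
    # sum-of-squares formula; fewer passes, but numerically fragile
    return np.sqrt((a * a).mean() - a.mean() ** 2)

for n in (10**4, 10**6, 10**7):             # 80 KB, 8 MB, 80 MB of float64
    a = np.random.rand(n)
    for f in (std_twopass, std_onepass):
        t = min(timeit.repeat(lambda: f(a), number=3, repeat=3)) / 3
        print('%.0e  %s  %.5f s' % (n, f.__name__, t))

For the Cython loops, though, the cost is more likely arithmetic than
memory: nanstd_online does a division per element and each iteration
depends on the previous mean, while the two-pass loops are straight
accumulations the compiler can pipeline.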
Ben Root