[Numpy-discussion] Does np.std() make two passes through the data?
Keith Goodman
kwgoodman at gmail.com
Sun Nov 21 20:49:54 EST 2010
On Sun, Nov 21, 2010 at 4:18 PM, <josef.pktd at gmail.com> wrote:
> On Sun, Nov 21, 2010 at 6:43 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> Does np.std() make two passes through the data?
>>
>> Numpy:
>>
>>>> arr = np.random.rand(10)
>>>> arr.std()
>> 0.3008736260967052
>>
>> Looks like an algorithm that makes one pass through the data (one for
>> loop) wouldn't match arr.std():
>>
>>>> np.sqrt((arr*arr).mean() - arr.mean()**2)
>> 0.30087362609670526
>>
>> But a slower two-pass algorithm would match arr.std():
>>
>>>> np.sqrt(((arr - arr.mean())**2).mean())
>> 0.3008736260967052
>>
>> Is there a way to get the same result as arr.std() in one pass (cython
>> for loop) of the data?
>
> reference several times pointed to on the list is the wikipedia page, e.g.
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
Unfortunately it doesn't give the same result as numpy's two pass algorithm.
>From wikipedia:
def var(arr):
n = 0
mean = 0
M2 = 0
for x in arr:
n = n + 1
delta = x - mean
mean = mean + delta / n
M2 = M2 + delta * (x - mean)
return M2 / n
This set of random samples gives matching variance:
>> a = np.random.rand(100)
>> a.var()
0.07867478939716277
>> var(a)
0.07867478939716277
But this sample gives a difference:
>> a = np.random.rand(100)
>> a.var()
0.080232196646619805
>> var(a)
0.080232196646619791
As you know, I'm trying to make a drop-in replacement for
scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
part. Either that, or suck it up and store the damn mean.
More information about the NumPy-Discussion
mailing list