[Numpy-discussion] Does np.std() make two passes through the data?

Keith Goodman kwgoodman at gmail.com
Sun Nov 21 21:33:57 EST 2010


On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgoodman at gmail.com> wrote:
>
>> But this sample gives a difference:
>>
>>>> a = np.random.rand(100)
>>>> a.var()
>>   0.080232196646619805
>>>> var(a)
>>   0.080232196646619791
>>
>> As you know, I'm trying to make a drop-in replacement for
>> scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
>> part. Either that, or suck it up and store the damn mean.
>
> The difference is less than eps. Quite possibly, the one-pass version
> is even closer to the true value than the two-pass version.

Good, it passes the Kern test.

Here's an even more robust estimate:

>> var(a - a.mean())
   0.080232196646619819
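(For reference, the centered estimate above just subtracts the sample mean before computing the variance, which reduces cancellation error. A minimal sketch, using a seeded generator rather than the unseeded `np.random.rand` call above:)

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(100)

# Centered two-pass estimate: variance of the mean-subtracted data.
truth = ((a - a.mean()) ** 2).mean()

# NumPy's a.var() should agree with this to within a few eps.
assert abs(a.var() - truth) < 1e-15
```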

Which is better: NumPy's two-pass method or the one-pass online method?

>> test()
NumPy error: 9.31135e-18
Nanny error: 6.5745e-18  <-- One pass wins!

import numpy as np

def test(n=100000):
    # `var` is the one-pass (Nanny) variance; a.var() is NumPy's two-pass.
    numpy = 0.0
    nanny = 0.0
    for i in range(n):
        a = np.random.rand(10)
        # Centered two-pass estimate, used as the reference value.
        truth = var(a - a.mean())
        numpy += np.absolute(truth - a.var())
        nanny += np.absolute(truth - var(a))
    print 'NumPy error: %g' % (numpy / n)
    print 'Nanny error: %g' % (nanny / n)
    print
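(The one-pass `var` isn't shown in this message. A minimal sketch of what such a function might look like, using Welford's online algorithm — the actual Nanny implementation may differ:)

```python
import numpy as np

def var(a):
    # One-pass (Welford) estimate of the population variance (ddof=0).
    # Updates the running mean and the sum of squared deviations (m2)
    # in a single sweep over the data, without storing the mean first.
    mean = 0.0
    m2 = 0.0
    for k, x in enumerate(a, 1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / len(a)

# Agrees with NumPy's two-pass result to within floating-point error:
x = np.arange(10.0)
assert abs(var(x) - x.var()) < 1e-12
```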
