[Numpy-discussion] please change mean to use dtype=float

Tim Hochberg tim.hochberg at ieee.org
Wed Sep 20 15:02:38 EDT 2006


[Sorry, this version should have less munged formatting since I clipped 
the comments. Oh, and the Kahan sum algorithm was grabbed from 
Wikipedia, not MathWorld]

Tim Hochberg wrote:
> Robert Kern wrote:
>   
>> David M. Cooke wrote:
>>
>>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>>
>>>> Let me offer a third path: the algorithms used for .mean() and .var() are 
>>>> substandard. There are much better incremental algorithms that entirely avoid 
>>>> the need to accumulate such large (and therefore precision-losing) intermediate 
>>>> values. The algorithms look like the following for 1D arrays in Python:
>>>>
>>>> def mean(a):
>>>>      m = a[0]
>>>>      for i in range(1, len(a)):
>>>>          m += (a[i] - m) / (i + 1)
>>>>      return m
>>>>
>>> This isn't really going to be any better than using a simple sum.
>>> It'll also be slower (a division per iteration).
>>>
>> With one exception, every test that I've thrown at it shows that it's better for 
>> float32. That exception is uniformly spaced arrays, like linspace().
>>
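[Aside: the float32 accumulation error under discussion is easy to see 
directly. A minimal sketch -- it sums into a scalar float32 the way a 
simple C loop would, so nothing masks the rounding; the array size is 
arbitrary:

from numpy import *

data = random.random([100000]).astype(float32)

acc = float32(0.0)
for x in data:
    acc += x                # each addition rounds to float32

print "naive float32 sum:", acc
print "float64 reference:", data.sum(dtype=float64)

The drift shows up in the low-order digits of the float32 total.]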
> Here's a version of mean based on the Kahan sum, including the 
> compensation term, that seems to be consistently much better than the 
> builtin mean. It is typically four orders of magnitude closer to the 
> "correct" answer than the builtin mean, and two orders of magnitude 
> better than just naively using the Kahan sum. I'm using the mean 
> computed with dtype=float64 as the "correct" value.
>



from numpy import *


def rawKahanSum(values):
    # Kahan (compensated) summation: carry a compensation term that
    # captures the low-order bits lost in each addition.
    if not len(values):
        return 0.0, 0.0
    total = values[0]
    c = type(total)(0.0)        # running compensation (lost low-order bits)
    for x in values[1:]:
        y = x - c               # correct the incoming value
        t = total + y           # low-order bits of y are lost here
        c = (t - total) - y     # recover the (negated) rounding error
        total = t
    return total, c

def kahanSum(values):
    total, c = rawKahanSum(values)
    return total

def mean(values):
    # mean = (total - c) / n; dividing before subtracting keeps the
    # compensation from being rounded away against the large total
    total, c = rawKahanSum(values)
    n = float(len(values))
    return total / n - c / n


for i in range(100):
    data = random.random([10000]).astype(float32)
    baseline = data.mean(dtype=float64)
    delta_builtin_mean = baseline - data.mean()
    delta_compensated_mean = baseline - mean(data)
    delta_kahan_mean = baseline - (kahanSum(data) / len(data))
    # flag any run where the errors don't come out in the expected order
    if not (abs(delta_builtin_mean) >= abs(delta_kahan_mean) >=
            abs(delta_compensated_mean)):
        print ">>>",
    print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean
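
Robert noted that uniformly spaced arrays, like linspace(), are the 
exception for the incremental algorithm, so that case is worth checking 
separately. A sketch of the same comparison, reusing mean() and 
kahanSum() from above:

data = linspace(0, 1, 10000).astype(float32)
baseline = data.mean(dtype=float64)
print "builtin:    ", baseline - data.mean()
print "kahan:      ", baseline - kahanSum(data) / len(data)
print "compensated:", baseline - mean(data)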


   


> -tim
>   
>>  > You do avoid
>>  > accumulating large sums, but then doing the division a[i]/len(a) and
>>  > adding that will do the same.
>>
>> Okay, this is true.
>>
>>
>>> Now, if you want to avoid losing precision, you want to use a better
>>> summation technique, like compensated (or Kahan) summation:
>>>
>>> def mean(a):
>>>     s = e = a.dtype.type(0)
>>>     for i in range(0, len(a)):
>>>         temp = s
>>>         y = a[i] + e
>>>         s = temp + y
>>>         e = (temp - s) + y
>>>     return s / len(a)
>>>
>>> Some numerical experiments in Maple using 5-digit precision show that
>>> your mean is maybe a bit better in some cases, but can also be much
>>> worse, than sum(a)/len(a), but both are quite poor in comparision to the
>>> Kahan summation.
>>>
>>> (We could probably use a fast implementation of Kahan summation in
>>> addition to a.sum())
>>>
>> +1
>>
>>
>>>> def var(a):
>>>>      m = a[0]
>>>>      t = a.dtype.type(0)
>>>>      for i in range(1, len(a)):
>>>>          q = a[i] - m
>>>>          r = q / (i+1)
>>>>          m += r
>>>>          t += i * q * r
>>>>      t /= len(a)
>>>>      return t
>>>>
>>>> Alternatively, from Knuth:
>>>>
>>>> def var_knuth(a):
>>>>      m = a.dtype.type(0)
>>>>      variance = a.dtype.type(0)
>>>>      for i in range(len(a)):
>>>>          delta = a[i] - m
>>>>          m += delta / (i+1)
>>>>          variance += delta * (a[i] - m)
>>>>      variance /= len(a)
>>>>      return variance
>>>>
>>> These formulas are good when you can only do one pass over the data
>>> (like in a calculator where you don't store all the data points), but
>>> are slightly worse than doing two passes. Kahan summation would probably
>>> also be good here too.
>>>
>> Again, my tests show otherwise for float32. I'll condense my ipython log into a 
>> module for everyone's perusal. It's possible that the Kahan summation of the 
>> squared residuals will work better than the current two-pass algorithm and the 
>> implementations I give above.
>>
>>
>
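P.S. Robert's suggestion of Kahan-summing the squared residuals drops 
straight into a two-pass variance. A sketch, reusing rawKahanSum() and 
the compensated mean() from above (var_kahan is just my name for it):

def var_kahan(a):
    m = mean(a)                          # compensated mean, first pass
    total, c = rawKahanSum((a - m)**2)   # Kahan sum of squared residuals
    n = float(len(a))
    return total / n - c / n             # apply the compensation at the end

It can be compared against the builtin a.var() on the float32 test 
arrays the same way the means are compared in the loop above.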





