[Numpy-discussion] Weighted Covariance/correlation

Mon Sep 8 12:14:22 EDT 2014

Hi all, 

Any input to this? Last time it generated a fair bit of discussion, which
I’ll summarise here. 

It’s currently possible to calculate a weighted average using np.average,
but the corresponding functionality does not exist for np.cov or
np.corrcoef. The weighted (co)variance case is less straightforward, because
we need to worry about what type of information the weights contain. 

Repeat type weights are the easiest to explain. Here the variances of 

[x1, x2, x3] with weights [2, 1, 3] 

and 

[x1, x1, x2, x3, x3, x3] 

are identical. For Bessel correction the total number of samples is obtained
by summing the weights. These weights do not have to be integer, and in this
case the only important assumption is that their sum represents the total
sample size. 
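
As a quick check of the repeat-type equivalence described above, here is a minimal sketch. The helper name repeat_weighted_var and the sample values are made up for illustration and are not taken from the pull request.

```python
import numpy as np

def repeat_weighted_var(x, w):
    """Variance of x treating w as repeat counts (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = w.sum()                          # total sample size = sum of weights
    mean = np.average(x, weights=w)      # weighted mean
    # Bessel correction uses the summed weights as the sample count
    return (w * (x - mean) ** 2).sum() / (n - 1)

x = np.array([1.0, 2.0, 4.0])            # illustrative values for x1, x2, x3
w = np.array([2, 1, 3])                  # repeat counts

expanded = np.repeat(x, w)               # [x1, x1, x2, x3, x3, x3]
weighted_var = repeat_weighted_var(x, w)
expanded_var = np.var(expanded, ddof=1)  # ordinary unbiased variance
print(np.isclose(weighted_var, expanded_var))
```

The two values should agree, since summing the weights recovers the length of the expanded sample.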

The second type of weights are importances or accuracies. Here the weights
represent the relative strength of contributions from each of the associated
samples. Because this is a purely relative relation, there’s no concrete
information about the total number of samples. For the Bessel correction the
sample size instead has to be estimated by the effective sample size, given
by (sum(weights)^2)/sum(weights^2). 
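
To make the effective sample size concrete, here is a small sketch. The function names are hypothetical, and plugging n_eff into the Bessel correction as shown is my reading of the description above, not necessarily what the pull request does.

```python
import numpy as np

def effective_sample_size(w):
    """(sum(w)^2) / sum(w^2), as given above (hypothetical helper name)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def accuracy_weighted_var(x, w):
    """Variance with accuracy-type weights (assumed form of the correction)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n_eff = effective_sample_size(w)
    mean = np.average(x, weights=w)
    biased = np.average((x - mean) ** 2, weights=w)  # normalised by sum(w)
    return biased * n_eff / (n_eff - 1)              # Bessel-style correction

w = np.array([2.0, 1.0, 3.0])
n_eff = effective_sample_size(w)                  # 36 / 14
n_eff_scaled = effective_sample_size(10 * w)      # rescaling changes nothing
n_eff_equal = effective_sample_size(np.ones(5))   # equal weights give n = 5
```

Because only relative weights matter here, multiplying all weights by a constant leaves both n_eff and the resulting variance unchanged, and equal weights reduce to the usual unweighted estimate.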

I think the clearest way of providing both options is to have a boolean
switch indicating whether the weights represent repeat type or accuracy type
information. I can’t immediately see a good motivation for allowing both
concurrently, and think this could cause confusion. 
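
For anyone who wants to see what the boolean switch might look like, here is a rough sketch. The function name weighted_cov is hypothetical; the repeat_weights keyword and its False default follow Sebastian's description quoted below, but the body is my own illustration, not the code in the pull request.

```python
import numpy as np

def weighted_cov(m, weights, repeat_weights=False):
    """Weighted covariance between the rows of m (illustrative sketch only)."""
    m = np.atleast_2d(np.asarray(m, dtype=float))
    w = np.asarray(weights, dtype=float)
    v1 = w.sum()
    mean = (m * w).sum(axis=1, keepdims=True) / v1   # weighted row means
    d = m - mean
    if repeat_weights:
        # repeat type: sum of weights is the total sample size
        norm = v1 - 1.0
    else:
        # accuracy type: equivalent to v1 * (n_eff - 1) / n_eff
        norm = v1 - (w ** 2).sum() / v1
    return (d * w) @ d.T / norm

m = np.array([[1.0, 2.0, 4.0],
              [3.0, 1.0, 5.0]])          # two variables, three observations
w = np.array([2, 1, 3])

cov_repeat = weighted_cov(m, w, repeat_weights=True)
cov_accuracy = weighted_cov(m, w)
```

With repeat_weights=True the result should match np.cov applied to the explicitly repeated observations, while the default result does not change if all weights are rescaled.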

Tom 

On 15 Aug 2014, at 14:46, Sebastian Berg <[hidden email]> wrote: 

> Hi all, 
> 
> Tom Poole has opened pull request 
> https://github.com/numpy/numpy/pull/4960 to implement weights into 
> np.cov (correlation can be added), somewhat picking up the effort 
> started by Noel Dawe in https://github.com/numpy/numpy/pull/3864. 
> 
> The pull request would currently implement an accuracy type `weights` 
> keyword argument as default, but have a switch `repeat_weights` to use 
> repeat type weights instead (frequency type are a special case of this I 
> think). 
> 
> As far as I can see, the code is in a state that it can be tested. But 
> since it is a new feature, the names/defaults are up for discussion, so 
> maybe someone who might use such a feature has a preference. I know we 
> had a short discussion about this before, but it was a while ago. For 
> example another option would be to have the two weights as two keyword 
> arguments, instead of a boolean switch. 
> 
> Regards, 
> 
> Sebastian 
> 
> _______________________________________________ 
> NumPy-Discussion mailing list 
> [hidden email] 
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
