[Numpy-discussion] Weighted Covariance/correlation

Sun Aug 24 16:05:45 EDT 2014

Hi all,

Any input to this? Last time it generated a fair bit of discussion, which I’ll summarise here.

It’s currently possible to calculate a weighted average using np.average, but the corresponding functionality does not exist for (co)variance or corrcoef calculations. In this case it’s less straightforward, and we need to worry about what type of information the weights contain.

Repeat type weights are the easiest to explain. Here the variances of

[x1, x2, x3] with weights [2, 1, 3]

and

[x1, x1, x2, x3, x3, x3]

are identical. For Bessel's correction, the total number of samples is obtained by summing the weights. These weights do not have to be integers; the only important assumption is that their sum represents the total sample size.
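
To make the equivalence concrete, here is a minimal sketch. It is not the code in the pull request, and the helper name repeat_weighted_var is made up for illustration. It assumes the weights sum to the total sample size, as described above.

```python
import numpy as np

def repeat_weighted_var(x, w):
    """Unbiased variance with repeat-type weights (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = w.sum()                       # total sample size = sum of the weights
    mean = np.average(x, weights=w)   # weighted mean
    # Weighted sum of squared deviations, with Bessel's correction (n - 1)
    return (w * (x - mean) ** 2).sum() / (n - 1)

x = np.array([1.0, 4.0, 6.0])
w = np.array([2, 1, 3])

expanded = np.repeat(x, w)            # [1, 1, 4, 6, 6, 6]
weighted = repeat_weighted_var(x, w)
unweighted = np.var(expanded, ddof=1)

print(weighted, unweighted)           # the two values should agree
```

With integer weights the two results agree up to floating-point rounding; with non-integer weights only the weighted form is defined.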

The second type of weights are importances or accuracies. Here the weights represent the relative strength of the contribution from each of the associated samples. Because this is a purely relative relation, there’s no concrete information about the total number of samples. This has to be obtained from the effective sample size, given by (sum(weights)^2)/sum(weights^2).
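
As a rough illustration of the effective sample size formula (again just a sketch, with effective_sample_size being a name invented here rather than anything in NumPy):

```python
import numpy as np

def effective_sample_size(w):
    """(sum(w))^2 / sum(w^2) for accuracy-type weights (hypothetical helper)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Equal weights recover the plain sample count ...
n_eff_equal = effective_sample_size([1, 1, 1, 1])
# ... and rescaling all weights by a constant changes nothing,
# which is what "purely relative" means here.
n_eff_scaled = effective_sample_size([5, 5, 5, 5])
# Uneven weights give an effective size below the number of samples.
n_eff_uneven = effective_sample_size([2, 1, 3])

print(n_eff_equal, n_eff_scaled, n_eff_uneven)
```

The invariance under rescaling is the main difference from repeat type weights, where multiplying every weight by a constant changes the assumed sample size.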

I think the clearest way of providing both options is to have a boolean switch indicating if the weights represent repeats or frequency type information. I can’t immediately see a good motivation for allowing both concurrently, and think this could cause confusion.

Tom 

On 15 Aug 2014, at 14:46, Sebastian Berg <sebastian at sipsolutions.net> wrote:

> Hi all,
> 
> Tom Poole has opened pull request
> https://github.com/numpy/numpy/pull/4960 to implement weights into
> np.cov (correlation can be added), somewhat picking up the effort
> started by Noel Dawe in https://github.com/numpy/numpy/pull/3864.
> 
> The pull request would currently implement an accuracy type `weights`
> keyword argument as default, but have a switch `repeat_weights` to use
> repeat type weights instead (frequency type are a special case of this I
> think).
> 
> As far as I can see, the code is in a state that it can be tested. But
> since it is a new feature, the names/defaults are up for discussion, so
> maybe someone who might use such a feature has a preference. I know we
> had a short discussion about this before, but it was a while ago. For
> example another option would be to have the two weights as two keyword
> arguments, instead of a boolean switch.
> 
> Regards,
> 
> Sebastian