[Numpy-discussion] Adding weights to cov and corrcoef

Thu Mar 6 15:02:53 EST 2014

On Thu, Mar 6, 2014 at 2:51 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
>>
>> Hi all,
>>
>> in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe
>> suggested adding new parameters to our `cov` and `corrcoef` functions to
>> implement weights, which already exist for `average` (the PR still
>> needs to be adapted).
>>
>> The idea right now would be to add `weights` and `frequencies`
>> keyword arguments to these functions.
>>
>> In more detail: The situation is a bit more complex for `cov` and
>> `corrcoef` than `average`, because there are different types of weights.
>> The current plan would be to add two new keyword arguments:
>>   * weights: Uncertainty weights which cause `N` to be recalculated
>>     accordingly (This is R's `cov.wt` default I believe).
>>   * frequencies: When given, `N = sum(frequencies)` and the values
>>     are weighted by their frequency.
>
> I don't understand this description at all. One of them recalculates N,
> and the other sets N according to some calculation?
>
> Is there a standard reference on how these are supposed to be
> interpreted? When you talk about per-value uncertainties, I start
> imagining that we're trying to estimate a population covariance given
> a set of samples each corrupted by independent measurement noise, and
> then there's some natural hierarchical Bayesian model one could write
> down and get an ML estimate of the latent covariance via empirical
> Bayes or something. But this requires a bunch of assumptions and is
> that really what we want to do? (Or maybe it collapses down into
> something simpler if the measurement noise is Gaussian or something?)

I think the idea is that if you write formulas involving correlation
or covariance using matrix notation, then these formulas can be
generalized in several different ways by inserting some non-negative
or positive diagonal matrices into the formulas in various places.
The diagonal entries could be called 'weights'.  If they are further
restricted to sum to 1 then they could be called 'frequencies'.  Or
maybe this is too cynical and the jargon has a more standard meaning
in this context.
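
To make the "insert a diagonal matrix" idea concrete, here is a rough
sketch. It is not the code from the PR. The function name `weighted_cov`
and its `frequencies` flag are made up for illustration, and the two
normalizations are just one common convention each (integer repeat counts
versus weights whose overall scale should not matter). Variables are in
rows and observations in columns, as in `np.cov`.

```python
import numpy as np

def weighted_cov(x, w, frequencies=True):
    """Hypothetical helper: covariance of the rows of x under weights w.

    x : (n_variables, n_observations) array, same layout as np.cov
    w : (n_observations,) non-negative weights
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    W = np.diag(w)                  # the diagonal "weights" matrix
    total = w.sum()
    mean = x @ w / total            # weighted mean of each variable
    xc = x - mean[:, None]          # centered data
    scatter = xc @ W @ xc.T         # X_c W X_c^T
    if frequencies:
        # Assumption: w are repeat counts, so N = sum(w) and we use N - 1.
        norm = total - 1.0
    else:
        # Assumption: w are relative weights; this normalization is
        # unchanged (up to the same factor as scatter) when w is rescaled.
        norm = total - (w ** 2).sum() / total
    return scatter / norm

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
f = np.array([1, 2, 3, 1])

# Frequency weights should agree with literally repeating the observations.
print(np.allclose(weighted_cov(x, f), np.cov(np.repeat(x, f, axis=1))))
# Relative weights should not care about the overall scale of w.
print(np.allclose(weighted_cov(x, f, frequencies=False),
                  weighted_cov(x, 10.0 * f, frequencies=False)))
```

With all weights equal to 1, both branches reduce to plain `np.cov(x)`,
so the two kinds of weights differ only in how `N` is defined.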