[Numpy-discussion] Adding weights to cov and corrcoef

josef.pktd at gmail.com
Fri Mar 7 10:39:08 EST 2014


On Fri, Mar 7, 2014 at 12:06 AM,  <josef.pktd at gmail.com> wrote:
> On Thu, Mar 6, 2014 at 2:51 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg
>> <sebastian at sipsolutions.net> wrote:
>>>
>>> Hi all,
>>>
>>> in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe
>>> suggested adding new parameters to our `cov` and `corrcoef` functions
>>> to support weights, something `average` already has (the PR still
>>> needs to be adapted).
>>>
>>> The idea right now would be to add `weights` and `frequencies`
>>> keyword arguments to these functions.
>>>
>>> In more detail: the situation is a bit more complex for `cov` and
>>> `corrcoef` than for `average`, because there are different types of weights.
>>> The current plan would be to add two new keyword arguments:
>>>   * weights: Uncertainty weights, which cause `N` to be recalculated
>>>     accordingly (this is R's `cov.wt` default, I believe).
>>>   * frequencies: When given, `N = sum(frequencies)` and the values
>>>     are weighted by their frequency.
>>
>> I don't understand this description at all. One of them recalculates
>> N, and the other sets N according to some calculation?
>>
>> Is there a standard reference on how these are supposed to be
>> interpreted? When you talk about per-value uncertainties, I start
>> imagining that we're trying to estimate a population covariance given
>> a set of samples each corrupted by independent measurement noise, and
>> then there's some natural hierarchical Bayesian model one could write
>> down and get an ML estimate of the latent covariance via empirical
>> Bayes or something. But this requires a bunch of assumptions and is
>> that really what we want to do? (Or maybe it collapses down into
>> something simpler if the measurement noise is gaussian or something?)
>
> In general, this is going mostly by Stata's conventions:
>
> Frequency weights are just a shortcut if you have repeated
> observations. In my unit tests, the results are the same as using
> np.repeat, IIRC. The total number of observations is the sum of the
> weights.
>
> aweights and pweights are mainly like the weights in WLS, reflecting
> the uncertainty of each observation. The number of observations is
> equal to the number of rows. (Stata internally rescales the weights.)
> One explanation is that observations are measured with different
> noise; another is that observations represent the means of subsamples
> with different numbers of observations.
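>
> As far as I understand the rescaling (a rough sketch with made-up
> numbers, not Stata's actual code): the weights are scaled so they sum
> to the number of rows, and that count is what enters the small-sample
> correction, e.g. for a variance
>
> import numpy as np
>
> x = np.array([1.0, 2.0, 4.0, 7.0])
> w = np.array([0.5, 1.5, 2.0, 4.0])  # aweight-style weights
>
> n = len(x)
> w_resc = w * n / w.sum()            # rescaled weights sum to n
> m = np.average(x, weights=w_resc)
> var_aw = np.sum(w_resc * (x - m) ** 2) / (n - 1)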
>
> There is an additional degrees-of-freedom correction in one of the
> proposed calculations, modeled after other packages, that I never
> figured out.

I found the missing proof:

http://stats.stackexchange.com/questions/47325/bias-correction-in-weighted-variance
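
If I read it right, for the scalar variance the correction amounts to
dividing by 1 - sum(w**2) instead of 1 once the weights are normalized
to sum to one (a sketch to illustrate, not the PR code):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
w = np.array([0.5, 1.5, 2.0, 4.0])

wn = w / w.sum()                              # normalize to sum to 1
m = np.sum(wn * x)
var_ml = np.sum(wn * (x - m) ** 2)            # "biased" plug-in estimate
var_corr = var_ml / (1.0 - np.sum(wn ** 2))   # bias-corrected

# with equal weights this reduces to the usual ddof=1 estimator
wn_eq = np.ones_like(x) / len(x)
m_eq = np.sum(wn_eq * x)
v_eq = np.sum(wn_eq * (x - m_eq) ** 2) / (1.0 - np.sum(wn_eq ** 2))
print(np.allclose(v_eq, np.var(x, ddof=1)))   # True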

Josef

>
> (Aside: statsmodels does not normalize the scale in WLS, in contrast
> to Stata, and it is now equivalent to GLS with a diagonal sigma. The
> meaning of weight=1 is up to the user. nobs is the number of rows.)
>
> No Bayesian analysis is involved, but I guess someone could come up
> with a Bayesian interpretation.
>
> I think the two proposed weight types, weights and frequencies, should
> be able to handle almost all cases.
>
> Josef
>
>>
>> -n
>>
>> --
>> Nathaniel J. Smith
>> Postdoctoral researcher - Informatics - University of Edinburgh
>> http://vorpus.org


