[SciPy-User] stats, classes instead of functions for results MovStats

Mon Nov 30 16:43:55 EST 2009

On Mon, Nov 30, 2009 at 4:05 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 11/22/2009 11:43 PM, josef.pktd at gmail.com wrote:
>
> Following up on a question by Keith on the numpy list and his reminder
> that covariance can be calculated by the cross-product minus the
> product of the means, I redid and
> enhanced my moving stats functions.
>
> Suppose x and y are two time series, then the moving correlation
> requires the calculation of the mean, variance and covariance for each
> window. Currently in scipy stats intermediate results are usually
> thrown away on return (while rpy/R returns all intermediate results
> used for the calculation.
>
> Using a decorator/descriptor of Fernando written for nitime, I tried
> out to write the function as a class instead, so that any desired (
> intermediate) calculations are only made on demand, but once they are
> calculated they are attached to the class as attributes or properties.
> This seems to be a useful "pattern".
>
> Are there any opinion for using the pattern in scipy.stats ? MovStats
> will currently go into statsmodels
>
> Below is the class (with cutting part of init), a full script is the
> attachment, including examples that test the class.
>
> about MovStats:
> y and x are tested for 2d, either (T,N) with axis=0 or (N,T) with
> axis=1, should (but may not yet) work for nd arrays along any axis
> (signal.correlate docstring)
> nans are handled by dropping the corresponding observations from the
> window, not adding any additional observations,
> not tested if a window is empty because it contains only nans, nor if
> variance is zero
> (kern is intended for weighted statistics in the window but not tested
> yet, I still need to decide on normalization requirements)
> requires scipy.signal, all calculations done with signal.correlate, no loops
> as often, functions are one-liners
> all results are returned for valid observations only, initial
> observations with incomplete window are cut
> bonus: slope of moving regression of y on x, since it was trivial to add
> still some cleaning and documentation to do
>
> usage:
> ms = MovStats(x, y, axis=1)
> ms.yvar
> ms.xmean
> ms.yxcorr
> ms.yxcov
> ...
>
>
> Josef
>
> class MovStats(object):
>     def __init__(self, y, x=None, kern=5, axis=0):
>         self.y = y
>         self.x = x
>         if np.isscalar(kern):
>             ws = kern
> <... snip>
>
>     @OneTimeProperty
>     def ymean(self):
>         ys = signal.correlate(self.y, self.kern, mode='same')[self.sslice]
>         ym = ys/self.n
>         return ym
>
>     @OneTimeProperty
>     def yvar(self):
>         ys2 = signal.correlate(self.y*self.y, self.kern,
> mode='same')[self.sslice]
>         yvar = ys2/self.n  - self.ymean**2
>         return yvar
>
>     @OneTimeProperty
>     def xmean(self):
>         if self.x is None:
>             return None
>         else:
>             xs = signal.correlate(self.x, self.kern,
> mode='same')[self.sslice]
>             xm = xs/self.n
>             return xm
>
>     @OneTimeProperty
>     def xvar(self):
>         if self.x is None:
>             return None
>         else:
>             xs2 = signal.correlate(self.x*self.x, self.kern,
> mode='same')[self.sslice]
>             xvar = xs2/self.n  - self.xmean**2
>             return xvar
>     @OneTimeProperty
>     def yxcov(self):
>         xys = signal.correlate(self.x*self.y, self.kern,
> mode='same')[self.sslice]
>         return xys/self.n - self.ymean*self.xmean
>
>     @OneTimeProperty
>     def yxcorr(self):
>         return self.yxcov/np.sqrt(self.yvar*self.xvar)
>
>     @OneTimeProperty
>     def yxslope(self):
>         return self.yxcov/self.xvar
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>

Thanks for checking

> I think your handling of NaN's is incorrect because you do not drop the
> corresponding observations. That is for two arrays
>
> y=np.array([[ 1.229563, -0.339428,  0.83891 ,  4.026574,  3.069378,  5.95668
> ]])
> x=np.array([[-1.236469,  1.941089, -0.346566, -0.268529,     np.nan,
> 0.191336]])
>
> For a windows size of 5, in the first window, the first mean and variance of
> y should use all 5 elements of y, the mean and variance of X should use the
> first 4 elements of x and, the regression and correlation coefficients
> should use the first 4 elements of x and y.

What I do currently is a compromise, I don't want to calculate mean
and variance twice. So the behavior now is, if only one array is
given, then you get the mean and variance dropping the nan
observations for that array. If two arrays are given then I drop
observations in both arrays if either one has a nan. This way the user
can choose whether they want the separate calculation.
If a user provides two arrays, my assumption is that they want cov and corr.

>
>
> Some other points:
> 1) Your calculation of variance is susceptible to errors, see
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
> Provided that you are using sufficient precision (like numpy defaults) it is
> probably not that big a problem.

Yes, I'm aware of this, for the example the difference to np.corrcoeff
is around
1e-14. I will add a warning to the docstring that this is designed for speed
with some precision loss. I might be able to use some preprocessing to at least
treat some badly scaled data, but the usual higher numerical precision ways
of calculating would require much slower loops. For reasonably short windows
this seems to be an acceptable tradeoff.

> 2) You only use the 'full' windows so when the window width is 5, you miss
> the first two windows and the last 2 windows. At least the mean exists in
> these windows and the variance in most of these partial windows. This may
> provided unexpected results to a user if they do not release which windows
> are not returned.

Currently I'm returning "valid" observations, that have a full window.
In a previous
version I allowed for a lag, lead, centered option, Keith returns nans, I think
scikits timeseries masks. I don't know yet which or whether these options
should be included in  this function (class).

> 3) I think the user needs to define the kern argument for your MovStat class
> as there is probably no meaningful default value (except 42).

at least the window length should be specified by the user. I picked 5 for
business week mostly arbitrary to reduce typing (?)
In a slightly updated version I switched to convolution instead of correlation
to have the correct orientation for e.g. exponential weights. But I haven't
tested this yet

> 4) I do not know how you should handle positive and negative infinity.

I haven't thought about this, but since most of the time I consider inf as
a valid number, I think, it will return infs in the corresponding windows.
With a masked array, infs could be masked, but for regular arrays, I don't
want to convert infs to nans.
However, I still have to figure out some corner cases, e.g. no valid
observations
in a window, windows with zero variance. it would also be possible to
require a minimum of valid observations per window.

> 5) Your code expects at least 2 dimensions so 1-d arrays fail because you
> can not do this assignment 'kdim[axis] = ws' with 1-d arrays.

Thanks, I have tested only the 2d case. I guess I have to review
general axis handling again

Josef

>
> Bruce
>
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>