numpy/scipy: error of correlation coefficient (clumpy data)

Thu Nov 16 10:42:20 EST 2006

robert wrote:

> Think of such example: A drunken (x,y) 2D walker is supposed to walk along a diagonal, but he makes frequent and unpredictable pauses/slow motion. You get x,y coordinates in 1 per second. His speed and time pattern at all do not matter - you just want to know how well he keeps his track.

In which case you have time series data, i.e. regular samples from p(t)
= [ x(t), y(t) ]. Time series samples are usually autocorrelated, and
that must be taken into account. Even though you could weight each
point by the drunkard's speed, a correlation or linear regression would
still not make sense here, because such analyses assume no
autocorrelation in the samples or the residuals. Correlation has no
meaning if y[t] is correlated with y[t+1], and regression has no
meaning if the residual e[t] is correlated with the residual e[t+1].
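
To make the autocorrelation point concrete, here is a minimal sketch in
Python with NumPy. The walker model (random pauses plus an AR(1)
sideways error with coefficient 0.9) and the helper lag1_autocorr are
assumptions invented for illustration; they are not derived from your
data.

```python
import numpy as np

def lag1_autocorr(z):
    """Sample lag-1 autocorrelation of a 1-D sequence."""
    z = np.asarray(z, dtype=float)
    z = z - z.mean()
    return float(np.dot(z[:-1], z[1:]) / np.dot(z, z))

rng = np.random.default_rng(0)
n = 500

# Hypothetical walker: progress along the diagonal with random pauses,
# plus a slowly wandering sideways error (AR(1), assumed for illustration).
moves = rng.random(n) < 0.3          # True when the walker actually steps
progress = np.cumsum(moves)
side = np.zeros(n)
for t in range(1, n):
    side[t] = 0.9 * side[t - 1] + rng.normal(scale=0.2)
x = progress - side
y = progress + side

# Ordinary least-squares line y = b*x + a and its residuals.
b, a = np.polyfit(x, y, 1)
resid = y - (b * x + a)

r_y = lag1_autocorr(y)                        # samples themselves
r_resid = lag1_autocorr(resid)                # regression residuals
r_white = lag1_autocorr(rng.normal(size=n))   # uncorrelated reference
print(r_y, r_resid, r_white)
```

If the lag-1 autocorrelation of y or of the residuals is far from what
you get for the white-noise reference, the no-autocorrelation
assumption behind the usual correlation and regression error estimates
does not hold.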

A state space model could be applicable here. You could estimate the
path of the drunkard with a Kalman filter, using a Taylor series
expansion p(t) = p0 + v*t + 0.5*a*t**2 + ... as the model of the path
at each step. Once you have estimates of the state parameters p0, v,
and a, you can compute some measure of the drunkard's deviation from
his ideal path.
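
As a rough sketch of that idea, the following Python/NumPy code runs a
textbook linear Kalman filter with a [position, velocity, acceleration]
state on each coordinate separately. Everything specific here is an
assumption for illustration: the function name kalman_const_accel, the
noise variances q and r, the simulated path, and the choice of the
perpendicular distance from the diagonal as the deviation measure.

```python
import numpy as np

def kalman_const_accel(z, dt=1.0, q=1e-3, r=0.25):
    """Filter 1-D position measurements z with a [p, v, a] state.

    q and r are assumed process and measurement noise variances,
    picked by hand for this illustration.
    """
    F = np.array([[1.0, dt, 0.5 * dt**2],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])      # Taylor-series state transition
    H = np.array([[1.0, 0.0, 0.0]])      # only the position is observed
    Q = q * np.eye(3)
    R = np.array([[r]])
    s = np.array([z[0], 0.0, 0.0])       # initial state estimate
    P = np.eye(3)                        # initial state covariance
    out = np.empty((len(z), 3))
    for k, zk in enumerate(z):
        # Predict the next state from the Taylor expansion.
        s = F @ s
        P = F @ P @ F.T + Q
        # Update with the new measurement.
        innov = zk - H @ s
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        s = s + K @ innov
        P = (np.eye(3) - K @ H) @ P
        out[k] = s
    return out

# Simulated walker (hypothetical): steady progress along the diagonal
# with a slow sideways wobble, observed with noise once per second.
rng = np.random.default_rng(0)
n = 400
t = np.arange(n)
along = 0.5 * t
side = 2.0 * np.sin(2 * np.pi * t / 200)
x_obs = (along + side) / np.sqrt(2) + rng.normal(scale=0.5, size=n)
y_obs = (along - side) / np.sqrt(2) + rng.normal(scale=0.5, size=n)

fx = kalman_const_accel(x_obs)
fy = kalman_const_accel(y_obs)

# Perpendicular distance from the line y = x, raw and filtered.
dev_raw = (x_obs - y_obs) / np.sqrt(2)
dev_filt = (fx[:, 0] - fy[:, 0]) / np.sqrt(2)
rms_raw = np.sqrt(np.mean((dev_raw - side) ** 2))
rms_filt = np.sqrt(np.mean((dev_filt - side) ** 2))
print(rms_raw, rms_filt)
```

Columns 1 and 2 of the returned array hold the velocity and
acceleration estimates. In a real analysis q and r would have to be
justified from what you know about the walker and the measurement
process, not guessed as they are here.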

However, if you don't have time series data, you should not treat your
data as such.

If you don't know how your data are generated, there is no way to deal
with them correctly. If the samples are a time series, they must be
treated as such; if they are not, they should not be. If the samples
are i.i.d., each point counts equally; if they are not, they do not. If
your data are clumped because of time series structure or a lack of
i.i.d. sampling, you must deal with that. However, data can be both
i.i.d. and clumped, if the underlying distribution is clumped. To
determine the cause, you must consider how your data are generated and
how they are sampled. You need meta-information about your data to
determine this.
Matlab or Octave will not help you with this, and it is certainly not a
weakness of NumPy as you implied in your original post. There is no way
to put magic into any numerical computation. Statistics always requires
the formulation of specific assumptions about the data. If you cannot
think clearly about your data, then that is the problem you must solve.
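
To illustrate why the numbers alone cannot settle this, here is a small
Python/NumPy sketch with two made-up data sets that both form the same
two clumps: one drawn i.i.d. from a two-cluster mixture, the other a
time series that lingers at each spot for 50 samples. The cluster
centers, dwell time, and helper names are assumptions for illustration
only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
centers = np.array([[0.0, 0.0], [5.0, 5.0]])

# Case A: i.i.d. draws from a clumped distribution (two-cluster mixture).
labels = rng.integers(0, 2, size=n)
iid = centers[labels] + rng.normal(scale=0.5, size=(n, 2))

# Case B: a time series that lingers near the same two spots,
# alternating every 50 samples (hypothetical mechanism).
state = (np.arange(n) // 50) % 2
series = centers[state] + rng.normal(scale=0.5, size=(n, 2))

def lag1_autocorr(z):
    """Sample lag-1 autocorrelation of a 1-D sequence."""
    z = z - z.mean()
    return float(np.dot(z[:-1], z[1:]) / np.dot(z, z))

def pearson_r(p):
    """Pearson correlation between the two columns of p."""
    return float(np.corrcoef(p[:, 0], p[:, 1])[0, 1])

# Pearson's r ignores sample order: shuffling the rows leaves it unchanged.
r_series = pearson_r(series)
r_shuffled = pearson_r(rng.permutation(series))

# The sample order is where the two cases differ.
ac_iid = lag1_autocorr(iid[:, 0])
ac_series = lag1_autocorr(series[:, 0])
print(r_series, r_shuffled, ac_iid, ac_series)
```

The correlation coefficient is the same whether or not the rows are
shuffled, so it carries no information about the ordering. Only
knowledge of how the samples were collected tells you whether the 400
points count as 400 independent observations.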