numpy/scipy: error of correlation coefficient (clumpy data)

sturlamolden sturlamolden at yahoo.no
Wed Nov 15 20:35:18 EST 2006


robert wrote:

> here the bootstrap test will likewise tell us that the confidence interval narrows by a factor of ~sqrt(10) - just the same as if there were 10-fold more well-distributed "new" data. Thus this kind of error estimation has no reasonable basis for data which is not very good.


The confidence interval narrows as the amount of independent data
increases. If you don't understand why, then you lack a basic
understanding of statistics. In particular, it is a fundamental
assumption in most statistical models that the data samples are
"INDEPENDENT AND IDENTICALLY DISTRIBUTED", often abbreviated "i.i.d.",
and it certainly is assumed in this case. If you tell the computer (or
model) that you have i.i.d. data, it will assume it is i.i.d. data,
even when it's not. The fundamental law of computer science also applies
to statistics: shit in = shit out. If you nevertheless provide data
that are not i.i.d., as you just did, you will simply obtain invalid
results.
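
To make this concrete, here is a quick sketch (the toy data below are
made up by me, not robert's numbers): scipy.stats.pearsonr assumes the
pairs are i.i.d., so pasting the same pairs in ten times leaves r
unchanged but makes the p-value look dramatically more significant,
even though no new information was added.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 0.5 * x + rng.normal(size=20)              # weakly correlated toy data

r1, p1 = pearsonr(x, y)                        # honest i.i.d. sample, n = 20
r2, p2 = pearsonr(np.tile(x, 10), np.tile(y, 10))  # same pairs duplicated 10x

print(r1, p1)   # some r with a modest p-value
print(r2, p2)   # identical r, but a much smaller ("better") p-value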

The confidence interval concerns uncertainty about the value of a
population parameter, not about the spread of your data sample. If you
collect more INDEPENDENT data, you know more about the population from
which the data was sampled. The confidence interval has the property
that it will contain the unknown "true correlation" 95% of the times it
is generated. Thus if you draw two samples WITH INDEPENDENT DATA from
the same population, one small and one large, the large sample will
generate a narrower confidence interval. Computer-intensive methods
like bootstrapping and analytically derived asymptotic approximations
behave similarly in this respect. However, if you are dumb enough
to just provide duplications of your data, the computer is dumb enough
to accept that they were obtained statistically independently. In
statistical jargon this is called "pseudo-sampling", and is one of the
most common fallacies among uneducated practitioners.
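
Here is a rough illustration of that pseudo-sampling effect with a
percentile bootstrap (again on toy data of my own, not robert's): the
bootstrap resamples pairs with replacement and has no way of knowing
that the "big" sample is just the small one pasted in ten times, so its
interval narrows by roughly sqrt(10) without any real gain in
information.

import numpy as np

def bootstrap_corr_ci(x, y, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for Pearson's r."""
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample pairs with replacement
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.quantile(rs, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.6 * x + rng.normal(size=30)

lo, hi = bootstrap_corr_ci(x, y)
lo10, hi10 = bootstrap_corr_ci(np.tile(x, 10), np.tile(y, 10))

print("width, original sample :", hi - lo)
print("width, duplicated 10x  :", hi10 - lo10)   # roughly 1/sqrt(10) as wide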

Statistical software doesn't prevent the practitioner from shooting
himself in the foot; it actually makes it a lot easier. Anyone can paste
data from Excel into SPSS and hit "ANOVA" in the menu. Whether the
output makes any sense is a whole other story. One could duplicate each
sample three or four times, and SPSS would be ignorant of that fact. It
cannot guess that you are providing it with crappy data, and it cannot
prevent you from screwing up your analysis. The same goes for NumPy
code. The statistical formulas you type in Python rest on certain
assumptions, and when those assumptions are violated the output loses
its value. The more severe the violation, the less the output is worth.
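
For instance, the usual analytic interval for a correlation coefficient
comes from Fisher's z-transform with standard error 1/sqrt(n - 3), where
n must be the number of INDEPENDENT pairs. A small sketch (my own code,
not anything robert posted) of how an n inflated by duplication silently
shrinks the interval:

import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, alpha=0.05):
    """CI for a correlation via Fisher's z-transform; assumes n i.i.d. pairs."""
    z = np.arctanh(r)                            # Fisher transform of r
    half = norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)

print(fisher_ci(0.5, 20))    # 20 independent pairs
print(fisher_ci(0.5, 200))   # same r, n inflated 10x: much narrower interval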

> The interesting task is probably this: to check for linear correlation but "weight clumping of data" somehow for the error estimation.

If you have a pathological data sample, then you need to specify your
knowledge in greater detail. Can you e.g. formulate a reasonable
stochastic model for your data, fit the model parameters using the
data, and then derive the correlation analytically?
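
If the answer is no, but you can at least label which observations
belong to the same clump, one conventional option (my suggestion, not
something you have specified) is a cluster bootstrap: resample whole
clumps instead of single pairs, so the dependence inside a clump is
preserved in the resampling.

import numpy as np

def cluster_bootstrap_corr_ci(x, y, clusters, n_boot=5000, alpha=0.05, seed=2):
    """Percentile CI for Pearson's r, resampling whole clusters of pairs."""
    rng = np.random.default_rng(seed)
    members = [np.flatnonzero(clusters == c) for c in np.unique(clusters)]
    rs = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.integers(0, len(members), size=len(members))
        idx = np.concatenate([members[i] for i in picked])
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.quantile(rs, [alpha / 2, 1 - alpha / 2])

# Toy check: 30 genuine pairs, each duplicated 10 times. Labelling the copies
# of one pair as one cluster recovers roughly the honest (wide) interval
# instead of the spuriously narrow one a naive pairwise bootstrap reports.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.6 * x + rng.normal(size=30)
x10, y10 = np.tile(x, 10), np.tile(y, 10)
labels10 = np.tile(np.arange(30), 10)        # copy k of pair i carries label i
print(cluster_bootstrap_corr_ci(x10, y10, labels10))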

I am beginning to think your problem is ill-defined because you lack a
basic understanding of maths and statistics. For example, it seems you
were confusing numerical error (rounding and truncation error) with
statistical sampling error; you don't understand why standard errors
decrease with sample size; you are testing with pathological data; and
you don't understand the difference between independent data and
duplicated data. You really need to pick up a statistics textbook and
do some reading; that's my advice.



