[SciPy-User] "small data" statistics

Fri Oct 12 11:01:23 EDT 2012

On 12.10.2012 16:21, Emanuele Olivetti wrote:

> 1) In this thread people expressed interest in making hypothesis testing
> from small samples, so is permutation test addressing the question of
> the accompanying motivating example? In my opinion it is not and I hope I
> provided brief but compelling motivation to support this point of view.

For the problem Josef described, I'd analyze that as a two-sample 
goodness-of-fit test against a common bin(20,p) distribution.

> 2) What are the assumptions under which the permutation test is
> valid/acceptable (independently from the accompanying motivating example)?
> I have looked around on this topic but I had just found generic desiderata for
> all resampling approaches, i.e. that the sample should be "representative"
> of the underlying distribution - whatever this means in practical terms.

Ronald A. Fisher considered the permutation test to be the "exact 
procedure" the t-test should approximate. It has, in fact, all the 
assumptions of the t-test.

Surprisingly many people think the t-test assumes normally distributed 
data. It does not. If you have this idea too, please forget it.

The t-test only asserts that the large-sample "sampling distribution of 
the mean" (i.e. the mean you calculate, not the data points themselves) 
is a normal distribution. This is due to the central limit theorem. If 
you collect enough data, the distribution of the sample mean will 
converge towards a normal distribution. That is a mathematical 
necessity, and can be proven to be the case whenever the data have 
finite variance. But with small data samples, the sampling distribution 
of the mean can deviate from a normal distribution. That is when we 
need to use the permutation test instead.

I.e.: The t-test is an approximation to the permutation test for "large 
enough" data samples.
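
To make the comparison concrete, here is a minimal sketch of a Monte 
Carlo permutation test for a difference in means, run next to the 
ordinary t-test. The function name permutation_test_mean_diff, the 
number of resamples, and the exponential toy data are assumptions made 
for illustration only; they are not taken from the thread.

```python
import numpy as np
from scipy import stats


def permutation_test_mean_diff(x, y, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Hypothetical helper for illustration: it samples random relabelings
    instead of enumerating all of them.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())

    count = 0
    for _ in range(n_resamples):
        perm = rng.permutation(pooled)  # random relabeling of the pooled data
        diff = abs(perm[:x.size].mean() - perm[x.size:].mean())
        # small tolerance so that ties count as "at least as extreme"
        if diff >= observed - 1e-12:
            count += 1

    # add-one correction keeps the estimated p-value strictly positive
    return (count + 1) / (n_resamples + 1)


# Toy data (assumed): two small, skewed samples with shifted means.
rng = np.random.default_rng(42)
x = rng.exponential(1.0, size=8)
y = rng.exponential(1.0, size=8) + 1.0

p_perm = permutation_test_mean_diff(x, y)
t_stat, p_t = stats.ttest_ind(x, y)  # pooled-variance t-test
print("permutation p-value:", p_perm)
print("t-test p-value:     ", p_t)
```

With samples this small the two p-values can differ noticeably; as the 
samples grow they should move closer together.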

What we mean by "large enough" is another story. We can e.g. estimate 
the sampling distribution of the mean using Efron's bootstrap, and run a 
goodness-of-fit test against a normal distribution. What most 
practitioners do, though, is to check if their data are approximately 
normally distributed. That usually signifies a misunderstanding of the 
t-test: they think the data must be normal. They need not be. But if 
the data are normally distributed, we can be sure the sample mean is 
normal as well.
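
The bootstrap check described above could be sketched roughly as 
follows. The helper bootstrap_means, the lognormal toy data, and the 
choice of the Shapiro-Wilk test as the goodness-of-fit test are all 
assumptions for illustration. Note also that with many bootstrap 
replicates a formal test will flag even tiny deviations from normality, 
so the skewness is printed alongside as a rougher guide.

```python
import numpy as np
from scipy import stats


def bootstrap_means(data, n_boot=2000, seed=0):
    """Bootstrap estimate of the sampling distribution of the mean.

    Hypothetical helper: resamples `data` with replacement n_boot times
    and returns the mean of each resample.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    return data[idx].mean(axis=1)


# Toy data (assumed): a strongly skewed population, sampled at two sizes.
rng = np.random.default_rng(1)
small = rng.lognormal(mean=0.0, sigma=1.5, size=10)
large = rng.lognormal(mean=0.0, sigma=1.5, size=1000)

for name, sample in [("n=10", small), ("n=1000", large)]:
    means = bootstrap_means(sample)
    w, p = stats.shapiro(means)  # goodness-of-fit against a normal
    print(name, "skewness of bootstrap means:", stats.skew(means),
          "Shapiro-Wilk p:", p)
```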

So under what circumstances are the assumptions for the permutation test 
not satisfied?

One notable example is the Behrens-Fisher problem! That is, you want to 
compare the expected values of two distributions with different 
variances. The permutation test does not help to solve this problem any 
more than the t-test does. This is clearly a situation where 
distributions matter, showing that the permutation test is not a 
"distribution free" test.
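
A small simulation can illustrate the point. The sketch below draws two 
normal samples with equal means but different variances, so the null 
hypothesis of equal expected values is true, and counts how often each 
test rejects at the 5% level. The sample sizes (5 and 20), standard 
deviations (5 and 1), and the helper names perm_pvalue and 
rejection_rates are assumptions chosen for illustration. In this 
configuration one would expect both rejection rates to come out well 
above the nominal 5%.

```python
import numpy as np
from scipy import stats


def perm_pvalue(x, y, rng, n_perm=500):
    """Monte Carlo permutation p-value for a difference in means."""
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())
    # each row of idx is an independent random permutation of the indices
    idx = rng.random((n_perm, pooled.size)).argsort(axis=1)
    shuffled = pooled[idx]
    diffs = np.abs(shuffled[:, :x.size].mean(axis=1)
                   - shuffled[:, x.size:].mean(axis=1))
    return (np.sum(diffs >= observed - 1e-12) + 1) / (n_perm + 1)


def rejection_rates(n_sim=1000, alpha=0.05, seed=0):
    """Fraction of simulated data sets in which each test rejects H0.

    Assumed setup: equal means, unequal variances, and the smaller
    sample has the larger variance.
    """
    rng = np.random.default_rng(seed)
    reject = {"permutation": 0, "pooled t": 0}
    for _ in range(n_sim):
        x = rng.normal(0.0, 5.0, size=5)   # small sample, large variance
        y = rng.normal(0.0, 1.0, size=20)  # larger sample, small variance
        reject["permutation"] += int(perm_pvalue(x, y, rng) < alpha)
        reject["pooled t"] += int(stats.ttest_ind(x, y).pvalue < alpha)
    return {name: count / n_sim for name, count in reject.items()}


print(rejection_rates())
```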

Sturla