[SciPy-User] "small data" statistics

Sturla Molden sturla at molden.no
Fri Oct 12 07:12:07 EDT 2012


On 12.10.2012 10:36, Emanuele Olivetti wrote:

> In other words the initial question is about quantifying how likely is the
> hypothesis "the instructions do not affect the level of recall"
> (let's call it H_0) given the collected dataset, with respect to how likely is the
> hypothesis "the instructions affect the level of recall" (let's call it H_1)
> given the data. In a bit more formal notation the initial question is about
> estimating p(H_0|data) and p(H_1|data), while the permutation test provides
> a different quantity, which is related (see [0]) to p(data|H_0). Clearly
> p(data|H_0) is different from p(H_0|data).

Here you must use Bayes' formula :)

p(H_0|data) is proportional to p(data|H_0) * p(H_0 a priori)

The scale factor is just a constant (it does not depend on H_0), so you 
can generate samples from p(H_0|data) simply by running a Markov chain 
(e.g. a Gibbs sampler) on the unnormalized product 
p(data|H_0) * p(H_0 a priori).

And that is what we call "Bayesian statistics" :-)
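Just to make that concrete, here is a minimal sketch of the idea. 
Everything in it is an illustrative assumption, not taken from the 
original study: the "recall" scores are made up, the likelihood is a 
normal model for the observed mean difference, the prior is a wide 
normal, and a random-walk Metropolis chain stands in for the Gibbs 
sampler mentioned above.

import numpy as np

np.random.seed(42)

# made-up "recall" scores for the two instruction groups
group_a = np.array([7.0, 5.0, 6.0, 8.0, 6.0])
group_b = np.array([9.0, 8.0, 7.0, 9.0, 10.0])

# summarize the data by the observed mean difference and its std. error
obs_diff = group_b.mean() - group_a.mean()
se = np.sqrt(group_a.var(ddof=1) / len(group_a)
             + group_b.var(ddof=1) / len(group_b))

def log_post(delta, prior_sd=5.0):
    # log p(data | delta): observed difference ~ Normal(delta, se)
    log_lik = -0.5 * ((obs_diff - delta) / se) ** 2
    # log p(delta): weakly informative Normal(0, prior_sd) prior
    log_prior = -0.5 * (delta / prior_sd) ** 2
    return log_lik + log_prior

# random-walk Metropolis chain over the unnormalized posterior
samples = []
delta = 0.0
for i in range(20000):
    proposal = delta + 0.5 * np.random.randn()
    if np.log(np.random.rand()) < log_post(proposal) - log_post(delta):
        delta = proposal
    samples.append(delta)

samples = np.array(samples[2000:])    # discard burn-in
print("posterior mean difference:", samples.mean())
print("P(delta <= 0 | data):", (samples <= 0).mean())

The last number is the kind of statement a Bayesian can make directly: 
the posterior probability, given the model and prior, that the 
instructions do not increase recall.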

The "classical statistics" (sometimes called "frequentist") is very 
different and deals with long-run error rates you would get if the 
experiment and data collection are repeated. In this framework is is 
meaningless to speak about p(H_0|data) or p(H_0 a priori), because H_0 
is not considered a random variable. Probabilities can only be assigned 
to random variables.


The main difference is thus that a Bayesian considers the collected 
data fixed and H_0 random, whereas a frequentist considers the data 
random and H_0 fixed.

To a Bayesian the data are what you got, and "the universal truth about 
H_0" is unknown. Randomness is the uncertainty about this truth, and 
probability is a measure of the precision of our knowledge about H_0 
(the transform -log2(p) expresses it as Shannon information in bits, 
e.g. p = 0.5 corresponds to 1 bit).

To a frequentist, the data are random (i.e. collecting a new set will 
yield a different sample) and "the universal truth about H_0" is fixed 
but unknown. Randomness is the process that gives you a different data 
set each time you draw a sample; it is not the uncertainty about H_0.
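The permutation test mentioned at the top is the frequentist 
counterpart. Using the same made-up scores as in the sketch above, it 
estimates p(a difference at least this extreme | H_0) by 
re-randomizing the group labels; again this is only an illustration, 
not the original analysis.

import numpy as np

np.random.seed(42)

group_a = np.array([7.0, 5.0, 6.0, 8.0, 6.0])
group_b = np.array([9.0, 8.0, 7.0, 9.0, 10.0])

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

# under H_0 the group labels are exchangeable, so reshuffle them
n_perm = 10000
count = 0
for i in range(n_perm):
    perm = np.random.permutation(pooled)
    diff = perm[n_a:].mean() - perm[:n_a].mean()
    if abs(diff) >= abs(observed):
        count += 1

print("two-sided permutation p-value:", float(count) / n_perm)

Note that this p-value quantifies the data under H_0, not the 
probability of H_0 itself, which is exactly the distinction made in the 
quoted message.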


Choosing a side is more a matter of religion than science.


Both approaches have major flaws:

* The Bayesian approach is not scale invariant. A monotonic transform 
y = f(x) can yield a different conclusion if we analyze y instead of x: 
for example, the analysis can favour the null hypothesis on a linear 
scale and reject it on a log scale. The conclusion also depends on your 
prior opinion, which can be subjective.

* The frequentist approach makes it possible to collect too much data. 
If you just collect enough data, any correlation or two-sided test will 
come out significant (see the sketch below). Collecting more data 
should always give you better information, not invariably lead to a 
fixed conclusion. Why do statistics if you know the conclusion in 
advance?
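As an illustration of the second point (made-up numbers, with 
scipy.stats.ttest_ind as the test), a practically negligible but 
non-zero effect becomes "significant" once n is large enough:

import numpy as np
from scipy import stats

np.random.seed(0)
tiny_effect = 0.05   # a shift nobody would care about in practice

for n in (100, 10000, 1000000):
    x = np.random.normal(0.0, 1.0, size=n)
    y = np.random.normal(tiny_effect, 1.0, size=n)
    t, p = stats.ttest_ind(x, y)
    print("n = %7d   two-sided p = %.3g" % (n, p))

The p-value shrinks towards zero as n grows, so with enough data the 
"significant" verdict is guaranteed in advance.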



Sturla
