[SciPy-User] "small data" statistics
Sturla Molden
sturla at molden.no
Fri Oct 12 07:12:07 EDT 2012
On 12.10.2012 10:36, Emanuele Olivetti wrote:
> In other words the initial question is about quantifying how likely is the
> hypothesis "the instructions do not affect the level of recall"
> (let's call it H_0) given the collected dataset, with respect to how likely is the
> hypothesis "the instructions affect the level of recall" (let's call it H_1)
> given the data. In a bit more formal notation the initial question is about
> estimating p(H_0|data) and p(H_1|data), while the permutation test provides
> a different quantity, which is related (see [0]) to p(data|H_0). Clearly
> p(data|H_0) is different from p(H_0|data).
Here you must use Bayes formula :)
p(H_0|data) is proportional to p(data|H_0) * p(H_0 a priori)
The scale factor is just a constant, so you can generate samples from
p(H_0|data) simply by using a Markov chain Monte Carlo method (e.g. a
Gibbs or Metropolis sampler) on the unnormalized product
p(data|H_0) * p(H_0 a priori).
And that is what we call "Bayesian statistics" :-)
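As a minimal sketch of that idea (all numbers hypothetical: a Bernoulli model with 7 successes in 10 trials and a flat prior), a random-walk Metropolis sampler only needs the unnormalized product p(data|theta) * p(theta a priori), because the normalizing constant cancels in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 7 successes in 10 trials
successes, trials = 7, 10

def log_unnorm_posterior(p):
    """log of p(data|p) * p(p a priori), with a flat prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -np.inf
    return successes * np.log(p) + (trials - successes) * np.log(1.0 - p)

# Random-walk Metropolis: only the unnormalized posterior is needed,
# since the normalizing constant cancels in the acceptance ratio.
samples = []
p = 0.5
for _ in range(20_000):
    proposal = p + rng.normal(scale=0.1)
    if np.log(rng.uniform()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(p):
        p = proposal
    samples.append(p)

posterior = np.array(samples[2_000:])  # drop burn-in
print(posterior.mean())  # close to the Beta(8, 4) mean, 8/12 ≈ 0.667
```

With a flat prior the exact posterior is Beta(8, 4), so the retained draws should have a sample mean near 0.67; in a real analysis a library sampler (e.g. PyMC or emcee) would replace this hand-rolled loop.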
The "classical statistics" (sometimes called "frequentist") is very
different and deals with long-run error rates you would get if the
experiment and data collection were repeated. In this framework it is
meaningless to speak about p(H_0|data) or p(H_0 a priori), because H_0
is not considered a random variable. Probabilities can only be assigned
to random variables.
The main difference from the Bayesian approach is thus that a Bayesian
considers the collected data fixed and H_0 random, whereas a frequentist
considers the data random and H_0 fixed.
To a Bayesian the data are what you got, and "the universal truth about
H0" is unknown. Randomness is the uncertainty about this truth.
Probability measures the precision of our knowledge about H0.
Applying the transform -log2(p) yields the Shannon information
(surprisal) in bits.
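For the bits remark: the self-information (surprisal) of an outcome with probability p is -log2(p) bits, e.g.:

```python
import math

def information_bits(p):
    """Shannon self-information (surprisal) of an outcome with probability p."""
    return -math.log2(p)

print(information_bits(0.5))    # 1.0 bit: one fair coin flip
print(information_bits(0.125))  # 3.0 bits: three fair coin flips
```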
To a frequentist, the data are random (i.e. collecting a new set will
yield a different sample) and "the universal truth about H0" is fixed
but unknown. Randomness is the process that gives you a different data
set each time you draw a sample. It is not the uncertainty about H0.
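The permutation test from the start of the thread is the textbook frequentist procedure here: under H_0 the group labels are exchangeable, so reshuffling them simulates "collecting the data again". A minimal sketch with made-up recall scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical recall scores for the two instruction groups
group_a = np.array([12, 15, 11, 14, 13, 16, 12])
group_b = np.array([9, 11, 9, 12, 10, 11, 10])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

# Under H_0 the labels are exchangeable: shuffle and recompute the statistic
n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:len(group_a)].mean() - perm[len(group_a):].mean()
    if abs(diff) >= abs(observed):
        count += 1

# This estimates p(data at least this extreme | H_0), NOT p(H_0 | data)
p_value = (count + 1) / (n_perm + 1)
print(p_value)
```

Note what the p-value is: the long-run rate of differences at least this large under relabeling, i.e. a p(data|H_0)-type quantity, not p(H_0|data).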
Choosing a side is more a matter of religion than science.
Both approaches have major flaws:
* The Bayesian approach is not scale invariant. A monotonic transform
like y = f(x) can yield a different conclusion if we analyze y instead
of x. For example, your null hypothesis can hold on a linear scale and
fail on a log scale. Also, the conclusion depends on your prior opinion,
which can be subjective.
* The frequentist approach makes it possible to collect too much data.
If you just collect enough data, almost any correlation or two-sided
test will come out significant. Collecting more data should give you
better information, not invariably lead to a fixed conclusion. Why do
statistics if you know the conclusion in advance?
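A quick simulation of that flaw (made-up numbers: a negligible true difference of 0.02 standard deviations) shows the two-sided t-test p-value being driven toward zero as n grows, even though the effect's practical size never changes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A tiny but nonzero true effect: means differ by 0.02 standard deviations
pvals = {}
for n in (100, 10_000, 1_000_000):
    x = rng.normal(0.00, 1.0, size=n)
    y = rng.normal(0.02, 1.0, size=n)
    _, p = stats.ttest_ind(x, y)
    pvals[n] = p
    print(n, p)
```

At n = 100 the test has essentially no power against such a small effect; at n = 1,000,000 the same effect is overwhelmingly "significant".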
Sturla