[SciPy-User] "small data" statistics

Nathaniel Smith njs at pobox.com
Fri Oct 12 07:22:16 EDT 2012


On 12 Oct 2012 09:37, "Emanuele Olivetti" <emanuele at relativita.com> wrote:
>
> On 10/11/2012 04:57 PM, josef.pktd at gmail.com wrote:
> > Most statistical tests and statistical inference in scipy.stats and
> > statsmodels rely on large number assumptions.
> >
> > Everyone is talking about "Big data", but is anyone still interested
> > in doing small sample statistics in Python?
> >
> > I'd like to know whether it's worth spending any time on general
> > purpose small sample statistics.
> >
> > for example:
> >
> > http://facultyweb.berry.edu/vbissonnette/statshw/doc/perm_2bs.html
> >
> > ```
> > Example homework problem:
> > [...]
> > Shallow Processing: 13 12 11 9 11 13 14 14 14 15
> > Deep Processing: 12 15 14 14 13 12 15 14 16 17
> > ```
>
> I am very interested in inference from small samples, but I have
> some concerns about both the example and the proposed approach
> based on the permutation test.
>
> IMHO the question in the example at that URL, i.e. "Did the instructions
> given to the participants significantly affect their level of recall?" is
> not directly addressed by the permutation test.

In this sentence, the word "significantly" is a term of art used to refer
exactly to the quantity p(t>T(data)|H_0). So, yes, the permutation test
addresses the original question; you just have to be familiar with the
field's particular jargon to understand what they're saying. :-)
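
For concreteness, here is a minimal sketch of that quantity for the homework
data quoted above, computed by enumerating every way of relabelling the 20
scores into two groups of 10. The function name `exact_permutation_pvalue` is
hypothetical, the test is assumed to be one-sided (deep > shallow), and ties
are counted as "at least as extreme" (>= rather than the strict > written
above), which is the usual convention.

```python
from itertools import combinations

# Recall scores from the homework example quoted above.
shallow = [13, 12, 11, 9, 11, 13, 14, 14, 14, 15]
deep = [12, 15, 14, 14, 13, 12, 15, 14, 16, 17]


def exact_permutation_pvalue(x, y):
    """One-sided exact permutation test for mean(y) - mean(x).

    Returns (observed difference in means, p-value), where the p-value is
    the fraction of relabellings whose statistic is >= the observed one.
    """
    pooled = list(x) + list(y)
    n_y = len(y)
    observed_sum = sum(y)
    observed_diff = sum(y) / len(y) - sum(x) / len(x)

    # With fixed group sizes, the difference in means is an increasing
    # function of the sum of the "y" group, so comparing sums is equivalent
    # to comparing differences in means (and avoids float round-off for
    # integer data like this).
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_y):
        if sum(pooled[i] for i in idx) >= observed_sum:
            extreme += 1
        total += 1
    return observed_diff, extreme / total


observed_diff, p_value = exact_permutation_pvalue(shallow, deep)
print("difference in means:", observed_diff)
print("one-sided permutation p-value:", p_value)
```

This walks through all 184756 relabellings, which takes well under a few
seconds in pure Python for samples this small.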

> The permutation test is related to the question "how (un)likely is the
> collected dataset under the assumption that the instructions did not
> affect the level of recall?".
>
> In other words, the initial question is about quantifying how likely the
> hypothesis "the instructions do not affect the level of recall" (let's
> call it H_0) is given the collected dataset, compared with how likely the
> hypothesis "the instructions affect the level of recall" (let's call it
> H_1) is given the data. In more formal notation, the initial question is
> about estimating p(H_0|data) and p(H_1|data), while the permutation test
> provides a different quantity, which is related (see [0]) to p(data|H_0).
> Clearly p(data|H_0) is different from p(H_0|data).
> Literature on this point is for example
> http://dx.doi.org/10.1016/j.socec.2004.09.033
>
> On a different note, I am also interested in understanding the
> assumptions under which the permutation test is expected to work. I am
> not an expert in that field but, as far as I know, the permutation test
> (and all resampling approaches in general) requires that the sample be
> "representative" of the underlying distribution of the problem. In my
> opinion this requirement is difficult to assess in practice, and it is
> even more troubling for the specific case of "small data", which is of
> interest for this thread.

All tests require some kind of representativeness, and this isn't really a
problem. The data are by definition representative (in the technical sense)
of the distribution they were drawn from. (The trouble comes when you want
to decide whether that distribution matches anything you care about, but
looking at the data won't tell you that.) A well designed test is one that
is correct on average across samples.

The alternative to a permutation test here is to make very strong
assumptions about the underlying distributions (e.g. with a t test), and
these assumptions are often justified only for large samples. Resampling
tests are computationally expensive, but that is no problem for small
samples. That's why nonparametric tests are often better in this setting.
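
To put a rough number on "computationally expensive", the sketch below counts
how many distinct relabellings an exact two-sample permutation test has to
visit, and shows the usual Monte Carlo fallback for when that count gets out
of hand. The helper names `n_relabelings` and `mc_permutation_pvalue` are
hypothetical, the test is assumed one-sided, and the +1 in numerator and
denominator is one common convention for keeping the estimate away from zero.

```python
import math
import random


def n_relabelings(n_x, n_y):
    """Number of distinct ways to split n_x + n_y pooled values in two groups."""
    return math.comb(n_x + n_y, n_x)


def mc_permutation_pvalue(x, y, n_resamples=10000, seed=0):
    """Monte Carlo estimate of the one-sided permutation p-value.

    The statistic is mean(y) - mean(x); with fixed group sizes it is enough
    to compare the sum of the relabelled "y" group with the observed sum.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_y = len(y)
    observed_sum = sum(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # a random relabelling of the pooled scores
        if sum(pooled[:n_y]) >= observed_sum:
            hits += 1
    # Count the observed labelling itself as one of the resamples.
    return (hits + 1) / (n_resamples + 1)


for n in (10, 20, 50):
    print(n, "per group:", n_relabelings(n, n), "relabellings")

# Recall scores from the homework example quoted earlier in the thread.
shallow = [13, 12, 11, 9, 11, 13, 14, 14, 14, 15]
deep = [12, 15, 14, 14, 13, 12, 15, 14, 16, 17]
p_mc = mc_permutation_pvalue(shallow, deep)
print("Monte Carlo one-sided p-value:", p_mc)
```

With 10 observations per group, full enumeration (184756 relabellings) is
cheap; the count grows so quickly with sample size that random relabelling is
the practical choice for anything much bigger.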

-n

> Any comment on these points is warmly welcome.
>
> Best,
>
> Emanuele
>
> [0] A minor detail: I said "related" because the outcome of the
> permutation test, and of classical tests for hypothesis testing in
> general, is not precisely p(data|H_0). First of all, those tests rely on
> a statistic of the dataset and not on the dataset itself. In the example
> at the URL the statistic (called "criterion" there) is the difference
> between the means of the two groups. Second and more important, the test
> provides an estimate of the probability of observing such a value for the
> statistic... "or a more extreme one". So if we call the statistic over
> the data T(data), then the classical tests provide p(t>T(data)|H_0), and
> not p(data|H_0). Anyway, even p(t>T(data)|H_0) is clearly different from
> the initial question, i.e. p(H_0|data).