[SciPy-Dev] Resolving PR 235: t-statistic = 0/0 case

Wed Jun 6 17:37:45 EDT 2012

On Wed, Jun 6, 2012 at 10:18 PM, Junkshops <junkshops at gmail.com> wrote:
> Hi Nathaniel,
>
> At the outset, I'll just say that if the consensus is that we should
> return NaN, I'll accept that. I'll still try and argue my case though.
>
>> My R seems to throw an exception whenever the variance is zero
>> (regardless of the mean difference), not return NaN:
> Sorry, yes, that's correct.
>
>> Like any parametric test, the t-test only makes sense under some kind
>> of (at least approximate) assumptions about the data generating
>> process. When the sample variance is 0, then those assumptions are
>> clearly violated,
> So this seems similar to argument J2, and I still don't understand it.
> Let's say we assume our population data is normally distributed and we
> take three samples from the population and get [1,1,1]. How does that
> prove our assumption is incorrect? It's certainly possible to pull the
> same number three times from a normal distribution.

Well, no, it isn't possible really -- taking n IID samples from a
normal distribution and getting exactly the same number twice is an
event that has probability zero. (Note that this is a stronger
statement than the one you were comparing it to, that getting any
specific vector of samples has probability zero. There are infinitely
many vectors which contain at least one duplicate; the set of all of
them collectively has probability zero.) OTOH there are many other
common processes which do produce such samples, and in practice those
are where such samples come from.
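
A minimal simulation sketch of this contrast, assuming three draws per
sample as in the [1,1,1] example. The "coarse instrument" that rounds to
whole units is a hypothetical stand-in for the other processes mentioned
above, and floating-point draws only approximate a continuous
distribution (repeats are astronomically unlikely rather than strictly
impossible).

```python
import random

def has_duplicate(values):
    """Return True if any value occurs more than once."""
    return len(set(values)) < len(values)

def duplicate_rate(draw, n_samples=3, n_trials=20000, seed=0):
    """Fraction of trials in which a sample of size n_samples contains a repeat."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        sample = [draw(rng) for _ in range(n_samples)]
        hits += has_duplicate(sample)
    return hits / n_trials

# Idealised continuous measurement: raw draws from N(0, 1).
continuous_rate = duplicate_rate(lambda rng: rng.gauss(0.0, 1.0))

# Hypothetical coarse instrument: the same kind of draws rounded to whole units.
rounded_rate = duplicate_rate(lambda rng: round(rng.gauss(0.0, 1.0)))

print("continuous:", continuous_rate)
print("rounded:   ", rounded_rate)
```

Repeats in the rounded case come from the measurement process, not from
the normal distribution itself.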

>> and it doesn't seem appropriate to me to start
>> making up numbers according to some other rule that we hope might give
>> some sort-of appropriate result ("In the face of ambiguity, refuse the
>> temptation to guess."). So I actually like the R/Matlab option of
>> throwing an exception or returning NaN.
>
> Well, we're not making up numbers here - we absolutely know the means
> are the same. Hence p = 1 and t = 0.

That is not what p=1 and t=0 mean in any kind of frequentist test. If
the means are the same (the null hypothesis is true), then p should be
a sample from a uniform distribution on [0, 1] and t a sample from a t
distribution.
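
A rough way to check this by simulation, as a sketch rather than
anything from PR 235: draw two groups from the same normal distribution
many times and look at the spread of the pooled-variance t statistic.
The helper `pooled_t`, the group size of 10, and the seed are
assumptions made for illustration. With 18 degrees of freedom the
two-sided 5% critical value is about 2.101, so |t| > 2.101 corresponds
to p < 0.05.

```python
import math
import random

def pooled_t(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    mean_a = math.fsum(a) / na
    mean_b = math.fsum(b) / nb
    ss_a = math.fsum((x - mean_a) ** 2 for x in a)
    ss_b = math.fsum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))

rng = random.Random(12345)
n, n_trials = 10, 20000
# Two-sided 5% critical value of the t distribution with 18 degrees of freedom.
T_CRIT = 2.101

t_values = []
for _ in range(n_trials):
    # Both groups share the same mean, so the null hypothesis is true.
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    t_values.append(pooled_t(a, b))

frac_rejected = sum(abs(t) > T_CRIT for t in t_values) / n_trials
frac_negative = sum(t < 0 for t in t_values) / n_trials
print("fraction with |t| > 2.101 (i.e. p < 0.05):", frac_rejected)
print("fraction with t < 0:", frac_negative)
```

If p were pinned at 1 whenever the means are equal, the first fraction
would be 0; under a uniform p it should come out near 0.05.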

And even if we pretend that such data did come from a normal
distribution (which is never going to be what happened in practice),
we still don't have any evidence that the means are the same. What we
know is that the difference in the means is too small for us to
measure, and also that the variance is too small for us to measure.
But we don't know their relative sizes. It sounds like you're mixing
up substantive and significant differences -- certainly no one cares
if their means differ by 10^-20, but if you somehow had an instrument
that could produce measurements accurate to 10^-30, then a 10^-20
difference could easily be statistically significant.
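
To make the 10^-20 versus 10^-30 point concrete, here is a small sketch
with made-up data: two groups of five readings whose means differ by
10^-20, with measurement scatter on the order of 10^-30. The data
values and the helper `pooled_t` are illustrative assumptions, not
anything from the PR.

```python
import math

def pooled_t(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    mean_a = math.fsum(a) / na
    mean_b = math.fsum(b) / nb
    ss_a = math.fsum((x - mean_a) ** 2 for x in a)
    ss_b = math.fsum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))

# Hypothetical instrument scatter of order 1e-30 around each group's mean.
noise = [k * 1e-30 for k in (-2, -1, 0, 1, 2)]

group_a = list(noise)                   # centred on 0
group_b = [1e-20 + e for e in noise]    # centred on 1e-20

mean_diff = math.fsum(group_b) / 5 - math.fsum(group_a) / 5
t = pooled_t(group_b, group_a)
print("difference in means:", mean_diff)
print("t statistic:", t)
```

The difference in means is negligible in any practical sense, yet the t
statistic is enormous because the standard error is smaller still. The
0/0 case is the limit where neither quantity can be resolved, so their
ratio is unknown rather than zero.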

Whatever R does is usually based on the consensus of a bunch of really
picky, opinionated, professional statisticians; I think they have a
good reason for making the choice they did here.

-N