[SciPy-Dev] Resolving PR 235: t-statistic = 0/0 case

Skipper Seabold jsseabold at gmail.com
Wed Jun 6 17:35:58 EDT 2012


On Wed, Jun 6, 2012 at 5:18 PM, Junkshops <junkshops at gmail.com> wrote:
> Hi Nathaniel,
>
> At the outset, I'll just say that if the consensus is that we should
> return NaN, I'll accept that. I'll still try and argue my case though.
>
>> My R seems to throw an exception whenever the variance is zero
>> (regardless of the mean difference), not return NaN:
> Sorry, yes, that's correct.
>
>> Like any parametric test, the t-test only makes sense under some kind
>> of (at least approximate) assumptions about the data generating
>> process. When the sample variance is 0, then those assumptions are
>> clearly violated,
> So this seems similar to argument J2, and I still don't understand it.
> Let's say we assume our population data is normally distributed and we
> take three samples from the population and get [1,1,1]. How does that
> prove our assumption is incorrect? It's certainly possible to pull the
> same number three times from a normal distribution.
>

How do you justify that 3 empirical observations [1,1,1] come from a
normal distribution? If you have enough data for the central limit
theorem to come into play, and your variance is still 0, this is so
unlikely that I think the consequences of *possibly* incorrectly
returning NaN here would be small. If you're simulating data from a
known distribution, take another draw...

>> and it doesn't seem appropriate to me to start
>> making up numbers according to some other rule that we hope might give
>> some sort-of appropriate result ("In the face of ambiguity, refuse the
>> temptation to guess."). So I actually like the R/Matlab option of
>> throwing an exception or returning NaN.
>
> Well, we're not making up numbers here - we absolutely know the means
> are the same. Hence p = 1 and t = 0.

But what we don't know is whether the test is even appropriate, so why
not be cautious and return NaN? It's very easy for a user to decide
that NaN implies p = 1, if that's the behavior they want.
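
To make that concrete, here is a rough sketch of what the user-side
decision could look like. The helper names are hypothetical and are not
part of scipy.stats; the sketch assumes a pooled-variance two-sample t
statistic and simply returns NaN whenever the standard error is zero.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Pooled-variance two-sample t statistic (hypothetical helper).

    Returns NaN when the standard error is zero, since the statistic
    is undefined there (0/0 when the means also agree).
    """
    na, nb = len(a), len(b)
    diff = mean(a) - mean(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    if se == 0.0:
        return math.nan  # refuse to guess in the degenerate case
    return diff / se


def nan_as_no_difference(t, p):
    """User-side policy (an assumption, not library behavior):
    read an undefined result as t = 0, p = 1."""
    if math.isnan(t) or math.isnan(p):
        return 0.0, 1.0
    return t, p


t = pooled_t_statistic([1, 1, 1], [1, 1, 1])   # NaN: zero variance
t, p = nan_as_no_difference(t, math.nan)       # caller opts in to t = 0, p = 1
```

The point is that the library stays conservative, and anyone who is
sure the degenerate case means "no difference" can say so in two lines.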

This doesn't seem to be of all that much practical importance. In what
situation do you expect this to really matter?

Skipper


