[SciPy-Dev] Resolving PR 235: t-statistic = 0/0 case

Wed Jun 6 16:48:26 EDT 2012

On Wed, Jun 6, 2012 at 9:18 PM, Junkshops <junkshops at gmail.com> wrote:
> Hi all,
>
> This mail references PR 235: https://github.com/scipy/scipy/pull/235
>
> The PR adds a method for performing a t-test on 2 samples with unequal
> or unknown variances and changes the return value when the t-statistic =
> 0 / 0 to 0 from the previous code's return value, 1, for all t-test
> functions.
>
> The latter change is a point of contention for the pull. I take the
> position that t = 0/0 should return 0, while Josef, the primary
> scipy.stats maintainer, believes that t = 0/0 should return one.
> Normally I would try to resolve this directly with Josef, but
> unfortunately I haven't heard from him in over two days, and the beta
> release is scheduled for Saturday. As such, I'm writing the dev list to
> ask what to do next.
>
> Josef's position, to the best of my ability to understand it, is that:
>
> J1) Adding a small amount of noise to data sets that have otherwise
> equal means (e.g. [0,0,0,0], [0,0,0,1e-100]) results in a t-statistic of
> 1. Thus, as the mean difference approaches zero, t -> 1.
> J2) A data set with no mean difference and no variance is a zero
> probability event. As such, returning t = 1 is reasonable, as therefore
> p = 0.317-0.5 for a two tailed test depending on the degrees of freedom,
> and hence for standard values of alpha the null hypothesis will not be
> rejected, but the user gets some feedback that his data is suspect.
>
> I admit I don't completely understand the second argument. Hopefully
> when Josef resurfaces he can correct my representation of his argument
> if needed.
>
> My responses to these arguments are:
>
> J1) If you take the n-length vectors (x,...,x) and (x,...,x,x+y) and
> solve for t, t = 1 for any value of x and y. This is simply due to the
> fact that the mean difference is y/n and the pooled variance (i.e.
> the denominator of the t-statistic) is also y/n. Thus this is merely a
> special case and does not represent a limit as the mean difference
> approaches zero, since even for [0,0,0,0], [0,0,0,1e100] t = 1.
>
> J2) Strictly speaking, if we're pulling independent random samples from
> a normal distribution over the reals, say, any given set of samples has
> zero probability since there are an infinite number of possible samples.
> The sample [0,0,0] is just as probable as the sample [2.3, 7.4, 2.1]. In
> a discrete distribution, in fact, the sample [0,0,0] is *more* likely
> than any other sample. Also, we're now analyzing the legitimacy of the
> user's data, which is not the job of the t-test. The t-test is expected
> to take one or more data sets and provide a measure of the strength of
> the evidence that the means are different. It is not expected to
> comment on the validity of the user's data.
>
> My arguments that 0 is the correct answer to return are:
>
> 1) if MD = mean difference and PV = pooled variance, then t = MD/PV. As
> MD -> 0 for any PV > 0, t -> 0. However, if we set t = 0/0 = 1 and MD =
> 0, as PV -> 0 we introduce a discontinuity at zero since t = 0 for any
> value of PV except 0, where t = 1. This implies that for MD = PV = 0, t
> = 1, but if the variance of the dataset is infinitesimally small but
> !=0, then t = 0. To me, this makes no sense.
>
> 2) The t-test's purpose is to measure the strength of the evidence
> against the null hypothesis that MD = 0. If MD = 0, as in the case we're
> discussing, by definition there is no evidence that MD != 0. Therefore p
> must = 1, and as a consequence t must = 0.
>
> I also consulted with a statistics professor - who agreed with my
> position - to make sure I wasn't talking out of turn.
>
> In summary, I think that if t = 0/0 we should return 0. However, both R
> and Matlab return NaN. To me this seems incorrect due to argument #2.
> Josef also mentioned there were users of scipy.stats.stats that didn't
> want NaN return values for some reason that I don't know offhand.
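
For reference, the algebra in the quoted response to J1 can be checked
numerically. The sketch below is only an illustration and not the code
in the PR: pooled_t is a hypothetical helper written with the Python
standard library, and it assumes the classic pooled-variance form of
the two-sample statistic. With the samples in this order the mean
difference is negative, so the statistic comes out with magnitude 1
and a negative sign.

```python
import math
import statistics


def pooled_t(a, b):
    """Two-sample t statistic using a pooled variance estimate.

    Hypothetical helper for illustration; not scipy's implementation.
    """
    na, nb = len(a), len(b)
    # Pool the two sample variances (each computed with n - 1).
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    # Standard error of the difference of the two sample means.
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


# The size of the perturbation y does not change the magnitude of t.
for y in (1e-100, 1.0, 1e100):
    a = [0.0, 0.0, 0.0, 0.0]
    b = [0.0, 0.0, 0.0, y]
    print(y, pooled_t(a, b))
```

The magnitude stays at 1 whether y is 1e-100 or 1e100, which is the
point of the quoted response: this family of inputs says nothing
about a limit as the mean difference goes to zero.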

I won't comment on the question of the urgency of this, but have some
thoughts on the behavior.

My R seems to throw an exception whenever the variance is zero
(regardless of the mean difference), not return NaN:

> t.test(c(1, 1, 1), c(2, 2, 2))
Error in t.test.default(c(1, 1, 1), c(2, 2, 2)) :
  data are essentially constant

I find the arguments about "convergence" really uncompelling. If you
do a t-test between two equal, constant vectors with a tiny amount of
noise added, then the t statistic does not converge on anything -- it
has a t distribution!

> almost.ones <- function() { c(1, 1, 1) + 1e-10 * rnorm(3) }
> replicate(5, t.test(almost.ones(), almost.ones())$statistic)
         t          t          t          t          t
-3.1954354  0.7893332  0.1240638 -0.6432737 -0.6349522
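
A rough Python analogue of the R snippet, for anyone who wants to
reproduce this without R. It is a sketch under a few assumptions:
pooled_t and almost_ones are hypothetical helpers, only the standard
library is used, and the pooled-variance statistic is computed rather
than the Welch one that R's t.test uses by default (for equal sample
sizes the two statistics coincide; only the degrees of freedom
differ). The seed is arbitrary.

```python
import math
import random
import statistics


def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance (illustrative)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def almost_ones(rng, n=3, scale=1e-10):
    """A constant vector of ones plus a tiny amount of Gaussian noise."""
    return [1.0 + scale * rng.gauss(0.0, 1.0) for _ in range(n)]


rng = random.Random(0)  # arbitrary seed, only for repeatability
stats = [pooled_t(almost_ones(rng), almost_ones(rng)) for _ in range(5)]
print(stats)
```

The values scatter from draw to draw instead of settling near any
single number, however small the noise scale is made.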

Like any parametric test, the t-test only makes sense under some kind
of (at least approximate) assumptions about the data generating
process. When the sample variance is 0, then those assumptions are
clearly violated, and it doesn't seem appropriate to me to start
making up numbers according to some other rule that we hope might give
some sort-of appropriate result ("In the face of ambiguity, refuse the
temptation to guess."). So I actually like the R/Matlab option of
throwing an exception or returning NaN.
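
To make the NaN option concrete, here is a minimal sketch. The
function name t_or_nan is hypothetical, it is not what PR 235
implements, and the exact zero test on the standard error is an
assumption (R's message above suggests it uses a tolerance instead).

```python
import math
import statistics


def t_or_nan(a, b):
    """Pooled two-sample t statistic, or NaN if it is undefined.

    Illustrative only; not scipy's implementation.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    if std_err == 0.0:
        # Both samples are constant: refuse to guess a value.
        return math.nan
    return (statistics.mean(a) - statistics.mean(b)) / std_err


print(t_or_nan([1.0, 1.0, 1.0], [2.0, 2.0, 2.0]))  # constant samples
print(t_or_nan([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))  # the 0/0 case
```

Both calls hit the zero standard error branch, so the 0/0 case and
the nonzero/0 case are treated the same way, matching the R behavior
shown above where the error does not depend on the mean difference.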

-N