[SciPy-Dev] Resolving PR 235: t-statistic = 0/0 case

Wed Jun 6 16:18:11 EDT 2012

Hi all,

This mail references PR 235: https://github.com/scipy/scipy/pull/235

The PR adds a method for performing a t-test on 2 samples with unequal 
or unknown variances. It also changes, for all t-test functions, the 
value returned when the t-statistic is 0/0: the previous code returned 
1, and the PR returns 0.

The latter change is a point of contention for the pull. I take the 
position that t = 0/0 should return 0, while Josef, the primary 
scipy.stats maintainer, believes that t = 0/0 should return 1. 
Normally I would try to resolve this directly with Josef, but 
unfortunately I haven't heard from him in over two days, and the beta 
release is scheduled for Saturday. As such, I'm writing the dev list to 
ask what to do next.

Josef's position, to the best of my ability to understand it, is that:

J1) Adding a small amount of noise to data sets that have otherwise 
equal means (e.g. [0,0,0,0], [0,0,0,1e-100]) results in a t-statistic of 
1. Thus, as the mean difference approaches zero, t -> 1.
J2) A data set with no mean difference and no variance is a 
zero-probability event. Returning t = 1 is therefore reasonable: it 
gives p = 0.317-0.5 for a two-tailed test, depending on the degrees of 
freedom, so for standard values of alpha the null hypothesis will not be 
rejected, but the user still gets some feedback that the data is suspect.
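
As a quick check of the p-value range quoted in J2, here is a minimal 
sketch using only the Python standard library. It evaluates the two 
extremes of the degrees of freedom: df = 1, where the t distribution is 
the Cauchy distribution, and df -> infinity, where it approaches the 
standard normal. The function names are made up for this illustration 
and are not part of scipy.stats.

```python
import math

def two_tailed_p_df1(t):
    # Student's t with 1 degree of freedom is the Cauchy distribution:
    # P(|T| > t) = 1 - (2 / pi) * atan(|t|)
    return 1.0 - (2.0 / math.pi) * math.atan(abs(t))

def two_tailed_p_normal_limit(t):
    # As df -> infinity the t distribution approaches the standard normal:
    # P(|Z| > t) = erfc(|t| / sqrt(2))
    return math.erfc(abs(t) / math.sqrt(2.0))

p_df1 = two_tailed_p_df1(1.0)             # upper end of the range, 0.5
p_limit = two_tailed_p_normal_limit(1.0)  # lower end, roughly 0.317

print(p_df1, p_limit)
```

Both values are well above the usual alpha levels (0.05, 0.01), which 
matches the claim that t = 1 would not lead to rejecting the null.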

I admit I don't completely understand the second argument. Hopefully 
when Josef resurfaces he can correct my representation of his argument 
if needed.

My responses to these arguments are:

J1) If you take the n-length vectors (x,...,x) and (x,...,x,x+y) and 
solve for t, t = 1 for any x and any nonzero y. This is simply due to 
the fact that the mean difference is y/n and the pooled variance (i.e. 
the denominator of the t-statistic) is also y/n. Thus this is merely a 
special case and does not represent a limit as the mean difference 
approaches zero, since even for [0,0,0,0], [0,0,0,1e100] t = 1.

J2) Strictly speaking, if we're pulling independent random samples from 
a normal distribution over the reals, say, any given set of samples has 
zero probability since there are an infinite number of possible samples. 
The sample [0,0,0] is just as probable as the sample [2.3, 7.4, 2.1]. In 
a discrete distribution, in fact, the sample [0,0,0] is *more* likely 
than any other sample. Also, we're now analyzing the legitimacy of the 
user's data, which is not the job of the t-test. The t-test is expected 
to take one or more data sets and provide a measure of the strength of 
the evidence that the means are different. It is not expected to comment 
on the validity of the user's data.
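
The response to J1 above can be checked numerically. The following is a 
minimal sketch that writes out the equal-variance two-sample t-statistic 
by hand with the standard library; pooled_t is a hypothetical helper for 
this mail, not the scipy.stats implementation. The sign of the result 
depends on the order of the samples and on the sign of y, so the sketch 
takes the second mean minus the first and uses y > 0.

```python
import math

def pooled_t(a, b):
    """Equal-variance two-sample t-statistic, written out by hand."""
    na, nb = len(a), len(b)
    mean_a = sum(a) / na
    mean_b = sum(b) / nb
    # Sums of squared deviations from each sample mean.
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    # Standard error of the mean difference: the denominator of t.
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean_b - mean_a) / std_err

n = 4
results = []
for x, y in [(0.0, 1e-100), (0.0, 1.0), (3.0, 2.0), (0.0, 1e100)]:
    a = [x] * n
    b = [x] * (n - 1) + [x + y]
    results.append(pooled_t(a, b))

print(results)
```

The statistic stays at 1 (up to rounding) whether y is 1e-100 or 1e100, 
so the size of the perturbation has no effect on t.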

My arguments that 0 is the correct answer to return are:

1) if MD = mean difference and PV = pooled variance, then t = MD/PV. As 
MD -> 0 for any PV > 0, t -> 0. However, if we set t = 0/0 = 1 and MD = 
0, as PV -> 0 we introduce a discontinuity at zero since t = 0 for any 
value of PV except 0, where t = 1. This implies that for MD = PV = 0, t 
= 1, but if the variance of the dataset is infinitesimally small but 
nonzero, then t = 0. To me, this makes no sense.

2) The t-test's purpose is to measure the strength of the evidence 
against the null hypothesis that MD = 0. If MD = 0, as in the case we're 
discussing, by definition there is no evidence that MD != 0. Therefore p 
must = 1, and as a consequence t must = 0.
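
The discontinuity described in argument 1) can be illustrated with a 
small sketch. It uses the simplified form t = MD/PV from that argument, 
and t_with_convention is a hypothetical helper that only makes the 
0/0 convention explicit; it is not code from the PR.

```python
def t_with_convention(md, pv, zero_over_zero):
    """Return md / pv, using `zero_over_zero` when both are exactly 0."""
    if md == 0.0 and pv == 0.0:
        return zero_over_zero
    # Only md == 0 with pv > 0 is exercised below; a nonzero md with
    # pv == 0 is outside the scope of this sketch.
    return md / pv

# MD is fixed at 0 while PV shrinks towards 0 and finally equals 0.
pvs = [1.0, 1e-10, 1e-100, 1e-300, 0.0]
t_if_one = [t_with_convention(0.0, pv, 1.0) for pv in pvs]
t_if_zero = [t_with_convention(0.0, pv, 0.0) for pv in pvs]

print(t_if_one)   # jumps at PV == 0
print(t_if_zero)  # constant in PV
```

With the 0/0 = 1 convention the sequence is 0 for every positive PV and 
then jumps to 1 at PV = 0; with 0/0 = 0 it is 0 throughout.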

I also consulted with a statistics professor - who agreed with my 
position - to make sure I wasn't talking out of turn.

In summary, I think that if t = 0/0 we should return 0. However, both R 
and Matlab return NaN. To me this seems incorrect because of argument 2) 
above. Josef also mentioned that there are users of scipy.stats.stats 
who don't want NaN return values, for reasons I don't know offhand.

How would the Scipy devs like to proceed?

Cheers, Gavin