[SciPy-Dev] Resolving PR 235: t-statistic = 0/0 case

Wed Jun 6 16:50:30 EDT 2012

On Wed, Jun 6, 2012 at 4:18 PM, Junkshops <junkshops at gmail.com> wrote:
> Hi all,
>
> This mail references PR 235: https://github.com/scipy/scipy/pull/235
>
> The PR adds a method for performing a t-test on 2 samples with unequal
> or unknown variances and changes the return value when the t-statistic =
> 0 / 0 to 0 from the previous code's return value, 1, for all t-test
> functions.
>
> The latter change is a point of contention for the pull. I take the
> position that t = 0/0 should return 0, while Josef, the primary
> scipy.stats maintainer, believes that t = 0/0 should return 1.
> Normally I would try to resolve this directly with Josef, but
> unfortunately I haven't heard from him in over two days, and the beta
> release is scheduled for Saturday. As such, I'm writing the dev list to
> ask what to do next.
>
> Josef's position, to the best of my ability to understand it, is that:
>
> J1) Adding a small amount of noise to data sets that have otherwise
> equal means (e.g. [0,0,0,0], [0,0,0,1e-100]) results in a t-statistic of
> 1. Thus, as the mean difference approaches zero, t -> 1.
> J2) A data set with no mean difference and no variance is a zero
> probability event. As such, returning t = 1 is reasonable: it gives
> p = 0.317-0.5 for a two-tailed test, depending on the degrees of freedom,
> so for standard values of alpha the null hypothesis will not be
> rejected, but the user gets some feedback that his data is suspect.
>
> I admit I don't completely understand the second argument. Hopefully
> when Josef resurfaces he can correct my representation of his argument
> if needed.
>
> My responses to these arguments are:
>
> J1) If you take the n-length vectors (x,...,x) and (x,...,x,x+y) and
> solve for t, t = 1 for any value of x and y. This is simply due to the
> fact that the mean difference is y/n and the pooled variance (i.e.
> the denominator of the t-statistic) is also y/n. Thus this is merely a
> special case and does not represent a limit as the mean difference
> approaches zero, since even for [0,0,0,0], [0,0,0,1e100] t = 1.
>
> J2) Strictly speaking, if we're pulling independent random samples from
> a normal distribution over the reals, say, any given set of samples has
> zero probability since there are an infinite number of possible samples.
> The sample [0,0,0] is just as probable as the sample [2.3, 7.4, 2.1]. In
> a discrete distribution, in fact, the sample [0,0,0] is *more* likely
> than any other sample. Also, we're now analyzing the legitimacy of the
> user's data, which is not the job of the t-test. The t-test is expected
> to take one or more data sets and provide a measure of the strength of
> the evidence that the means are different. It is not expected to comment
> on the validity of the user's data.
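
The t = 1 claim in the J1 response above can be checked numerically. The
following is a minimal sketch in plain Python. The helper name pooled_t is
hypothetical and written only for this illustration (it is not the code in
the PR or in scipy.stats). It assumes the pooled-variance form of the
two-sample t-statistic and takes the difference as mean(b) - mean(a), so t
has the sign of y.

```python
import math


def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance: (mean(b) - mean(a)) / SE.

    Hypothetical helper for illustration only. It raises ZeroDivisionError
    when the standard error is exactly 0, which is the 0/0 case under
    discussion.
    """
    na, nb = len(a), len(b)
    mean_a = sum(a) / na
    mean_b = sum(b) / nb
    # Sums of squared deviations from each sample mean.
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean_b - mean_a) / std_err


# The size of the perturbation y does not matter: t stays at 1 (up to rounding).
for y in (1e-100, 1.0, 1e100):
    print(y, pooled_t([0.0] * 4, [0.0, 0.0, 0.0, y]))
```

In this sketch the mean difference is y/n and the standard error works out
to |y|/n, so the ratio has magnitude 1 whatever the size of y.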

Practically speaking, it's a bit of a stretch to assume that the data
generating process for [0,0,0] is (even approximately) normal, so I
think it is appropriate for the test to do some sanity checking.

The t-test itself is only valid given that the underlying data
satisfies the assumptions, and I don't think a constant random
variable meets the requirements.

>
> My arguments that 0 is the correct answer to return are:
>
> 1) if MD = mean difference and PV = pooled variance, then t = MD/PV. As
> MD -> 0 for any PV > 0, t -> 0. However, if we set t = 0/0 = 1 and MD =
> 0, as PV -> 0 we introduce a discontinuity at zero since t = 0 for any
> value of PV except 0, where t = 1. This implies that for MD = PV = 0, t
> = 1, but if the variance of the dataset is infinitesimally small but
> !=0, then t = 0. To me, this makes no sense.
>
> 2) The t-test's purpose is to measure the strength of the evidence
> against the null hypothesis that MD = 0. If MD = 0, as in the case we're
> discussing, by definition there is no evidence that MD != 0. Therefore p
> must = 1, and as a consequence t must = 0.
>
> I also consulted with a statistics professor - who agreed with my
> position - to make sure I wasn't talking out of turn.
>
> In summary, I think that if t = 0/0 we should return 0. However, both R
> and Matlab return NaN. To me this seems incorrect due to argument #2.
> Josef also mentioned there were users of scipy.stats.stats that didn't
> want NaN return values for some reason that I don't know offhand.
>
> How would the Scipy devs like to proceed?
>

Until I see any math or a reference, I think returning NaN is the path
of least resistance.
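
To make the ambiguity concrete, here is a minimal sketch in plain Python of
why 0/0 has no single natural value here. The function t_ratio and its
zero_over_zero parameter are hypothetical, introduced only to compare the
three conventions discussed in this thread (0, 1, NaN). They are not part of
scipy.stats.

```python
import math


def t_ratio(mean_diff, std_err, zero_over_zero=math.nan):
    """Return mean_diff / std_err, with an explicit convention for 0/0.

    Hypothetical helper: zero_over_zero is the value returned when both
    arguments are exactly 0 (NaN by default, as R and Matlab do).
    """
    if std_err == 0.0:
        if mean_diff == 0.0:
            return zero_over_zero
        # Nonzero difference with zero spread: infinite statistic.
        return math.copysign(math.inf, mean_diff)
    return mean_diff / std_err


# Path 1: mean difference fixed at 0, standard error shrinking -> t is always 0.
print([t_ratio(0.0, se) for se in (1.0, 1e-100, 1e-300)])

# Path 2: mean difference equal to the standard error, both shrinking -> t is always 1.
print([t_ratio(se, se) for se in (1.0, 1e-100, 1e-300)])

# At exactly (0, 0) the value depends on the chosen convention.
print(t_ratio(0.0, 0.0), t_ratio(0.0, 0.0, zero_over_zero=0.0))
```

Both paths approach (0, 0) but give different values of t, which is why the
default in this sketch is NaN rather than 0 or 1.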

My $.02,

Skipper