[SciPy-User] Bug t-test for identical means with no variance?

josef.pktd at gmail.com
Fri Jul 8 19:19:55 EDT 2011


On Fri, Jul 8, 2011 at 7:06 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Fri, Jul 8, 2011 at 6:51 PM,  <josef.pktd at gmail.com> wrote:
>> On Fri, Jul 8, 2011 at 6:41 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>> A ticket was filed [1] for ttest_ind (same issue with ttest_rel and
>>> ttest_1samp) in the case of identical means and no variance.
>>>
>>> Same means, no variance
>>>
>>> import numpy as np
>>> from scipy import stats
>>> d1 = np.ones(10)
>>> d2 = np.array([1,1.])
>>> stats.ttest_ind(d1,d2)
>>> (1.0, 0.34089313230206009)
>>>
>>> Different means, no variance
>>>
>>> d1 = np.array([ 5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.])
>>> d2 = np.array([ 2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.])
>>> stats.ttest_ind(d1,d2)
>>> (inf, 0.0)
>>>
>>> The first result doesn't make sense. In the code there are notes about
>>> catching this case that conflict with each other and with what the code does:
>>>
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L2873
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L2963
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L3044
>>>
>>> I think defining t = 0/0 to be 0 is the least wrong thing to do, but
>>> certainly not t = 0/0 as 1, which gives an arbitrary p-value depending
>>> on sample sizes. Is there an accepted definition for this case? Does
>>> returning (nan, 1.0) make more sense?
>>>
>>> Skipper
>>>
>>> [1] http://projects.scipy.org/scipy/ticket/1475
>>
>> See the scipy-dev mailing list thread "changes to stats t-tests" (Dec 20, 2008)
>> for the original change.
>>
>> If anyone finds a justification for the 0/0 case, ....
>>
>
> I have the same intuition as your initial thought. Setting it to 1
> *seems* arbitrary. I'd have to think about it more than I have time for
> right now to come up with any justification, though.
>
> Apologies for not searching and making noise instead,

Noise is fine; even better if someone comes up with a real
justification. I went in circles in my arguments several times.

The main justification is that, given the underlying assumption that
the samples come from normal distributions, the only way we could
observe identical values is if the variance goes to zero and we have a
degenerate normal distribution. After that I was trying to take
different limits, then ...
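
As a rough numerical illustration of that limit (just a sketch; the
fixed noise vectors e1 and e2 below are arbitrary draws): with a common
true mean the t statistic is scale-invariant, so shrinking the variance
does not drive it toward any particular value, while with different
true means it diverges.

import numpy as np
from scipy import stats

e1 = np.random.randn(10)   # fixed noise shapes
e2 = np.random.randn(10)

for s in [1.0, 1e-4, 1e-8]:
    # same true mean, variance -> 0: t and p are scale-invariant
    # (up to rounding), so the "limit" is whatever this particular
    # noise realization happens to give
    print(s, stats.ttest_ind(5 + s * e1, 5 + s * e2))
    # different true means, variance -> 0: t -> +/-inf, p -> 0
    print(s, stats.ttest_ind(2 + s * e1, 5 + s * e2))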

Or suppose the true distribution is normal, but we observe only a
(machine precision) discretized sample, ....

Or suppose we have a large-sample normal approximation to some
discrete data, ... (but we only have 5 observations).
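
Whichever convention wins, the zero-variance case itself is easy to
catch. A minimal sketch (assuming the t = 0, p = 1 convention for 0/0
that Skipper called least wrong; the name ttest_ind_guarded is made up
and this is not what stats.py currently does):

import numpy as np
from scipy import stats

def ttest_ind_guarded(a, b):
    # Two-sample t-test assuming equal variances, special-casing
    # zero pooled variance:
    #   equal sample means, zero variance   -> (0.0, 1.0)
    #   unequal sample means, zero variance -> (+/-inf, 0.0)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a.mean() - b.mean()
    df = a.size + b.size - 2
    svar = ((a.size - 1) * a.var(ddof=1) +
            (b.size - 1) * b.var(ddof=1)) / df
    denom = np.sqrt(svar * (1.0 / a.size + 1.0 / b.size))
    if denom == 0:
        return (0.0, 1.0) if diff == 0 else (np.copysign(np.inf, diff), 0.0)
    t = diff / denom
    return t, 2 * stats.t.sf(abs(t), df)

With the examples above, this gives (0.0, 1.0) in the equal-means case
instead of (1.0, 0.34...), and still (inf, 0.0) in the
different-means case.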

Josef


>
> Skipper


