[SciPy-Dev] Adding t-test with unequal variances to stats.py

Junkshops junkshops at gmail.com
Sun Jun 3 02:13:33 EDT 2012


After some of the recent list mails advising checking results against R, 
I checked this code against R's t.test(). I hadn't bothered 
before since I had checked the t statistic and df values, and they looked 
right, and as I was just calling distributions.t.sf() like every other 
t-test function, I assumed I should be OK. However, it turns out that the 
statistics book I used as a source had the df formula wrong, believe it 
or not.
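
For reference, a minimal sketch of the unequal-variance statistic and its 
degrees of freedom, assuming the df in question is the usual 
Welch-Satterthwaite approximation (the message does not say which formula 
the book gave). The name welch_t_and_df is hypothetical and is not the 
function in the PR; only the standard library is used.

```python
import math
from statistics import mean, variance


def welch_t_and_df(a, b):
    """Return (t, df) for a two-sample t-test with unequal variances.

    Illustrative only: assumes each sample has at least 2 points and
    that the variances are not both zero.
    """
    n1, n2 = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    v1, v2 = variance(a), variance(b)
    # Squared standard error of each sample mean.
    se1, se2 = v1 / n1, v2 / n2
    t = (mean(a) - mean(b)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


t, df = welch_t_and_df([1, 2, 3, 4, 5], [2, 4, 6, 8, 10, 12])
print(t, df)
```

With equal variances and equal sample sizes this df reduces to n1 + n2 - 2, 
the pooled-variance value, which makes a handy sanity check alongside a 
comparison with R's t.test().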

Hence I closed the old PR and opened a new one with a fix and more tests 
here: https://github.com/scipy/scipy/pull/235

Apparently the thing to do is to beg for your PRs to go into the 
upcoming version. I'm certainly willing to do so, but I was wondering if 
the scipy devs respond better to other inveiglements - say, threats of 
world destruction, under-the-table 'gifts', favors of a lascivious 
nature, or cake. All proposals considered.

Cheers, Gavin

On 5/23/2012 12:38 AM, Junkshops wrote:
> Hi all,
>
> I've issued a pull request (http://github.com/scipy/scipy/pull/227) 
> for a version of scipy/stats/stats.py with the following changes:
>
> 1) Adds a method for running a t-test with unequal or unknown 
> population variances. ttest_ind assumes that population variances are 
> equal.
> 2) Refactored common code in the 4 t-test methods into shared methods.
> 3) This section of code, which has variations in multiple methods, 
> looks buggy to me:
>
> d = np.mean(a,axis) - np.mean(b,axis)
> svar = ((n1-1)*v1+(n2-1)*v2) / float(df)
>
> t = d/np.sqrt(svar*(1.0/n1 + 1.0/n2))
> t = np.where((d==0)*(svar==0), 1.0, t) #define t=0/0 = 0, identical means
>
> Surely if d=0, regardless of svar, t should be set to 0, not 1. 
> Similarly, if svar = 0 then both variances are zero (assuming that 
> each data set has at least 2 points - perhaps there should be a check 
> for this?). In that case, if d==0 t should be zero. Otherwise, t 
> should be +/-inf. Hence, (svar==0) is redundant.
>
> Accordingly, I've changed the lines in all functions to be the 
> equivalent of
>
> t = np.where((d==0), 0.0, t)
>
> This handles the case where both d and svar are 0. The respective 
> tests have also been changed.
>
> If I'm missing something here, please let me know.
>
> Thanks, Gavin
>
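
The zero-difference handling described in the quoted message can be 
sketched as follows. This is an illustration under assumptions, not the 
code from either PR: pooled_t is a hypothetical name, the inputs are 
assumed to have at least 2 points along the axis, and np.errstate is used 
only to silence the divide-by-zero warnings.

```python
import numpy as np


def pooled_t(a, b, axis=0):
    """Equal-variance t statistic with t forced to 0 wherever d == 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.shape[axis], b.shape[axis]
    v1 = np.var(a, axis, ddof=1)
    v2 = np.var(b, axis, ddof=1)
    df = n1 + n2 - 2

    d = np.mean(a, axis) - np.mean(b, axis)
    svar = ((n1 - 1) * v1 + (n2 - 1) * v2) / float(df)

    # 0/0 gives nan and d/0 gives +/-inf here; suppress the warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        t = d / np.sqrt(svar * (1.0 / n1 + 1.0 / n2))
    # Identical means give t = 0 whether or not svar is also 0.
    return np.where(d == 0, 0.0, t)


# Columns: (d=0, svar=0), (d=-1, svar=0), (d=0, svar>0)
a = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 3.0]])
b = np.array([[1.0, 2.0, 1.0], [1.0, 2.0, 3.0]])
print(pooled_t(a, b))
```

The first and third columns come out as 0 and the second as -inf, matching 
the cases listed in the message: d == 0 always yields 0, and svar == 0 with 
a nonzero d yields +/-inf.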


