[SciPy-User] stats.chisquare issues

Bruce Southey bsouthey at gmail.com
Mon Sep 27 15:41:36 EDT 2010


  On 09/26/2010 03:17 PM, josef.pktd at gmail.com wrote:
> On Sun, Sep 26, 2010 at 3:02 PM, Gökhan Sever<gokhansever at gmail.com>  wrote:
>> Hello,
>> Consider these examples:
>> I[35]: np.histogram(ydata, bins=6)
>> O[35]:
>> (array([4, 1, 3, 0, 0, 1]),
>>   array([   2.8       ,  146.33333333,  289.86666667,  433.4       ,
>>          576.93333333,  720.46666667,  864.        ]))
>> I[36]: np.histogram(ypred, bins=6)
>> O[36]:
>> (array([4, 2, 2, 0, 0, 1]),
>>   array([  22.08895   ,  166.34439167,  310.59983333,  454.855275  ,
>>          599.11071667,  743.36615833,  887.6216    ]))
>> I[45]: stats.chisquare([4, 1, 3, 0, 0, 1], [4, 2, 2, 0, 0, 1])
>> ---------------------------------------------------------------------------
>> AttributeError                            Traceback (most recent call last)
>> /home/gsever/Desktop/<ipython console>  in<module>()
>> /usr/lib/python2.6/site-packages/scipy/stats/stats.pyc in chisquare(f_obs, f_exp, ddof)
>>     2516     if f_exp is None:
>>     2517         f_exp = array([np.sum(f_obs,axis=0)/float(k)] * len(f_obs),float)
>> ->  2518     f_exp = f_exp.astype(float)
>>     2519     chisq = np.add.reduce((f_obs-f_exp)**2 / f_exp)
>>     2520     return chisq, chisqprob(chisq, k-1-ddof)
>> AttributeError: 'list' object has no attribute 'astype'
>> Here, I expect any scipy function including chisquare should be able to
>> handle lists???
>> ############################################
>> This one throws:
>> I[46]: stats.chisquare(np.array([4, 1, 3, 0, 0, 1]), np.array([4, 2, 2, 0, 0, 1]))
>> O[46]: (nan, nan)
>> again I should be aware since the division has 0 in it.
>> after masking:
>> I[47]: a1 = np.ma.masked_equal([4,1,3,0,0,1], 0)
>> I[48]: a2 = np.ma.masked_equal([4,2,2,0,0,1], 0)
>> Further,
>> I[49]: stats.chisquare(a1, a2)
>> O[49]: (1.0, 0.96256577324729631)
>> I[50]: stats.mstats.chisquare(a1, a2)
>> O[50]: (1.0, 0.80125195690120077)
> masking doesn't remove the values, so when you have a masked array,
> then you should use compressed or similar
>
> dropping the zero bins

For masked array inputs you should use the masked version of
chisquare() in stats.mstats. However, hiding zeros is only correct for
bins where both the observed and the expected counts equal zero.
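
As a rough sketch of what I mean (the variable names are mine, and this
assumes you want to drop bins rather than mask them), you can remove only
the bins where both counts are zero and pass plain arrays to
stats.chisquare():

```python
import numpy as np
from scipy import stats

f_obs = np.array([4, 1, 3, 0, 0, 1])
f_exp = np.array([4, 2, 2, 0, 0, 1])

# Keep every bin except those where observed AND expected are both zero.
# A bin with expected == 0 but observed != 0 is kept on purpose, so the
# problem stays visible instead of being silently hidden.
keep = ~((f_obs == 0) & (f_exp == 0))

chisq, p = stats.chisquare(f_obs[keep], f_exp[keep])
print(chisq, p)
```

For these counts the remaining bins are [4, 1, 3, 1] and [4, 2, 2, 1],
which is the same input as Josef's example below.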

>>>> stats.chisquare(np.array([4, 1, 3, 1.]),np.array([4, 2, 2, 1.]))
> (1.0, 0.80125195690120077)
>
> Not accepting list is a bug

It is not a bug, because the docstring says arrays, not array-like.
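
Until lists are accepted, converting them yourself is enough. A minimal
sketch (as_float_arrays is a made-up helper name, not a scipy function)
that also evaluates the same statistic shown in the traceback:

```python
import numpy as np

def as_float_arrays(f_obs, f_exp):
    """Convert list-like counts to float ndarrays (hypothetical helper)."""
    return np.asarray(f_obs, dtype=float), np.asarray(f_exp, dtype=float)

f_obs, f_exp = as_float_arrays([4, 1, 3, 1], [4, 2, 2, 1])

# Same statistic as in stats.py: sum((f_obs - f_exp)**2 / f_exp)
chisq = np.sum((f_obs - f_exp) ** 2 / f_exp)
print(chisq)
```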


> Returning nans in the case when you expect  zero in a bin might be by
> design. But we need to check this.
>
>>>> stats.chisquare(np.array([4, 1, 3, 1.]),np.array([4, 2, 0, 1.]))
> (inf, nan)

This is correct, since the expected value for a cell is zero (which
results in division by zero). You cannot use the chi-square test in
this situation. You might be able to get the Fisher exact test (see
ticket 956 http://projects.scipy.org/scipy/ticket/956) to work here.
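
If you want to catch this before calling the test, a guard along these
lines would do. This is only a sketch: check_expected_counts is a
hypothetical helper, not something in scipy.

```python
import numpy as np

def check_expected_counts(f_obs, f_exp):
    """Raise if any bin has expected == 0 while observed != 0.

    Hypothetical helper: such bins make (f_obs - f_exp)**2 / f_exp
    divide by zero, so the chi-square test does not apply.
    """
    f_obs = np.asarray(f_obs, dtype=float)
    f_exp = np.asarray(f_exp, dtype=float)
    bad = np.flatnonzero((f_exp == 0) & (f_obs != 0))
    if bad.size:
        raise ValueError("expected count is zero in bin(s) %s" % bad.tolist())
    return f_obs, f_exp

check_expected_counts([4, 1, 3, 1.], [4, 2, 2, 1.])    # passes
try:
    check_expected_counts([4, 1, 3, 1.], [4, 2, 0, 1.])
except ValueError as err:
    print(err)
```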

If you are doing something like density estimation, then you probably
need to select your bins (especially in the tails) more carefully to
keep this from happening.
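
One possible way to do that, sketched under my own assumptions (the
function name merge_sparse_bins and the min_expected threshold are made
up for illustration), is to pool adjacent bins until each pooled bin has
a large enough expected count:

```python
import numpy as np

def merge_sparse_bins(f_obs, f_exp, min_expected=1.0):
    """Pool adjacent bins, left to right, until each pooled bin has an
    expected count of at least min_expected (hypothetical helper)."""
    f_obs = np.asarray(f_obs, dtype=float)
    f_exp = np.asarray(f_exp, dtype=float)
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for o, e in zip(f_obs, f_exp):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    # Any leftover tail is folded into the last pooled bin.
    if acc_obs or acc_exp:
        if merged_obs:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return np.array(merged_obs), np.array(merged_exp)

obs, exp = merge_sparse_bins([4, 1, 3, 0, 0, 1], [4, 2, 2, 0, 0, 1])
print(obs, exp)
```

Pooling changes the number of bins, so the degrees of freedom of the
test change with it.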

Bruce



