[SciPy-dev] Bug/Error with chi-squared distribution and df<1

Tue Sep 22 14:27:48 EDT 2009

On Tue, Sep 22, 2009 at 2:04 PM, Kevin Dunn <kgdunn at gmail.com> wrote:
> Thanks for the prompt reply Josef; please see my comments below.
>
>> On Tue, Sep 22, 2009 at 10:59 AM, Kevin Dunn <kgdunn at gmail.com> wrote:
>>> Hi there,
>>>
>>> I'm not an expert on distributions, but as far as I can tell, the chi2
>>> distribution is defined for degrees of freedom >0. ?I'm getting "nan"
>>> results however when df is less than one (but still greater than 0).
>>> The chi2 CDF values agree between R, MATLAB and Scipy when the the
>>> degrees of freedom are >= 1. ?For example:
>>> * R: pchisq(0.95, 1)
>>> * MATLAB: chi2cdf(0.95, 1)
>>> * SciPy: scipy.stats.chi2.cdf(0.95, 1)
>>>
>>
>>
>>> However, changing the last argument to 0.5 returns a NaN in SciPy, but
>>> gives a result (0.8392259) in R and MATLAB.
>>
>>>>> stats.chi2.veccdf(0.95, 0.5)
>> array(0.83922587961194761)
>>
>>>
>>> I'm suspecting there is something wrong with my SciPy installation,
>>> because the example included with SciPy, (found using
>>> scipy.stats.chi2? inside iPython) calls the chi2.cdf function with
>>> df=0.9, yet when I run that example as follows:
>>
>> Where is this example? maybe it is an autogenerated example with wrong numbers?
>
> The example was found when typing "scipy.stats.chi2?" in ipython after
> importing scipy.stats.  The code below was taken from the example,
> only I omitted the matplotlib plotting commands.
>
>>> from scipy.stats import chi2
>>> import numpy as np
>>> numargs = chi2.numargs
>>> [ df ] = [0.9,]*numargs
>>> rv = chi2(df)
>>> x = np.linspace(0,np.minimum(rv.dist.b,3))
>>> prb = chi2.cdf(x,df)

Yes, this is a generic template with autofilled names, obviously not
tested if the numbers make sense in all cases.

>>>
>>> My result for prb is as follows; which I don't think would have been
>>> used as an example if this is the expected output.
>>> array([ ?0., ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN,
>>> ? ? ? ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN,
>>> ? ? ? ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN,
>>> ? ? ? ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN,
>>> ? ? ? ?NaN, ?NaN, ?NaN, ?NaN, ?NaN, ?NaN])
>>>
>>> Is my SciPy install messed up? ?I'm using MacOSX and downloaded
>>> version 0.7.1 from SourceForge this morning. ?I just tried it in
>>> Ubuntu Linux (version 0.7.0 though), and get the same results.
>>>
>>
>> chi2 uses scipy.special
>>
>> class chi2_gen
>>
>> def _cdf(self, x, df):
>>        return special.chdtr(df, x)
>>
>> which obviously cannot handle df<1 , which I don't know if it would
>> ever show up in the usual statistical tests.
>
> I don't believe it is obvious why special.chdtr can't handle values df
> < 1. Please see my additional comments below.
>
>>
>> pdf doesn't seem to have any problems
>>
>>>>> stats.chi2.pdf(np.arange(5),0.99)
>> array([        Inf,  0.24042373,  0.10275665,  0.05078513,  0.02663761])
>>>>> stats.chi2.pdf(np.arange(5),0.5)
>> array([        Inf,  0.14067411,  0.05073346,  0.02270277,  0.01109756])
>>>>> stats.chi2.pdf(np.arange(5),0.25)
>> array([        Inf,  0.07382471,  0.02441481,  0.01038547,  0.00489731])
>>
>> so numerical integration should also work
>>
>>>>> stats.chi2.veccdf(np.arange(1,5),0.25)
>> array([ 0.92605422,  0.96918041,  0.98539847,  0.99265551])
>>>>> stats.chi2.veccdf(np.linspace(0.01,10.0,11),0.25)
>> array([ 0.54726537,  0.92671456,  0.96937499,  0.98547096,  0.99268483,
>>        0.99618434,  0.99796201,  0.99889274,  0.99939058,  0.99966116,
>>        0.99981006])
>>>>> stats.chi2.veccdf(np.linspace(0.01,10.0,11),0.5)
>> array([ 0.29308089,  0.84774539,  0.93248332,  0.96674206,  0.98278044,
>>        0.99081467,  0.99500114,  0.99723983,  0.9984591 ,  0.99913231,
>>        0.99950797])
>>
>> since pdf at zero is inf, I don't know how good the numbers are if the
>> cdf is calculated for points close to zero
>> e.g.
>>
>>>>> stats.chi2.veccdf(1e-8,0.5)
>> array(0.0092772960765327081)
>>
>>
>> Since I don't think df<1 is a common case, I wouldn't want to switch
>> to numerical integration by default. But veccdf although only a
>> private function should work although it bypasses some of the checking
>> and broadcasting.
>
> Thanks for pointing out the numerical integration approach; 'll give
> it a try.  As for df < 1, please see my other comments below.
>
>>
>> Somehow this sounds familiar, I need to check whether this or a
>> similar case is already on record.
>>
>> Hope that helps,
>>
>> Josef
>
> And regarding your other reply:
>
>> I didn't find anything on the scipy trac nor on the mailinglist, since
>> I'm subscribed to them. If you think that df<1 is an important case
>> you could file a ticket for scipy.special. I have no idea how
>> difficult it would be to extend the chi squared related functions for
>> this or whether there exists another workaround.
>
> I just downloaded the latest SVN code and found the code that does the
> work is /scipy/special/cephes/chdtr.c
>
> In that code it seems that when the degrees of freedom are less than
> one, the code is (artificially) forced to return a NaN:
>        if (df < 1.0)
>        {
>            mtherr( "chdtrc", DOMAIN );
>            return(NPY_NAN);
>        }
>       return( igamc( df/2.0, x/2.0 ) );
>
> This seems a somewhat artificial constraint to me, since the gamma
> function can accept values of 0 < df < 1.  Does someone else on this
> list know why that constraint is there for df<1?

I don't know the c code, but many of the statistical functions, have
duplicate ways of calculation with scipy special.

Taking the hint with incomplete gamma, the following looks good. This
would mean until Pauli fixes scipy.special if it your fix works, we
could also use gammainc directly. I don't know the differences between
the various implementations well enough to see whether we buy some
other problems with this

>>> df=2;x=1.5;special.gammainc(df/2., x/2.)
0.52763344725898531
>>> df=0.5;x=1.5;special.gammainc(df/2., x/2.)
0.89993651328449831

>>> stats.chi2.cdf(x,df)
nan
>>> stats.chi2.cdf(1.5,2)
0.52763344725898531
>>> stats.chi2.veccdf(1.5,2)
array(0.52763344725898531)
>>> stats.chi2.veccdf(1.5,0.5)
array(0.89993651328445579)
>>> stats.chi2.veccdf(1.5,0.5) - special.gammainc(0.5/2., 1.5/2.)
-4.2521541843143495e-014

I'm used to the chi square distribution as a statistical test
distribution, but of course if it is treated just as a distribution
that is matched to data, then these limitations (and nans) are not
very useful.

Thanks for looking into this.

Josef

>
> As for a practical usage case, I'm computing CDF values for a variable
> by matching moments, and the value for the degrees of freedom term is
> a computed value.  This value is always non-integer, and can sometimes
> be less than one.
>
> Additional justification for df < 1 can be sought by looking at plots
> of the chi-squared distribution (e.g.
> http://en.wikipedia.org/wiki/Chi-square_distribution) when plotting
> with a degrees of freedom parameter (they call it "k" in the Wikipedia
> article).  There's no reason why df can't be less than one, anymore
> than there's reason for df being non-integer.
>
> I'll give it a go compiling SciPy with removing those constraints in
> chdtr.c and see how it works.  I've never done this before, but I'm
> busy installing the NumPy/SciPy installation requirements at the
> moment, and I'll report how it went.
>
> If things works out, I'll file a trac report.  I've also cc'd the
> scipy-dev list on this reply.
>
> Thanks for your help,
> Kevin
>
>> The only similar story was  for the negative binomial for n<1,
>> http://projects.scipy.org/scipy/ticket/978 . In that case
>> scipy.special.nbdtr didn't allow the extension for n<1.
>>
>> Josef
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>