[SciPy-Dev] Fitting distributions [Was Re: Warnings raised (from fit in scipy.stats)]

Skipper Seabold jsseabold at gmail.com
Fri Jun 11 17:11:13 EDT 2010


On Fri, Jun 11, 2010 at 2:49 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> On Fri, Jun 11, 2010 at 12:20 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>> On Fri, Jun 11, 2010 at 2:17 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>> On Fri, Jun 11, 2010 at 11:34 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>> On Fri, Jun 11, 2010 at 1:07 PM,  <josef.pktd at gmail.com> wrote:
>>>>> On Fri, Jun 11, 2010 at 12:45 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>> Since the warning-raising behavior has been changed (I believe), I
>>>>>> have been running into a lot of warnings in my code when, say, I do
>>>>>> something like
>>>>>>
>>>>>> In [120]: from scipy import stats
>>>>>>
>>>>>> In [121]: y = [-45, -3, 1, 0, 1, 3]
>>>>>>
>>>>>> In [122]: v = stats.norm.pdf(y)/stats.norm.cdf(y)
>>>>>> Warning: invalid value encountered in divide
>>>>>>
>>>>>> Sometimes this is useful to know.  Sometimes, though, it's very
>>>>>> disruptive when it's encountered inside some kind of iteration or
>>>>>> optimization.  I have been using numpy.clip to get around this in my
>>>>>> own code, but when it's buried a bit deeper, it's not quite so simple.
>>>>>>
>>>>>> Take this example.
>>>>>>
>>>>>> In [123]: import numpy as np
>>>>>>
>>>>>> In [124]: np.random.seed(12345)
>>>>>>
>>>>>> In [125]: B = 6.0
>>>>>>
>>>>>> In [126]: x = np.random.exponential(scale=B, size=5000)
>>>>>>
>>>>>> In [127]: from scipy.stats import expon
>>>>>>
>>>>>> In [128]: expon.fit(x)
>>>>>>
>>>>>> <dozens of warnings clipped>
>>>>>>
>>>>>> Out[128]: (0.21874043533906118, 5.7122829778172939)
>>>>>>
>>>>>> The fit is achieved by fmin (as far as I know, since disp=0 in the
>>>>>> rv_continuous.fit...), but a number of warnings are emitted.  Is
>>>>>> there any middle ground to be had in this type of situation via
>>>>>> context management, perhaps?  (An errstate sketch follows the quoted
>>>>>> thread below.)
>>>>>>
>>>>>> Should I file a ticket?
>>>>>
>>>>> Which numpy and scipy versions are you using?
>>>
>>> I get no warnings:
>>>>>> import numpy as np
>>>>>> np.random.seed(12345)
>>>>>> B = 6.0
>>>>>> x = np.random.exponential(scale=B, size=5000)
>>>>>> from scipy.stats import expon
>>>>>> expon.fit(x)
>>> array([  6.43573559e-04,   5.93058867e+00])
>>>
>>
>> You also get different values than I do, which shouldn't be the case.
>>
>> I just discovered that my expon.fit(x) just returns the first and
>> second moments of the data (even when I set floc = 0, I still get the
>> second moment), and I am trying to figure out why.  It looks like
>> something is amiss.
>
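
On the context-management question quoted above: numpy's errstate works
as a context manager, so the floating-point warnings can be silenced
just around the offending expression instead of clipping the inputs.  A
minimal sketch, assuming the messages come from the usual invalid/divide
floating-point errors controlled by np.seterr:

import numpy as np
from scipy import stats

y = [-45, -3, 1, 0, 1, 3]

# ignore only the 'invalid' and 'divide' floating-point errors inside
# this block; the previous error state is restored on exit
with np.errstate(invalid='ignore', divide='ignore'):
    v = stats.norm.pdf(y) / stats.norm.cdf(y)

Whether the message is printed or raised as a warning at all still
depends on the numpy version in use, which is why the same commands can
be silent for Vincent above.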

So maybe I am missing something (quite likely), but the reason
expon.fit(x) (silently) doesn't work in the example above is that
expon.nnlf returns inf for the default starting loc.

Should fit_loc_scale be overridden for expon?

In [60]: expon.fit_loc_scale(x)
Out[60]: (0.21874043533906118, 5.7122829778172939)

In [61]: expon.nnlf(expon.fit_loc_scale(x),x)
Out[61]: inf
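
A quick way to see where that inf comes from (this just re-creates the
sample from the session above): the starting loc from fit_loc_scale
sits above the smallest observations, and the exponential density is
zero below loc, so the summed -log(pdf) in nnlf is infinite.

import numpy as np
from scipy.stats import expon

np.random.seed(12345)
x = np.random.exponential(scale=6.0, size=5000)   # same sample as above

loc0, scale0 = expon.fit_loc_scale(x)             # default starting values

# number of observations lying below the starting loc
print((x < loc0).sum())
# the density there is 0, so -log(pdf) is inf and nnlf returns inf
print(expon.pdf(x[x < loc0], loc=loc0, scale=scale0))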

Changing the fmin call to disp=1 gives

In [62]: expon.fit(x)

<snip all the warnings, except from the solver>

Warning: Maximum number of function evaluations has been exceeded.
Out[62]: (0.21874043533906118, 5.7122829778172939)

The default starting loc is defined (for this case) as

loc = x.mean() - x.std()

so any x <= loc falls outside the support of the exponential
distribution once the data are shifted by loc inside nnlf, and the
negative log-likelihood comes back as inf.
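
Whether or not fit_loc_scale gets specialised for expon, starting
values that keep every observation strictly inside the support already
let the generic machinery work.  A sketch of that idea (the particular
offset below the minimum is arbitrary, and it assumes fit accepts scale
as a starting value the same way it accepts loc):

import numpy as np
from scipy.stats import expon

np.random.seed(12345)
x = np.random.exponential(scale=6.0, size=5000)

# start loc strictly below the smallest observation (strictly, because
# nnlf treats x <= loc as outside the support), with a matching scale
loc_start = x.min() - 0.01 * x.std()
scale_start = x.mean() - loc_start

print(expon.nnlf((loc_start, scale_start), x))   # finite now, not inf
print(expon.fit(x, loc=loc_start, scale=scale_start))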

But

In [63]: expon.fit(x,loc=0)
Optimization terminated successfully.
         Current function value: 13900.441325
         Iterations: 59
         Function evaluations: 107
Out[63]: (0.00064357307755842945, 5.9303783425183241)
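
For comparison, the shifted exponential has a closed-form MLE: loc at
the sample minimum and scale equal to the sample mean minus that
minimum, so both this result and the values Vincent got above should be
close to:

import numpy as np

np.random.seed(12345)
x = np.random.exponential(scale=6.0, size=5000)

# closed-form maximum-likelihood estimates for a shifted exponential
print(x.min())             # MLE of loc
print(x.mean() - x.min())  # MLE of scale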

Skipper


