[SciPy-Dev] Fitting distributions [Was Re: Warnings raised (from fit in scipy.stats)]

Fri Jun 11 17:47:28 EDT 2010

On Fri, Jun 11, 2010 at 5:42 PM,  <josef.pktd at gmail.com> wrote:
> On Fri, Jun 11, 2010 at 5:11 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>> On Fri, Jun 11, 2010 at 2:49 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>> On Fri, Jun 11, 2010 at 12:20 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>> On Fri, Jun 11, 2010 at 2:17 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>>>> On Fri, Jun 11, 2010 at 11:34 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>> On Fri, Jun 11, 2010 at 1:07 PM,  <josef.pktd at gmail.com> wrote:
>>>>>>> On Fri, Jun 11, 2010 at 12:45 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>>> Since the raising of warning behavior has been changed (I believe), I
>>>>>>>> have been running into a lot of warnings in my code when say I do
>>>>>>>> something like
>>>>>>>>
>>>>>>>> In [120]: from scipy import stats
>>>>>>>>
>>>>>>>> In [121]: y = [-45, -3, 1, 0, 1, 3]
>>>>>>>>
>>>>>>>> In [122]: v = stats.norm.pdf(y)/stats.norm.cdf(y)
>>>>>>>> Warning: invalid value encountered in divide
>>>>>>>>
>>>>>>>> Sometimes, this is useful to know.  Sometimes, though, it's very
>>>>>>>> disturbing when it's encountered in some kind of iteration or
>>>>>>>> optimization.  I have been using numpy.clip to get around this in my
>>>>>>>> own code, but when it's buried a bit deeper, it's not quite so simple.
>>>>>>>>
>>>>>>>> Take this example.
>>>>>>>>
>>>>>>>> In [123]: import numpy as np
>>>>>>>>
>>>>>>>> In [124]: np.random.seed(12345)
>>>>>>>>
>>>>>>>> In [125]: B = 6.0
>>>>>>>>
>>>>>>>> In [126]: x = np.random.exponential(scale=B, size=5000)
>>>>>>>>
>>>>>>>> In [127]: from scipy.stats import expon
>>>>>>>>
>>>>>>>> In [128]: expon.fit(x)
>>>>>>>>
>>>>>>>> <dozens of warnings clipped>
>>>>>>>>
>>>>>>>> Out[128]: (0.21874043533906118, 5.7122829778172939)
>>>>>>>>
>>>>>>>> The fit is achieved by fmin (as far as I know, since disp=0 in the
>>>>>>>> rv_continuous.fit...), but there are a number of warnings emitted.  Is
>>>>>>>> there any middle ground to be had in these type of situations via
>>>>>>>> context management perhaps?
>>>>>>>>
>>>>>>>> Should I file a ticket?
>>>>>>>
>>>>>>> Which numpy scipy versions are you using?
>>>>>
>>>>> I get know warnings
>>>>>>>> import numpy as np
>>>>>>>> np.random.seed(12345)
>>>>>>>> B = 6.0
>>>>>>>> x = np.random.exponential(scale=B, size=5000)
>>>>>>>> from scipy.stats import expon
>>>>>>>> expon.fit(x)
>>>>> array([  6.43573559e-04,   5.93058867e+00])
>>>>>
>>>>
>>>> You also get different values than I do, which shouldn't be the case.
>>>>
>>>> I just discovered that my expon.fit(x) just returns the first and
>>>> second moments of the data (even when I set floc = 0, I still get the
>>>> second moment), and I am trying to figure out why.  It looks like
>>>> something is amiss.
>>>
>>
>> So maybe I am missing something (quite likely), but the reason that
>> the expon.fit(x), (silently) doesn't work in the above is that
>> expon.nnlf returns inf for the default start loc.
>>
>> Should fit_loc_scale be overwritten for expon?
>>
>> In [60]: expon.fit_loc_scale(x)
>> Out[60]: (0.21874043533906118, 5.7122829778172939)
>>
>> In [61]: expon.nnlf(expon.fit_loc_scale(x),x)
>> Out[61]: inf
>>
>> Changing fmin to disp=1, gives
>>
>> In[62]: expon.fit(x)
>>
>> <snip all the warnings, except from the solver>
>>
>> Warning: Maximum number of function evaluations has been exceeded.
>> Out[62]: (0.21874043533906118, 5.7122829778172939)
>>
>> The default loc is defined (for this case as)
>>
>> loc = x.mean() - x.std()
>>
>> so for any x <= loc it is outside of the domain of the exponential
>> distribution when it gets centered in nnlf.
>>
>> But
>>
>> In [63]: expon.fit(x,loc=0)
>> Optimization terminated successfully.
>>         Current function value: 13900.441325
>>         Iterations: 59
>>         Function evaluations: 107
>> Out[63]: (0.00064357307755842945, 5.9303783425183241)
>
> I haven't looked yet in details at the new fit method.
>
> But for many distributions, especially those with a finite boundary,
> fit works only if the starting values are well chosen. In this case, I
> would set the starting value for loc at (x.min() - a little bit),
> unless loc is frozen at zero.
>

Yeah that's what I settled on, though it wasn't obvious (to me) and
took some digging to get to the bottom of.

> I didn't check where the default starting value for expon has been changed.
>

It's inherited from rv_continuous.

Skipper