[SciPy-Dev] Fitting distributions [Was Re: Warnings raised (from fit in scipy.stats)]

Skipper Seabold jsseabold at gmail.com
Fri Jun 11 18:03:59 EDT 2010


On Fri, Jun 11, 2010 at 5:58 PM,  <josef.pktd at gmail.com> wrote:
> On Fri, Jun 11, 2010 at 5:47 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>> On Fri, Jun 11, 2010 at 5:42 PM,  <josef.pktd at gmail.com> wrote:
>>> On Fri, Jun 11, 2010 at 5:11 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>> On Fri, Jun 11, 2010 at 2:49 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>>>> On Fri, Jun 11, 2010 at 12:20 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>> On Fri, Jun 11, 2010 at 2:17 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>>>>>> On Fri, Jun 11, 2010 at 11:34 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>>> On Fri, Jun 11, 2010 at 1:07 PM,  <josef.pktd at gmail.com> wrote:
>>>>>>>>> On Fri, Jun 11, 2010 at 12:45 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>>>>> Since the warning-raising behavior has been changed (I believe), I
>>>>>>>>>> have been running into a lot of warnings in my code when, say, I do
>>>>>>>>>> something like
>>>>>>>>>>
>>>>>>>>>> In [120]: from scipy import stats
>>>>>>>>>>
>>>>>>>>>> In [121]: y = [-45, -3, 1, 0, 1, 3]
>>>>>>>>>>
>>>>>>>>>> In [122]: v = stats.norm.pdf(y)/stats.norm.cdf(y)
>>>>>>>>>> Warning: invalid value encountered in divide
>>>>>>>>>>
>>>>>>>>>> Sometimes, this is useful to know.  Sometimes, though, it's very
>>>>>>>>>> disturbing when it's encountered in some kind of iteration or
>>>>>>>>>> optimization.  I have been using numpy.clip to get around this in my
>>>>>>>>>> own code, but when it's buried a bit deeper, it's not quite so simple.
>>>>>>>>>>
>>>>>>>>>> Take this example.
>>>>>>>>>>
>>>>>>>>>> In [123]: import numpy as np
>>>>>>>>>>
>>>>>>>>>> In [124]: np.random.seed(12345)
>>>>>>>>>>
>>>>>>>>>> In [125]: B = 6.0
>>>>>>>>>>
>>>>>>>>>> In [126]: x = np.random.exponential(scale=B, size=5000)
>>>>>>>>>>
>>>>>>>>>> In [127]: from scipy.stats import expon
>>>>>>>>>>
>>>>>>>>>> In [128]: expon.fit(x)
>>>>>>>>>>
>>>>>>>>>> <dozens of warnings clipped>
>>>>>>>>>>
>>>>>>>>>> Out[128]: (0.21874043533906118, 5.7122829778172939)
>>>>>>>>>>
>>>>>>>>>> The fit is achieved by fmin (as far as I know, since disp=0 in the
>>>>>>>>>> rv_continuous.fit...), but there are a number of warnings emitted.  Is
>>>>>>>>>> there any middle ground to be had in these types of situations via
>>>>>>>>>> context management, perhaps?
>>>>>>>>>>
>>>>>>>>>> Should I file a ticket?
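
One possible middle ground is numpy's errstate context manager. A minimal
sketch of that and of the clip workaround mentioned above (the 1e-300
floor is an arbitrary choice, and this assumes the warnings come from the
floating-point division itself):

import numpy as np
from scipy import stats

y = np.array([-45, -3, 1, 0, 1, 3])

# Silence the floating-point warnings only inside this block; the
# previous error settings are restored on exit.
with np.errstate(invalid='ignore', divide='ignore'):
    v = stats.norm.pdf(y) / stats.norm.cdf(y)

# The clip workaround: keep the denominator away from zero.
v2 = stats.norm.pdf(y) / np.clip(stats.norm.cdf(y), 1e-300, 1.0)
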
>>>>>>>>>
>>>>>>>>> Which numpy and scipy versions are you using?
>>>>>>>
>>>>>>> I get no warnings:
>>>>>>>>>> import numpy as np
>>>>>>>>>> np.random.seed(12345)
>>>>>>>>>> B = 6.0
>>>>>>>>>> x = np.random.exponential(scale=B, size=5000)
>>>>>>>>>> from scipy.stats import expon
>>>>>>>>>> expon.fit(x)
>>>>>>> array([  6.43573559e-04,   5.93058867e+00])
>>>>>>>
>>>>>>
>>>>>> You also get different values than I do, which shouldn't be the case.
>>>>>>
>>>>>> I just discovered that my expon.fit(x) simply returns the first and
>>>>>> second moments of the data (even when I set floc = 0, I still get the
>>>>>> second moment), and I am trying to figure out why.  It looks like
>>>>>> something is amiss.
>>>>>
>>>>
>>>> So maybe I am missing something (quite likely), but the reason that
>>>> expon.fit(x) (silently) doesn't work above is that expon.nnlf returns
>>>> inf for the default starting loc.
>>>>
>>>> Should fit_loc_scale be overridden for expon?
>>>>
>>>> In [60]: expon.fit_loc_scale(x)
>>>> Out[60]: (0.21874043533906118, 5.7122829778172939)
>>>>
>>>> In [61]: expon.nnlf(expon.fit_loc_scale(x),x)
>>>> Out[61]: inf
>>>>
>>>> Changing disp=0 to disp=1 in the fmin call gives
>>>>
>>>> In [62]: expon.fit(x)
>>>>
>>>> <snip all the warnings, except from the solver>
>>>>
>>>> Warning: Maximum number of function evaluations has been exceeded.
>>>> Out[62]: (0.21874043533906118, 5.7122829778172939)
>>>>
>>>> The default loc is defined (for this case) as
>>>>
>>>> loc = x.mean() - x.std()
>>>>
>>>> so any x <= loc falls outside the support of the exponential
>>>> distribution once the data are shifted by loc in nnlf.
>>>>
>>>> But
>>>>
>>>> In [63]: expon.fit(x,loc=0)
>>>> Optimization terminated successfully.
>>>>         Current function value: 13900.441325
>>>>         Iterations: 59
>>>>         Function evaluations: 107
>>>> Out[63]: (0.00064357307755842945, 5.9303783425183241)
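
A small sketch of that diagnosis, reusing the seed from above and the same
nnlf call as in In [61] (exact counts depend on the sample):

import numpy as np
from scipy.stats import expon

np.random.seed(12345)
x = np.random.exponential(scale=6.0, size=5000)

# The generic starting values: loc = mean - std, scale = std.
loc0, scale0 = x.mean() - x.std(), x.std()

# Some observations fall at or below loc0, so after shifting by loc they
# lie outside the exponential's support, the negative log-likelihood is
# inf, and fmin cannot make progress from such a start.
print((x <= loc0).sum())              # > 0 for this sample
print(expon.nnlf((loc0, scale0), x))  # inf
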
>>>
>>> I haven't looked yet in details at the new fit method.
>>>
>>> But for many distributions, especially those with a finite boundary,
>>> fit works only if the starting values are well chosen. In this case, I
>>> would set the starting value for loc at (x.min() - a little bit),
>>> unless loc is frozen at zero.
>>>
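
For what it's worth, a sketch of that suggestion with the same data as
above (the 1e-4 offset is an arbitrary "little bit"):

import numpy as np
from scipy.stats import expon

np.random.seed(12345)
x = np.random.exponential(scale=6.0, size=5000)

# Start loc just below the sample minimum ...
print(expon.fit(x, loc=x.min() - 1e-4))

# ... or freeze loc at zero if the support is known a priori.
print(expon.fit(x, floc=0))
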
>>
>> Yeah, that's what I settled on, though it wasn't obvious (to me) and
>> took some digging to get to the bottom of.
>>
>>> I didn't check where the default starting value for expon has been changed.
>>>
>>
>> It's inherited from rv_continuous.
>
> If you need starting values for more distributions, I have a script
> where I categorize all distributions by whether the support is open,
> left-bounded, right-bounded, bounded or ambiguous, and choose
> appropriate starting values.
>

Could it at some point be used within the distributions for
distribution-specific starting values?
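
As a rough illustration of what support-based starting values could look
like, using only the generic .a/.b support attributes of rv_continuous
(start_loc is a hypothetical helper, not the script mentioned above):

import numpy as np

def start_loc(dist, x, eps=1e-4):
    # Pick a starting loc from the standardized support bounds dist.a
    # and dist.b of an rv_continuous instance.
    if np.isinf(dist.a) and np.isinf(dist.b):
        # support is the whole real line: the generic start is fine
        return x.mean() - x.std()
    if np.isfinite(dist.a) and np.isinf(dist.b):
        # left-bounded support: keep all observations strictly inside
        # (this assumes dist.a == 0, as for expon, gamma, lognorm, ...)
        return x.min() - eps
    if np.isinf(dist.a) and np.isfinite(dist.b):
        # right-bounded support, with the analogous caveat
        return x.max() + eps
    # bounded or ambiguous support: only a crude first guess, since the
    # scale is ignored entirely here
    return x.min() - eps

# e.g. expon.fit(x, loc=start_loc(expon, x)) with x as above
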

> But this is not yet merged with a semi-frozen fit method.
>
> fit for distributions that have support on the entire real line works
> pretty well.
>

Ah, ok.  I was just working on the generic likelihood model and
happened to start my comparison with an exponential distribution,
which turned out not to be such a good choice.

Skipper


