[SciPy-Dev] Fitting distributions [Was Re: Warnings raised (from fit in scipy.stats)]

Fri Jun 11 18:47:16 EDT 2010

On Fri, Jun 11, 2010 at 6:03 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Fri, Jun 11, 2010 at 5:58 PM,  <josef.pktd at gmail.com> wrote:
>> On Fri, Jun 11, 2010 at 5:47 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>> On Fri, Jun 11, 2010 at 5:42 PM,  <josef.pktd at gmail.com> wrote:
>>>> On Fri, Jun 11, 2010 at 5:11 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>> On Fri, Jun 11, 2010 at 2:49 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>>>>> On Fri, Jun 11, 2010 at 12:20 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>> On Fri, Jun 11, 2010 at 2:17 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>>>>>>> On Fri, Jun 11, 2010 at 11:34 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>>>> On Fri, Jun 11, 2010 at 1:07 PM,  <josef.pktd at gmail.com> wrote:
>>>>>>>>>> On Fri, Jun 11, 2010 at 12:45 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>>>>>> Since the raising of warning behavior has been changed (I believe), I
>>>>>>>>>>> have been running into a lot of warnings in my code when say I do
>>>>>>>>>>> something like
>>>>>>>>>>>
>>>>>>>>>>> In [120]: from scipy import stats
>>>>>>>>>>>
>>>>>>>>>>> In [121]: y = [-45, -3, 1, 0, 1, 3]
>>>>>>>>>>>
>>>>>>>>>>> In [122]: v = stats.norm.pdf(y)/stats.norm.cdf(y)
>>>>>>>>>>> Warning: invalid value encountered in divide
>>>>>>>>>>>
>>>>>>>>>>> Sometimes, this is useful to know.  Sometimes, though, it's very
>>>>>>>>>>> disturbing when it's encountered in some kind of iteration or
>>>>>>>>>>> optimization.  I have been using numpy.clip to get around this in my
>>>>>>>>>>> own code, but when it's buried a bit deeper, it's not quite so simple.
>>>>>>>>>>>
>>>>>>>>>>> Take this example.
>>>>>>>>>>>
>>>>>>>>>>> In [123]: import numpy as np
>>>>>>>>>>>
>>>>>>>>>>> In [124]: np.random.seed(12345)
>>>>>>>>>>>
>>>>>>>>>>> In [125]: B = 6.0
>>>>>>>>>>>
>>>>>>>>>>> In [126]: x = np.random.exponential(scale=B, size=5000)
>>>>>>>>>>>
>>>>>>>>>>> In [127]: from scipy.stats import expon
>>>>>>>>>>>
>>>>>>>>>>> In [128]: expon.fit(x)
>>>>>>>>>>>
>>>>>>>>>>> <dozens of warnings clipped>
>>>>>>>>>>>
>>>>>>>>>>> Out[128]: (0.21874043533906118, 5.7122829778172939)
>>>>>>>>>>>
>>>>>>>>>>> The fit is achieved by fmin (as far as I know, since disp=0 in the
>>>>>>>>>>> rv_continuous.fit...), but there are a number of warnings emitted.  Is
>>>>>>>>>>> there any middle ground to be had in these type of situations via
>>>>>>>>>>> context management perhaps?
>>>>>>>>>>>
>>>>>>>>>>> Should I file a ticket?
>>>>>>>>>>
>>>>>>>>>> Which numpy scipy versions are you using?
>>>>>>>>
>>>>>>>> I get know warnings
>>>>>>>>>>> import numpy as np
>>>>>>>>>>> np.random.seed(12345)
>>>>>>>>>>> B = 6.0
>>>>>>>>>>> x = np.random.exponential(scale=B, size=5000)
>>>>>>>>>>> from scipy.stats import expon
>>>>>>>>>>> expon.fit(x)
>>>>>>>> array([  6.43573559e-04,   5.93058867e+00])
>>>>>>>>
>>>>>>>
>>>>>>> You also get different values than I do, which shouldn't be the case.
>>>>>>>
>>>>>>> I just discovered that my expon.fit(x) just returns the first and
>>>>>>> second moments of the data (even when I set floc = 0, I still get the
>>>>>>> second moment), and I am trying to figure out why.  It looks like
>>>>>>> something is amiss.
>>>>>>
>>>>>
>>>>> So maybe I am missing something (quite likely), but the reason that
>>>>> the expon.fit(x), (silently) doesn't work in the above is that
>>>>> expon.nnlf returns inf for the default start loc.
>>>>>
>>>>> Should fit_loc_scale be overwritten for expon?
>>>>>
>>>>> In [60]: expon.fit_loc_scale(x)
>>>>> Out[60]: (0.21874043533906118, 5.7122829778172939)
>>>>>
>>>>> In [61]: expon.nnlf(expon.fit_loc_scale(x),x)
>>>>> Out[61]: inf
>>>>>
>>>>> Changing fmin to disp=1, gives
>>>>>
>>>>> In[62]: expon.fit(x)
>>>>>
>>>>> <snip all the warnings, except from the solver>
>>>>>
>>>>> Warning: Maximum number of function evaluations has been exceeded.
>>>>> Out[62]: (0.21874043533906118, 5.7122829778172939)
>>>>>
>>>>> The default loc is defined (for this case as)
>>>>>
>>>>> loc = x.mean() - x.std()
>>>>>
>>>>> so for any x <= loc it is outside of the domain of the exponential
>>>>> distribution when it gets centered in nnlf.
>>>>>
>>>>> But
>>>>>
>>>>> In [63]: expon.fit(x,loc=0)
>>>>> Optimization terminated successfully.
>>>>>         Current function value: 13900.441325
>>>>>         Iterations: 59
>>>>>         Function evaluations: 107
>>>>> Out[63]: (0.00064357307755842945, 5.9303783425183241)
>>>>
>>>> I haven't looked yet in details at the new fit method.
>>>>
>>>> But for many distributions, especially those with a finite boundary,
>>>> fit works only if the starting values are well chosen. In this case, I
>>>> would set the starting value for loc at (x.min() - a little bit),
>>>> unless loc is frozen at zero.
>>>>
>>>
>>> Yeah that's what I settled on, though it wasn't obvious (to me) and
>>> took some digging to get to the bottom of.
>>>
>>>> I didn't check where the default starting value for expon has been changed.
>>>>
>>>
>>> It's inherited from rv_continuous.
>>
>> If you need starting values for more distributions, I have a script
>> where I categorize all distributions, by whether the support is open,
>> left-bounded, right-bounded, bounded or ambiguous, and choose
>> appropriate starting values.
>>
>
> Could it at some point be used within the distributions for
> distribution-specific starting values?

That's the plan, but it's still quite a bit of work to get it working
out of the box for all distributions. Cases like exponential
(distributions with lower bound at zero) would be relatively easy.

Josef

>
>> But this is not yet merged with a semi-frozen fit method.
>>
>> fit for distributions that have support on the entire real line work
>> pretty well.
>>
>
> Ah, ok.  I was just working on the generic likelihood model and
> happened to start my comparison with an exponential distribution,
> which turned out to not be such a good choice.
>
> Skipper
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>