[SciPy-Dev] Truncated distributions

Tue Nov 6 04:51:58 EST 2018

________________________________
From: SciPy-Dev <scipy-dev-bounces+ali.cetin=outlook.com at python.org> on behalf of Ralf Gommers <ralf.gommers at gmail.com>
Sent: Tuesday, November 6, 2018 07:45
To: SciPy Developers List
Subject: Re: [SciPy-Dev] Truncated distributions

On Mon, Nov 5, 2018 at 2:54 AM Ali Cetin <ali.cetin at outlook.com> wrote:

________________________________
From: SciPy-Dev <scipy-dev-bounces+ali.cetin=outlook.com at python.org> on behalf of Warren Weckesser <warren.weckesser at gmail.com>
Sent: Monday, November 5, 2018 10:20
To: SciPy Developers List
Subject: Re: [SciPy-Dev] Truncated distributions

On 11/5/18, Robert Kern <robert.kern at gmail.com> wrote:
> On Sun, Nov 4, 2018 at 10:36 PM Ralf Gommers <ralf.gommers at gmail.com> wrote:
>>
>> On Sun, Nov 4, 2018 at 7:01 AM Ali Cetin <ali.cetin at outlook.com> wrote:
>>>
>>> Hi all,
>>>
>>> I note that quite a few truncated distribution functions are available
>>> in SciPy - nice!
>>>
>>> However, I find the usefulness of these functions somewhat limited when
>>> it is desired to fit them to data; in most common scenarios the truncation
>>> point is known (or even determined by the user/experimenter), and therefore
>>> does not need to be treated as a free parameter. In the current scipy.stats
>>> framework, the truncation parameters are accepted as "shape" parameters.
>>> Therefore, it is only possible to lock the "normalized" truncation point
>>> during fitting. This is a catch-22, since the user is required to provide
>>> loc and scale parameters a priori, which are unknown.
>>
>> I'm not sure I follow. The fit() docstring says: "Return MLEs for shape
>> (if applicable), location, and scale parameters from data." So it should be
>> fitting everything. Could you provide an example perhaps?
>
> The problem is with how we defined the truncation parameters of these
> distributions, not `fit()` per se. You are supposed to specify them
> *relative* to the `loc` and `scale`. So you could fit `truncnorm()` such
> that the bounds are fixed to, say, (-3*sigma, +3*sigma) where `sigma` is
> freely being adjusted during the fit but not to (-10, +10) regardless of
> the `scale`. Ali wants to do the latter. `fit()` doesn't work for this use
> case.
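
To make the relative-versus-absolute distinction concrete, here is a minimal
sketch. The helper name `standardized_bounds` and the numbers are made up for
illustration; the only SciPy facts relied on are that `truncnorm` takes its
bounds `a` and `b` in standardized units, i.e. `a = (lower - loc) / scale` and
`b = (upper - loc) / scale`.

```python
import numpy as np
from scipy import stats


def standardized_bounds(lower, upper, loc, scale):
    """Convert absolute truncation points to truncnorm's relative a, b.

    Hypothetical helper for illustration; it is not part of SciPy.
    """
    return (lower - loc) / scale, (upper - loc) / scale


# Absolute truncation at (-10, +10) for an assumed loc=2, scale=4.
a, b = standardized_bounds(-10.0, 10.0, loc=2.0, scale=4.0)

# With matching loc/scale, the support ends up at the absolute bounds.
matched = stats.truncnorm(a, b, loc=2.0, scale=4.0).ppf([0.0, 1.0])

# Holding a and b fixed while scale changes moves the absolute bounds,
# which is what happens when a and b are frozen during a fit.
drifted = stats.truncnorm(a, b, loc=2.0, scale=8.0).ppf([0.0, 1.0])

print(a, b)      # relative bounds
print(matched)   # absolute support with the original scale
print(drifted)   # absolute support after doubling the scale
```

So freezing `a` and `b` pins the bounds in units of `scale`, not in units of
the data.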

Thanks, that helps.
>
>>> I would like to propose a solution that only requires a minor adjustment of
>>> the current framework: allow the user to supply the truncation point as
>>> part of the data array -> for left and right truncation, the 0th and the
>>> Nth element of the array are popped out and used as truncation parameters,
>>> respectively.
>>
>> That doesn't look like a viable solution. Let's first establish if there
>> really is an issue like you're describing, and then look for a cleaner
>> addition to the API.
>
> I agree that this is not the right approach.
>
> Personally, I think that it's hard to get all of the corner cases right to
> make `fit()` Just Work(TM). The distributions API has a lot of them. Most
> of `fit()`'s implementation is just trying to work around as many of them
> as it can in order to be general. But the core of it is pretty
> straightforward: use a `scipy.optimize` minimizer on the `nnlf()` method.
>
> This is the way I approach it: if `fit()` does what I want it to do, great,
> I'll use it. But if it looks hard to shoehorn what I want into `fit()`,
> I'll just go ahead and call `scipy.optimize.minimize()` myself. `fit()` is
> a good convenience when it works in the easy cases, but the alternative is
> straightforward for the harder ones. I don't see much benefit to trying to
> make `fit()` scale to the harder, weirder cases. I'd rather document the
> path up to `scipy.optimize`. Ali's example would be a good recipe to
> demonstrate this.
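
A rough sketch of the recipe described above, for the case of known absolute
truncation points: minimize `truncnorm.nnlf()` over `loc` and `scale` with
`scipy.optimize.minimize()`, recomputing the relative bounds at every step.
The function name `fit_truncnorm_fixed_bounds`, the log-scale
parameterization, the Nelder-Mead choice and the synthetic data are all
assumptions made for this example, not anything prescribed in the thread.

```python
import numpy as np
from scipy import optimize, stats


def fit_truncnorm_fixed_bounds(data, lower, upper):
    """Fit loc and scale of a truncated normal with known absolute bounds.

    Hypothetical helper for illustration; it is not part of SciPy.
    """
    data = np.asarray(data, dtype=float)

    def objective(params):
        loc, log_scale = params
        scale = np.exp(log_scale)  # optimize log(scale) so scale stays positive
        # Recompute the relative shape parameters for every trial loc/scale,
        # so the absolute truncation points never move.
        a = (lower - loc) / scale
        b = (upper - loc) / scale
        return stats.truncnorm.nnlf((a, b, loc, scale), data)

    # Crude starting point: sample mean and standard deviation.
    x0 = [np.mean(data), np.log(np.std(data))]
    result = optimize.minimize(objective, x0, method="Nelder-Mead")
    loc, log_scale = result.x
    return loc, np.exp(log_scale), result


# Synthetic data truncated to [-1, 5], drawn with loc=1 and scale=2.
lower, upper = -1.0, 5.0
data = stats.truncnorm.rvs((lower - 1.0) / 2.0, (upper - 1.0) / 2.0,
                           loc=1.0, scale=2.0, size=2000, random_state=0)
loc_hat, scale_hat, result = fit_truncnorm_fixed_bounds(data, lower, upper)
print(loc_hat, scale_hat)
```

The estimates can be noticeably less precise than for an untruncated normal,
since the tails that pin down `loc` and `scale` are missing from the data.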

I agree, and I recently did just this in an answer on stackoverflow:
https://stackoverflow.com/questions/53125437/fitting-data-using-scipy-truncnorm

Warren

Thanks for chiming in on this, guys.
I have created an issue on GitHub (https://github.com/scipy/scipy/issues/9439) with an example. (I hate reading and writing code in email 😊).

Anyway, I do agree with Robert and Warren that this is achievable with a few lines of code (as I also included in the example). However, is it fair to expect a user without an advanced understanding of statistics and optimization to deal with this? (Furthermore, you could make the same argument about the regular fit method. Should we just remove the method altogether then? Of course not.) I mean, most of my colleagues in the industry are highly capable engineers, but statistics and probability theory are more or less a "black box" to them.

I think it would be better for end users if the API were altered to natively accommodate truncated distributions (as Ralf suggests).

The problem is that we're going to have corner cases like this for many distributions, and we'd spend a lot of effort trying to cover them (and failing to achieve full coverage).

I agree with Robert and Warren: clearly documenting how to deal with such cases is the better solution here. The right place is probably a discussion and worked example in the user guide, linked from the rv_continuous.fit docstring.

Cheers,
Ralf

I can contribute by documenting this; just point me to the right place.

But we have to keep in mind that it is the private method (rv_continuous._penalized_nnlf) that is used internally by the optimizer, not the public nnlf method. The private method provides some boilerplate and safeguards. How about renaming rv_continuous._penalized_nnlf -> rv_continuous.objective_function, thus making the "workaround" you suggest more transparent? This would also allow users to define custom objective functions when needed, without messing around with internal methods.
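
To illustrate the kind of safeguard I mean, here is a rough sketch of a
user-defined objective. `penalized_objective` and its `penalty` value are
hypothetical and are not SciPy's `_penalized_nnlf`; the idea is just that
points outside the support add a large finite penalty instead of making the
whole objective infinite, which is what the public `nnlf` returns in that case.

```python
import numpy as np
from scipy import stats


def penalized_objective(theta, x, dist, penalty=1e6):
    """Negative log-likelihood plus a finite penalty per out-of-support point.

    Hypothetical sketch; theta is (*shapes, loc, scale), as for dist.nnlf.
    """
    *shapes, loc, scale = theta
    if scale <= 0:
        return np.inf
    logpdf = dist.logpdf(np.asarray(x, dtype=float), *shapes, loc=loc, scale=scale)
    finite = np.isfinite(logpdf)
    n_bad = np.count_nonzero(~finite)  # points where the density is zero
    return -np.sum(logpdf[finite]) + n_bad * penalty


# One of the three points (12.0) lies outside the support [0, 2].
theta = (-1.0, 1.0, 1.0, 1.0)  # a, b, loc, scale
x = [0.5, 1.5, 12.0]
plain = stats.truncnorm.nnlf(theta, np.asarray(x))
penalized = penalized_objective(theta, x, stats.truncnorm)
print(plain, penalized)
```

A finite penalty still tells the optimizer which direction is "less bad",
whereas a flat infinity gives it nothing to work with.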

Cheers,
Ali