[SciPy-Dev] Truncated distributions

Mon Nov 5 05:54:09 EST 2018

________________________________
From: SciPy-Dev <scipy-dev-bounces+ali.cetin=outlook.com at python.org> on behalf of Warren Weckesser <warren.weckesser at gmail.com>
Sent: Monday, November 5, 2018 10:20
To: SciPy Developers List
Subject: Re: [SciPy-Dev] Truncated distributions

On 11/5/18, Robert Kern <robert.kern at gmail.com> wrote:
> On Sun, Nov 4, 2018 at 10:36 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
>>
>> On Sun, Nov 4, 2018 at 7:01 AM Ali Cetin <ali.cetin at outlook.com> wrote:
>>>
>>> Hi all,
>>>
>>> I note that quite a few truncated distribution functions are available
> in SciPy - nice!
>>>
>>> However, I find the usefulness of these functions somewhat limited when
> it is desired to fit them to data; in most common scenarios the truncation
> point is known (or even determined by the user/experimenter), and therefore
> does not need to be treated as a free parameter. In the current scipy.stats
> framework, the truncation parameters are accepted as "shape" parameters.
> Therefore, it is only possible to lock the "normalized" truncation point
> during fitting. This is a catch-22, since the user is required to provide
> loc and scale parameters a priori, which are unknown.
>>
>> I'm not sure I follow. The fit() docstring says:  "Return MLEs for shape
> (if applicable), location, and scale parameters from data." So it should be
> fitting everything. Could you provide an example perhaps?
>
> The problem is with how we defined the truncation parameters of these
> distributions, not `fit()` per se. You are supposed to specify them
> *relative* to the `loc` and `scale`. So you could fit `truncnorm()` such
> that the bounds are fixed to, say, (-3*sigma, +3*sigma) where `sigma` is
> freely being adjusted during the fit but not to (-10, +10) regardless of
> the `scale`. Ali wants to do the latter. `fit()` doesn't work for this use
> case.
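
To make the relative parametrization concrete, here is a minimal sketch. The helper name `standardized_bounds` is made up for illustration, and the snippet assumes a SciPy version recent enough to provide `support()` on frozen distributions. It shows that fixed shape parameters give a support that moves with `loc` and `scale`, while fixed absolute bounds require recomputing the shape parameters whenever `loc` or `scale` changes:

```python
from scipy import stats


def standardized_bounds(lower, upper, loc, scale):
    """Convert absolute truncation points into truncnorm's shape parameters.

    truncnorm expects its bounds a, b in units of `scale`, relative to `loc`.
    (Hypothetical helper, not part of SciPy.)
    """
    return (lower - loc) / scale, (upper - loc) / scale


loc, scale = 1.0, 2.0

# Fixed shape parameters: the actual truncation points follow loc and scale.
a, b = -3.0, 3.0
support_relative = stats.truncnorm(a, b, loc=loc, scale=scale).support()
# Here the support is (loc - 3*scale, loc + 3*scale) = (-5.0, 7.0).

# Fixed absolute truncation points: a and b depend on loc and scale.
lower, upper = -10.0, 10.0
a_abs, b_abs = standardized_bounds(lower, upper, loc, scale)
support_absolute = stats.truncnorm(a_abs, b_abs, loc=loc, scale=scale).support()
# Here the support is (-10.0, 10.0), but a_abs and b_abs are only valid
# for this particular (loc, scale) pair.
```

Because `a_abs` and `b_abs` change with every trial value of `loc` and `scale`, freezing them via `fa=`/`fb=` in `fit()` cannot hold the absolute bounds fixed.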
>
>>> I would like to propose a solution that requires only a minor adjustment of
> the current framework: allow the user to supply the truncation points as
> part of the data array -> for left and right truncation, the 0th and the
> Nth elements of the array are popped out and used as truncation parameters,
> respectively.
>>
>> That doesn't look like a viable solution. Let's first establish if there
> really is an issue like you're describing, and then look for a cleaner
> addition to the API.
>
> I agree that this is not the right approach.
>
> Personally, I think that it's hard to get all of the corner cases right to
> make `fit()` Just Work(TM). The distributions API has a lot of them. Most
> of `fit()`'s implementation is just trying to work around as many of them
> as it can in order to be general. But the core of it is pretty
> straightforward: use a `scipy.optimize` minimizer on the `nnlf()` method.
>
> This is the way I approach it: if `fit()` does what I want it to do, great,
> I'll use it. But if it looks hard to shoehorn what I want into `fit()`,
> I'll just go ahead and call `scipy.optimize.minimize()` myself. `fit()` is
> a good convenience when it works in the easy cases, but the alternative is
> straightforward for the harder ones. I don't see much benefit to trying to
> make `fit()` scale to the harder, weirder cases. I'd rather document the
> path up to `scipy.optimize`. Ali's example would be a good recipe to
> demonstrate this.
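
A minimal sketch of the "path up to `scipy.optimize`" for this use case might look like the following. It is an illustration, not the code from the GitHub issue or the Stack Overflow answer: the function name `fit_truncnorm_fixed_bounds`, the log-scale parametrization, the starting values, and the choice of Nelder-Mead are all assumptions. The negative log-likelihood is written out with `logpdf` rather than `nnlf()` so that the recomputation of the shape parameters is visible:

```python
import numpy as np
from scipy import optimize, stats


def fit_truncnorm_fixed_bounds(data, lower, upper):
    """MLE of loc and scale for a truncated normal with known absolute bounds.

    Hypothetical helper: `lower` and `upper` are held fixed in data units,
    so the shape parameters a, b are recomputed at every trial (loc, scale).
    """
    data = np.asarray(data, dtype=float)

    def negative_log_likelihood(params):
        loc, log_scale = params
        scale = np.exp(log_scale)  # optimizing log(scale) keeps scale positive
        a = (lower - loc) / scale
        b = (upper - loc) / scale
        return -np.sum(stats.truncnorm.logpdf(data, a, b, loc=loc, scale=scale))

    # Crude starting point: sample mean and standard deviation.
    start = [np.mean(data), np.log(np.std(data))]
    result = optimize.minimize(negative_log_likelihood, start, method="Nelder-Mead")
    return result.x[0], np.exp(result.x[1]), result


# Usage: simulate data truncated to [-2, 6], then recover loc and scale.
true_loc, true_scale = 1.0, 2.0
lower, upper = -2.0, 6.0
a_true = (lower - true_loc) / true_scale
b_true = (upper - true_loc) / true_scale
sample = stats.truncnorm.rvs(a_true, b_true, loc=true_loc, scale=true_scale,
                             size=5000, random_state=12345)

loc_hat, scale_hat, result = fit_truncnorm_fixed_bounds(sample, lower, upper)
print(loc_hat, scale_hat)
```

The estimates should land near the true values for a sample of this size, though how close depends on how much of the distribution the truncation removes.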

I agree, and I recently did just this in an answer on stackoverflow:
https://stackoverflow.com/questions/53125437/fitting-data-using-scipy-truncnorm

Warren

Thanks for chiming in on this, guys.
I have created an issue on GitHub (https://github.com/scipy/scipy/issues/9439) with an example. (I hate reading and writing code in email 😊).

Anyway, I do agree with Robert and Warren that this is achievable with a few lines of code (as I also included in the example). However, is it fair to expect a user without an advanced understanding of statistics and optimization to deal with this? (Furthermore, you could make the same argument about the regular fit method. Should we just remove the method altogether then? Of course not.) I mean, most of my colleagues in the industry are highly capable engineers, but statistics and probability theory are more or less a "black box" to them.

I think it would be better for end users if the API were altered to natively accommodate truncated distributions (as Ralf suggests).

Cheers,
Ali

>
> --
> Robert Kern
>
_______________________________________________
SciPy-Dev mailing list
SciPy-Dev at python.org
https://mail.python.org/mailman/listinfo/scipy-dev