[SciPy-Dev] Truncated distributions

Tue Nov 6 01:45:46 EST 2018

On Mon, Nov 5, 2018 at 2:54 AM Ali Cetin <ali.cetin at outlook.com> wrote:

>
> ------------------------------
>
> *From:* SciPy-Dev <scipy-dev-bounces+ali.cetin=outlook.com at python.org> on
> behalf of Warren Weckesser <warren.weckesser at gmail.com>
> *Sent:* Monday, November 5, 2018 10:20
> *To:* SciPy Developers List
> *Subject:* Re: [SciPy-Dev] Truncated distributions
>
> On 11/5/18, Robert Kern <robert.kern at gmail.com> wrote:
> > On Sun, Nov 4, 2018 at 10:36 PM Ralf Gommers <ralf.gommers at gmail.com>
> > wrote:
> >>
> >> On Sun, Nov 4, 2018 at 7:01 AM Ali Cetin <ali.cetin at outlook.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I note that quite a few truncated distribution functions are available
> >>> in SciPy - nice!
> >>>
> >>> However, I find the usefulness of these functions somewhat limited when
> >>> it is desired to fit them to data; in most common scenarios the
> >>> truncation point is known (or even determined by the user/experimenter),
> >>> and therefore does not need to be treated as a free parameter. In the
> >>> current scipy.stats framework, the truncation parameters are accepted as
> >>> "shape" parameters. Therefore, it is only possible to lock the
> >>> "normalized" truncation point during fitting. This is a catch-22, since
> >>> the user is required to provide loc and scale parameters a priori, which
> >>> are unknown.
> >>
> >> I'm not sure I follow. The fit() docstring says: "Return MLEs for shape
> >> (if applicable), location, and scale parameters from data." So it should
> >> be fitting everything. Could you provide an example perhaps?
> >
> > The problem is with how we defined the truncation parameters of these
> > distributions, not `fit()` per se. You are supposed to specify them
> > *relative* to the `loc` and `scale`. So you could fit `truncnorm()` such
> > that the bounds are fixed to, say, (-3*sigma, +3*sigma) where `sigma` is
> > freely being adjusted during the fit, but not to (-10, +10) regardless of
> > the `scale`. Ali wants to do the latter. `fit()` doesn't work for this
> > use case.
>
>
Thanks, that helps.
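
To make the relative parametrization concrete, here is a minimal sketch.
The numbers are made up for illustration and are not taken from Ali's
example. `truncnorm` takes its bounds `a` and `b` in standardized units, so
the absolute truncation points are `loc + a*scale` and `loc + b*scale`, and
they move whenever `scale` changes:

```python
import numpy as np
from scipy import stats

# Illustrative values only (assumptions, not from the thread).
lo, hi = -10.0, 10.0    # absolute truncation points we actually care about
loc, scale = 2.0, 4.0   # some location and scale

# truncnorm expects the bounds relative to loc and scale.
a = (lo - loc) / scale  # -3.0
b = (hi - loc) / scale  #  2.0

dist = stats.truncnorm(a, b, loc=loc, scale=scale)
bounds = dist.ppf([0.0, 1.0])  # end points of the support

# Holding a and b fixed while the scale changes moves the absolute bounds.
# This is what fixing the shape parameters in fit() amounts to.
rescaled = stats.truncnorm(a, b, loc=loc, scale=2 * scale)
rescaled_bounds = rescaled.ppf([0.0, 1.0])

print(bounds)           # absolute bounds for the original scale
print(rescaled_bounds)  # different absolute bounds for the doubled scale
```

So fixing `a` and `b` pins the bounds in units of `scale`, which is not the
same as pinning them at (-10, +10).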

> >
> >>> I would like to propose a solution that only requires a minor adjustment
> >>> of the current framework: allow the user to supply the truncation point
> >>> as part of the data array -> for left and right truncation, the 0th and
> >>> the Nth element of the array are popped out and used as truncation
> >>> parameters, respectively.
> >>
> >> That doesn't look like a viable solution. Let's first establish if there
> >> really is an issue like you're describing, and then look for a cleaner
> >> addition to the API.
> >
> > I agree that this is not the right approach.
> >
> > Personally, I think that it's hard to get all of the corner cases right
> > to make `fit()` Just Work(TM). The distributions API has a lot of them.
> > Most of `fit()`'s implementation is just trying to work around as many of
> > them as it can in order to be general. But the core of it is pretty
> > straightforward: use a `scipy.optimize` minimizer on the `nnlf()` method.
> >
> > This is the way I approach it: if `fit()` does what I want it to do,
> > great, I'll use it. But if it looks hard to shoehorn what I want into
> > `fit()`, I'll just go ahead and call `scipy.optimize.minimize()` myself.
> > `fit()` is a good convenience when it works in the easy cases, but the
> > alternative is straightforward for the harder ones. I don't see much
> > benefit to trying to make `fit()` scale to the harder, weirder cases. I'd
> > rather document the path up to `scipy.optimize`. Ali's example would be a
> > good recipe to demonstrate this.
>
>
> I agree, and I recently did just this in an answer on stackoverflow:
>
> https://stackoverflow.com/questions/53125437/fitting-data-using-scipy-truncnorm
>
> Warren
>
>
> Thanks for chiming in on this, guys.
> I have created an issue on GitHub (
> https://github.com/scipy/scipy/issues/9439) with an example. (I hate
> reading and writing code in email 😊).
>
> Anyway, I do agree with Robert and Warren that this is achievable with a
> few lines of code (as I also included in the example). However, is it fair
> to expect a user without an advanced understanding of
> statistics/optimization problems to deal with this? (Furthermore, you can
> make this argument about the regular fit method. Should we just remove the
> method altogether then? Of course not.) I mean, most of my colleagues in
> the industry are highly capable engineers, but statistics and probability
> theory are more or less a "black box" to them.
>
> I think it would be better for the end-users to alter the API to natively
> accommodate truncated distributions (as Ralf suggests).
>

The problem is that we're going to have corner cases like this for many
distributions, and we'd spend a lot of effort trying to cover them (and
failing to achieve full coverage).

I agree with Robert and Warren: clearly documenting how to deal with such
cases is the better solution here. A discussion and worked example in the
user guide, linked from the rv_continuous.fit docstring, is probably the
right approach.
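
As a rough idea of what such a worked example could look like, here is a
minimal sketch. It is not the code from the GitHub issue or from Warren's
Stack Overflow answer. The helper names `neg_log_likelihood` and
`fit_truncnorm_fixed_bounds`, the synthetic data, and the choice of
Nelder-Mead are all assumptions made for illustration. It minimizes the
negative log-likelihood of `truncnorm` directly (via `logpdf` rather than
`nnlf()`), recomputing `a` and `b` from the fixed absolute bounds at every
step:

```python
import numpy as np
from scipy import optimize, stats


def neg_log_likelihood(params, data, lower, upper):
    """Negative log-likelihood of truncnorm with fixed absolute bounds."""
    loc, log_scale = params
    scale = np.exp(log_scale)  # optimize log(scale) so scale stays positive
    # Convert the absolute bounds to the standardized a, b that truncnorm uses.
    a = (lower - loc) / scale
    b = (upper - loc) / scale
    return -np.sum(stats.truncnorm.logpdf(data, a, b, loc=loc, scale=scale))


def fit_truncnorm_fixed_bounds(data, lower, upper):
    """Estimate loc and scale by maximum likelihood; bounds are held fixed."""
    data = np.asarray(data, dtype=float)
    x0 = np.array([np.mean(data), np.log(np.std(data))])  # crude start
    res = optimize.minimize(neg_log_likelihood, x0,
                            args=(data, lower, upper), method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1]), res


# Synthetic data with known parameters (illustrative values only).
rng = np.random.RandomState(0)
true_loc, true_scale = 1.0, 2.0
lower, upper = -2.0, 6.0
sample = stats.truncnorm.rvs((lower - true_loc) / true_scale,
                             (upper - true_loc) / true_scale,
                             loc=true_loc, scale=true_scale,
                             size=5000, random_state=rng)

loc_hat, scale_hat, res = fit_truncnorm_fixed_bounds(sample, lower, upper)
print(loc_hat, scale_hat)
```

The point of the sketch is only that the truncation points stay at `lower`
and `upper` while `loc` and `scale` are free, which is the case `fit()`
cannot express.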

Cheers,
Ralf