[SciPy-Dev] Deprecate planck distribution?

Thu Jan 3 16:29:22 EST 2019

On Thu, Jan 3, 2019 at 9:22 AM Ali Cetin <ali.cetin at outlook.com> wrote:

>
>
> ------------------------------
> *From:* SciPy-Dev <scipy-dev-bounces+ali.cetin=outlook.com at python.org> on
> behalf of Robert Kern <robert.kern at gmail.com>
> *Sent:* Wednesday, January 2, 2019 21:07
> *To:* SciPy Developers List
> *Subject:* Re: [SciPy-Dev] Deprecate planck distribution?
>
> On Wed, Jan 2, 2019 at 1:36 AM Christoph Baumgarten <
> christoph.baumgarten at gmail.com> wrote:
> >
> > Hi all,
> >
> > happy new year!
> >
> > I noted that the Planck distribution is a geometric distribution with a
> different parametrization, see Issue #9359:
> >
> > import numpy as np
> > from scipy.stats import planck, geom
> >
> > a = 0.5
> > k = np.arange(20)
> > sum(abs(geom.pmf(k, 1-np.exp(-a), loc=-1) - planck.pmf(k, a))) # 1.30e-18
> >
> > I don't know if there is a specific reason to have the Planck
> distribution in addition to the geometric. If not, I would propose to
> deprecate it.
> >
> > Any views? Thanks
>
> If we were to turn back time, and the question was whether to *add* the
> Planck distribution given that we had the geometric distribution, I would
> probably be convinced by this. However, given that the Planck distribution
> has already been added, I don't think that it's worth removing it. The
> marginal cost to having this alternate parameterization is likely less than
> the cost of anyone changing their code.
>
> The collection of probability distributions is also a place where some
> nontrivial duplication actually has some positive value. People typically
> come to `scipy.stats` with a distribution (with a name and specific
> parameterization conventions) already in mind. Having more than one
> parameterization available helps people recognize the distribution that
> they want; having an alternate present doesn't impair the search task while
> not having one they are looking for (or burying it in the Notes of the
> docstring of the canonical version) can make the search task much harder.
> It's a common complaint that `scipy.stats` doesn't expose certain common
> parameterizations of distributions, so we should probably be working to
> expand the collection of parameterizations rather than collapsing them.
>
>
> Robert Kern
>
> I agree with Robert on this one. If you want to go down that rat hole, you
> will quickly find that most distribution functions are mere special cases
> and/or alternative parameterizations of a few general classes of
> distributions. If the concern is code management, then it could be argued
> that an effort should be made on abstracting distribution functions from
> these more general classes. However, personally, I prefer transparency and
> consistency with established literature when it comes to parametrization.
>

I think there is a good reason for implementing special cases instead of
only general cases: computational simplifications can be used in the
special case, and a general distribution with several extra parameters is
cumbersome and requires a lot more work from the user, e.g. setting all
the extra parameters to their special-case values.

This is not the case for pure reparameterizations that still have the same
number of parameters.
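
As a minimal sketch of the distinction (the gengamma/expon pair is just an
illustrative choice, not something from this thread; the planck/geom pair
is the one from the original post):

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 5, 50)

# Special case: gengamma(a=1, c=1) reduces to the exponential, but the
# user has to know and supply the special-case shape values.
pdf_general = stats.gengamma.pdf(x, 1, 1)
pdf_special = stats.expon.pdf(x)
print(np.allclose(pdf_general, pdf_special))

# Pure reparameterization: planck(lam) is geom(p) shifted by loc=-1,
# with p = 1 - exp(-lam); both have exactly one shape parameter.
lam = 0.5
k = np.arange(20)
pmf_planck = stats.planck.pmf(k, lam)
pmf_geom = stats.geom.pmf(k, 1 - np.exp(-lam), loc=-1)
print(np.allclose(pmf_planck, pmf_geom))
```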

The main straitjacket in scipy.stats in terms of parameterization is that
all continuous distributions use the loc-scale (plus possibly shape)
parameterization.
I think there are enough maintainers now (where I don't count myself), that
it would be feasible to add other distribution classes that don't have to
follow the loc-scale parameterization, or that could be intermediate
classes for groups of similar distributions.

For example, I think something similar to the frozen distribution class
could be added that is just a Reparameterization class, i.e. internally
delegates to a standard scipy distribution, but uses a parameterization and
parameter transformation that is more common and more familiar to users.
Another advantage of reparameterization classes would be that estimation is
often easier or more interpretable in a different parameterization. E.g.
statsmodels uses negativebinomial in the mean-dispersion parameterization
instead of the common negbin parameterization.
A further advantage is that the Hessian (and with it the covariance of the
parameter estimates) often has a nicer shape in a different parameterization.
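
A rough sketch of what such a wrapper could look like, for the negative
binomial in a mean-dispersion form. The class name MeanDispersionNegBin and
the (mu, alpha) argument names are hypothetical; nothing like this exists in
scipy.stats, and the mapping assumes the variance function
var = mu + alpha * mu**2:

```python
import numpy as np
from scipy import stats


class MeanDispersionNegBin:
    """Hypothetical reparameterization wrapper (not a scipy class).

    Maps (mu, alpha) to scipy.stats.nbinom(n, p) with n = 1 / alpha and
    p = 1 / (1 + alpha * mu), so that mean = mu and
    var = mu + alpha * mu**2.
    """

    def __init__(self, mu, alpha):
        self.mu = mu
        self.alpha = alpha
        n = 1.0 / alpha
        p = 1.0 / (1.0 + alpha * mu)
        # Delegate all computation to a frozen standard distribution.
        self._dist = stats.nbinom(n, p)

    def __getattr__(self, name):
        # Only called when normal lookup fails; guard against recursion
        # if _dist itself has not been set yet.
        if name == "_dist":
            raise AttributeError(name)
        return getattr(self._dist, name)


nb = MeanDispersionNegBin(mu=3.0, alpha=0.5)
print(nb.mean(), nb.var())
print(nb.pmf(np.arange(5)))
```

Methods such as pmf, cdf and rvs are simply forwarded to the frozen nbinom
instance, so only the parameter transformation lives in the wrapper.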

An example of an intermediate class would be common support for
distributions that are created by a transformation of another distribution,
mainly the normal. This includes the Johnson system of distributions in the
other open thread on the list.
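
For instance, the existing johnsonsu distribution can be written as a
transformation of a standard normal. The following is only a numerical check
of that relationship with arbitrary shape values, not a proposal for how the
intermediate class would be implemented:

```python
import numpy as np
from scipy import stats

a, b = 0.7, 1.5                 # Johnson SU shape parameters (arbitrary)
x = np.linspace(-5, 5, 11)

# If Z ~ N(0, 1), then X = sinh((Z - a) / b) follows johnsonsu(a, b),
# so the cdf of X is the normal cdf evaluated at a + b * arcsinh(x).
cdf_via_normal = stats.norm.cdf(a + b * np.arcsinh(x))
cdf_direct = stats.johnsonsu.cdf(x, a, b)

print(np.allclose(cdf_via_normal, cdf_direct))
```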

(Just some thoughts, I'm currently not in this neighborhood of stats.)

Josef

>
> That's my two cents on the issue.
>
> Cheers,
> Ali Cetin