[SciPy-Dev] scipy.stats documentation

Mon May 7 10:30:04 EDT 2012

On Mon, May 7, 2012 at 8:51 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Mon, May 7, 2012 at 7:51 AM, nicky van foreest <vanforeest at gmail.com>
> wrote:
>>
>> Hi,
>>
>> I am still struggling to understand some of the scipy stats package,
>> and ran into some obscure points.

I think the main point to understanding the distributions is to
realize that we know very little about individual distributions.

stats.distributions has an elaborate generic structure for the
distributions. When I started with this, it was relatively easy
because it was just "coding". I knew maybe 10 to 15 distributions out
of the around 90 and didn't need to go through all of them or
understand the distribution specific parts.

When I tried to find parameters for the test suite for each
distribution, I started to go through them individually, specifically
to see which distributions have integer parameters. Then I realized
that none of the distributions complains when I feed it a real
(non-integer) number, so I just did a random search until I found
parameters that worked.

Over time we worked our way through some of the individual
distributions, either because there were bugs, or because other
developers were interested in them or because they showed up on the
mailing lists.

Some years ago Ralf rewrote the automatic docstring generation, which
now allows for better distribution specific docstrings. So, the way
would be open to incorporate more distribution specific information,
but someone needs to know what a distribution is supposed to be.

Also, for some purposes we don't really care about the initial history
of a distribution, for example pymvpa and some proprietary software
packages I looked at just try to find the best fitting distribution
given the data, independent of what the interpretation of the
distribution was when it was initially developed, for statistical
tests, for queuing models, extreme value analysis or whichever.

On most of the points below:

parameterization in stats.distributions versus text books

All continuous distributions have a generic treatment of loc and scale
independent of whether this is part of a standard definition of the
distribution. (discrete distributions only have a loc shift.)
(location-scale families of distributions)

dist.cdf((x-loc)/scale) = dist.cdf(x, loc=loc, scale=scale)
the standard distributions have loc=0, scale=1
pdf and other methods follow from this
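
A minimal sketch of the loc/scale identity above, using the normal
distribution as an arbitrary example. The values of loc, scale and x
are made up for illustration. The pdf line is what I mean by "follow
from this": the change of variable adds a 1/scale factor.

```python
import numpy as np
from scipy import stats

x = np.linspace(-3.0, 5.0, 9)
loc, scale = 1.0, 2.0   # illustrative values only

# cdf of the shifted and scaled distribution equals the standard cdf
# evaluated at the standardized point (x - loc) / scale
lhs = stats.norm.cdf((x - loc) / scale)
rhs = stats.norm.cdf(x, loc=loc, scale=scale)
cdf_match = np.allclose(lhs, rhs)

# the pdf picks up an extra 1/scale factor from the change of variable
pdf_match = np.allclose(stats.norm.pdf((x - loc) / scale) / scale,
                        stats.norm.pdf(x, loc=loc, scale=scale))

print(cdf_match, pdf_match)
```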

every other parameter besides loc and scale is called a shape
parameter, say theta, so we have

dist.cdf((x-loc)/scale; theta) = dist.cdf(x, *theta, loc=loc,
scale=scale)  (requires python >=2.6 :)
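
The same identity with a shape parameter can be sketched with gamma,
which has one shape parameter. The numbers are arbitrary, and theta is
just a tuple name chosen here to match the notation above.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.5, 10.0, 5)
theta = (2.5,)            # gamma has a single shape parameter
loc, scale = 0.5, 3.0     # illustrative values only

# shape parameters are passed positionally, loc and scale by keyword
lhs = stats.gamma.cdf((x - loc) / scale, *theta)
rhs = stats.gamma.cdf(x, *theta, loc=loc, scale=scale)
shape_match = np.allclose(lhs, rhs)

# numargs is the number of shape parameters of a distribution
n_shapes = (stats.norm.numargs, stats.expon.numargs, stats.gamma.numargs)

print(shape_match, n_shapes)
```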

many distributions like normal and exponential don't have a shape
parameter, just loc and scale

many "standard" definitions of distributions don't include loc and
scale as separate parameters, and scale, or a function of scale, is
often the "standard" parameter, see your example of the exponential
distribution below

http://en.wikipedia.org/wiki/Exponential_distribution
standard definition has lambda as parameter, with interpretation as
rate parameter.
If we look at the cdf = 1 − exp(−λx), then lambda just multiplies x
(instead of x/scale), so the lambda in the standard definition of
exponential is just our 1./scale, or scale=1./lambda
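
As a quick check of scale=1./lambda, the textbook cdf with a rate
parameter can be compared to expon with the corresponding scale. The
name lam and its value are chosen here only for illustration.

```python
import numpy as np
from scipy import stats

lam = 2.0                         # rate parameter of the textbook definition
x = np.linspace(0.0, 4.0, 9)

# textbook cdf written with the rate parameter
textbook_cdf = 1.0 - np.exp(-lam * x)

# the same distribution in stats.distributions, via scale = 1 / lam
scipy_cdf = stats.expon.cdf(x, scale=1.0 / lam)
rate_match = np.allclose(textbook_cdf, scipy_cdf)

# the mean of an exponential with rate lam is 1 / lam, which is the scale
mean_match = np.isclose(stats.expon.mean(scale=1.0 / lam), 1.0 / lam)

print(rate_match, mean_match)
```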

Sometimes the parameterization in stats.distributions is "a bit
difficult" to translate to the standard parameterization, for example
for the lognormal, which regularly raises questions.
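
For the lognormal, a sketch of the translation, assuming the usual
textbook parameterization where log(X) is normal with mean mu and
standard deviation sigma: the shape parameter is sigma and the scale is
exp(mu), with loc left at 0. The names mu and sigma and their values
are chosen here for illustration.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.7, 0.4              # mean and std of log(X), illustrative
x = np.linspace(0.1, 8.0, 9)

# shape s = sigma, scale = exp(mu), loc stays at its default of 0
rv = stats.lognorm(sigma, scale=np.exp(mu))

# P(X <= x) = P(log X <= log x), with log X ~ Normal(mu, sigma)
lognorm_match = np.allclose(
    rv.cdf(x), stats.norm.cdf(np.log(x), loc=mu, scale=sigma))

# the median of the lognormal is exp(mu), i.e. the scale
median_match = np.isclose(rv.median(), np.exp(mu))

print(lognorm_match, median_match)
```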

The documentation should be improved wherever possible, some parts I
might never have read very carefully, other parts I might interpret in
the "right way" even if it's not clear as general documentation for
someone that isn't familiar with the details.

>>
>> 1)
>>
>> What is actually the shape parameter?  Let me include some references
>> to show my confusion here.
>>
>> In expon it does not seem to exist:
>>
>>
>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L2770
>>
>> Then, in Erlang it is called 'n'. I suppose this would mean the number
>> of stages. So in Erlang, why then is the scale parameter corresponding
>> to the shape? BTW: should the scale in the erlang dist docstring not
>> be explained?

Not sure I understand the first part. I never looked at Erlang until recently.

>>
>> Then, from the gamma dist I learn the following:
>>
>>
>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L3382
>>
>> So that would mean that in the expon dist the shape is set to 1.
>>
>> Then, here:
>>
>>
>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.erlang.html
>>
>> it states that the shape parameter should be an int, but in the
>> examples section it is set to 0.9, i.e., the documentation states
>> this:
>>
>> >>> from scipy.stats import erlang
>> >>> numargs = erlang.numargs
>> >>> [ n ] = [0.9,] * numargs
>> >>> rv = erlang(n)
>>
>> from which I infer that the shape is set to 0.9.

these are generic template numbers and could be replaced by
distribution specific docstrings
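
On how expon, erlang and gamma relate, here is a small sketch as I read
the question: with an integer shape n, erlang is gamma with the same
shape, the scale is the mean of each of the n stages, and gamma with
shape 1 is the exponential. The scale and x values are arbitrary.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.0, 12.0, 7)
scale = 2.0                       # illustrative value only

# Erlang with an integer shape n is the gamma distribution with a = n
erlang_is_gamma = np.allclose(stats.erlang.cdf(x, 3, scale=scale),
                              stats.gamma.cdf(x, 3, scale=scale))

# n stages, each with mean equal to scale, give a total mean of n * scale
erlang_mean_ok = np.isclose(stats.erlang.mean(3, scale=scale), 3 * scale)

# gamma with shape a = 1 reduces to the exponential distribution
gamma1_is_expon = np.allclose(stats.gamma.cdf(x, 1, scale=scale),
                              stats.expon.cdf(x, scale=scale))

print(erlang_is_gamma, erlang_mean_ok, gamma1_is_expon)
```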

>>
>> All in all, I don't quite know what to expect with regard to the use
>> and purpose of shape.
>>
>> Is the shape parameter explained somewhere explicitly? if not,
>> wouldn't the stats tutorial be the best place? Who is the author of
>> this doc? How can I help change it?

online editing is the easiest.

I wrote the stats tutorial a long time ago, and it contains the
description of individual distributions written by Travis.
I haven't looked at the overall documentation for the distributions in a while.

Suggestions, or, even better, direct improvements in the doc editor or
with a pull request would be very welcome.

>>
>> 2)
>>
>> Would it be a good idea to make the use of the loc and scale parameter
>> explicit in the doc strings of the distributions? I recall that, as a
>> first time user, I had no clue what they meant, and that it took some
>> struggling and searching to figure out what they came down to.
>> Besides, the doc strings are not always complete. For instance, this
>> is the string for the expon distribution:
>>
>> The probability density function for `expon` is::
>>
>>        expon.pdf(x) = exp(-x)
>>
>>    for ``x >= 0``.
>>
>>    The scale parameter is equal to ``scale = 1.0 / lambda``.
>>
>> So, what is lambda here? Is it: pdf(x) = lambda * exp(-x lambda), or
>> is it pdf(x) = exp(-x/lambda)/lambda? After some experimentation I
>> found out, but the documentation is not explicit enough in my opinion.
>> Suppose we would restate it like this:
>>
>> cdf(x) = 1. - exp( -(x-loc)/scale).
>>
>> Then I think it would be clear immediately, and also
>> interpretation-free. Likewise for other distributions.

As above,
(I will have to browse the documentation, to be able to comment on
specific items.)

I hope the description above helps for this, and we can keep going to
clear this up and improve the documentation.

>>
>> 3)
>> I am really willing to help improve stats and make the documentation
>> more consistent at points, but I don't quite know where to start.  In the
>> process I raise all these points. Is this list the best place, or
>> should I send my comments to Josef (?)?
>
>
> I'd prefer if the conversations stayed on list.
>
> FWIW, I'm really glad you are stepping up to help Josef out here. I am
> somewhat familiar with the stats code and the internals, but I still
> struggle with it at times. Anything that can be done to make this more
> user-friendly from documentation to refactoring would be very welcome.

I fully agree with Skipper.

Nicky, thanks for looking into this.

Josef

>
> Skipper