[SciPy-Dev] scipy.stats: some questions/points about distributions.py + reply on ticket 1493

Wed Apr 25 23:54:58 EDT 2012

On Wed, Apr 25, 2012 at 4:03 PM, nicky van foreest <vanforeest at gmail.com> wrote:
>>> 1:
>>>
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L436
>>
>> I never looked at this. It's not used anywhere.
>>
>>>
>>> Is this code "dead"? Within distributions.py it is not called. Nearly
>>> the same code is written here:
>>>
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L1180
>>
>> This is what is used for the generic ppf.
>
> Yes, sure. Sorry for confusing you. L1180 makes good sense. But since
> L1180 is there, there appears to be no good reason to include the code
> at L436.

I guess not, but because I never looked at it carefully, I don't know
if it might be useful for anything.

>
>
>>> 2:
>>>
>>> I have a similar point about:
>>>
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L358
>>>
>>> What is the use of this code? It is not called anywhere. Besides this,
>>> from our  discussion about ticket 1493, this function returns the
>>> centralized moments, while the "real" moment E(X^n) should be
>>> returned. Hence, the code is also not correct, i.e., not in line with
>>> the documentation.
>>
>> I think this and skew, kurtosis are internal functions for fit_start,
>> getting starting values for fit from the data, even if it's not used.
>> in general: For the calculations it might sometimes be nicer to
>> calculate central moments, and then convert them to non-central or the
>> other way around. I have some helper functions for this in statsmodels
>> and it is similarly used
>>
>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L1745
>>
>> (That's new code that I'm not so familiar with.)
>
> I actually saw this code, and have my doubts about whether this is the
> best way to compute the non-central moments. Suppose that the
> computation of the central moment involves quad(). Then indeed the
> computations at these lines don't require a new call to quad().
> However, there is a (slow) python for loop involved, the power
> function ** is called multiple times, and { n \choose k}  is computed.
> (BTW, can I safely assume you use Latex?). Calling quad() on x**k to
> compute E(X^k) might be just a fast, although I did not test this
> hunch. Anyway quad( lamdba x: x**k *_pdf(x)) reads much easier.

L1745 are necessary now because moments are still non-central, but
allow for non default loc and scale.
_munp still does the raw quad( lamdba x: x**k *_pdf(x)) or similar if
it is not specifically defined, i.e. can be calculated from explicit
_stats.

as aside: whenever I try to go through generic _stats and moments I
have a hard time to follow in what is going on in all the different
cases

I don't see where we could save much. I didn't go through the math.

>
>>
>>>
>>> 3:
>>>
>>> Suppose we would turn xa and xb into private atrributes _xa and _xb,
>>> then i suppose that
>>>
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L883
>>>
>>> requires updating.
>>
>> Yes, but no big loss I think,  given that it won't be needed anymore
>
> Oops. Your other mail convinced to do use _xa and _xb.... See my other mail.

not needed as public method, See the other thread
which would also need adjustments to the docstring

>
>>> 5:
>>>
>>> The definition of arr in
>>>
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L60
>>>
>>> does not add much (although it saves some characters at some points of
>>> the code), but makes it harder to read the code for novices like me.
>>> (I spent some time searching for a numpy function called arr, only to
>>> find out later that it was just a shorthand only used in the
>>> distribution.py module). Would it be a problem to replace such code by
>>> the proper numpy function?
>>
>> But then these novices would just read some piece code instead of
>> going through all 7000 lines looking for imports and redefinitions.
>> And I suffered the same way. :)
>
> I suppose you did :-)
>
>>
>> I don't have any problem with cleaning this up. I never checked if in
>> some cases with lot's of generic loops the namespace lookup would
>> significantly increase the runtime.
>
> Should it? I am not an expert on this, but I read in Langtangen's book
> that importing functions like so: from numpy import array, and so on,
> does not add much to the calling time of functions. However, if I am
> mistaken, please forget this point.

from numpy import asarray
or
from numpy import asarray as arr
doesn't make any difference, but I usually like full namespaces
np.asarray

>
>>
>>>
>>> 6:
>>>
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L538
>>>
>>> contains a typo. It should be Weisstein.
>>
>> should be fixed then
>
> Should this become a ticket, or is it too minor?

too minor for a ticket, but sneaking it into a pull request, or making
a separate pull request (to increase your Karma) would be useful

>
>>
>>>
>>> 7:
>>>
>>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625
>>>
>>>
>>> This code gives me even a harder time than _argsreduce. I have to
>>> admit that I simply don't know what this code is trying to
>>> prevent/check/repair. Would you mind giving a hint?
>>
>> whats _argsreduce?
>
> Sorry, I meant __argcheck(). The code at
>
> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L1195
>
> is not very simple to understand, at least not for me.

AFAICS, It's just a joint condition that all args are strictly
positive ( the default condition)
args[0] > 0 & args[1] > 0 & ....

my guess is the arr, asarray, is not necessary, and should be handled
already in the main methods.

Josef

>
>>
>> https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L625
>>
>> This has been rewritten by Per Brodtkorb.
>> It is used in most methods to get the goodargs with which the
>> distribution specific method is called.
>>
>> example ppf https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L1524
>>
>> first we are building the conditions for valid, good arguments.
>> boundaries are filled, invalid arguments get nans.
>> What's left over are the goodargs, the values of the method arguments
>> for which we need to calculate the actual results.
>> So we need to broadcast and select those arguments. -> argsreduce
>> The distribution specific or generic ._ppf is then called with 1d
>> arrays (of the same shape IIRC) of goodargs.
>>
>> then we can "place" the calculated values into the results arrays,
>> next to the nans and boundaries.
>>
>> I hope that helps
>
> I'll try to understand it again.
>
> Thanks for your hints.
>
>>
>> Thanks,
>>
>> Josef
>>
>>>
>>> Nicky
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev