[SciPy-Dev] scipy.stats: algorithm to for ticket 1493

nicky van foreest vanforeest at gmail.com
Mon Apr 23 16:18:01 EDT 2012


I'll get back to this tomorrow evening. I promised to finish something
today. With respect to computing the sum of two random variables: this
turns out to be quite a challenge. To resolve this I decided to
formulate it as a possible assignment for students... Thus, this takes
somewhat longer than I expected. On the other hand, hopefully it leads
to some more scipy converts...

On 23 April 2012 22:04,  <josef.pktd at gmail.com> wrote:
> On Mon, Apr 23, 2012 at 2:58 PM, nicky van foreest <vanforeest at gmail.com> wrote:
>>>> for xa, xb it doesn't matter whether they are larger or smaller than
>>>>> zero, so I don't think we need a special check
>>
>> I think it does, for suppose that in the algo left = xa = 0.5 (because
>> the user has been fiddling with xa) and cdf(xa) > q. Then  setting
>> left = 2*left will only worsen the problem. Or do I miss something?
>
> True, however I don't think we have any predefined xa and xb that both
> are strictly positive or negative values.
> pareto is the only distribution bounded away from zero that I know and
> it has xa = -10
>
>>
>>>>> it looks good in a few more example cases.
>>
>> I found another small bug, please see the included code.
>
> later today
>
>>
>>>>>
>>>>> The difficult cases will be where cdf also doesn't exist and we need
>>>>> to get it through integrate.quad, but I don't remember which
>>>>> distribution is a good case.
>>
>> This case is harder indeed. (I assume you mean by 'not exist' that
>> there is no closed form expression for the cdf, like the normal
>> distribution). Computing the ppf would involve calling quad a lot of
>> times. This is wasteful especially since the computation of cdf(b)
>> includes the computation of cdf(a) for a < b, supposing that quad runs
>> from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) +
>> quad(f, a, b), assuming that cdf(a) has been computed already.
>> (perhaps I am not clear enough here. If so, let me know.)
>
> not exists = not defined as _cdf method  could also be scipy.special
> if there are no closed form expressions
>
> quad should run from dist.a to x, I guess, dist.a might be -inf
>
>>
>>>> I just think that we are not able to reach the q=0, q=1 boundaries,
>>>> since for some distributions we will run into other numerical
>>>> problems. And I'm curious how far we can get with this.
>>
>> I completely missed to include a test on the obvious cases q >= 1. -
>> np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the
>> attached file.
>
>>>> findppf(stats.expon, 1e-30)
> -6.3593574850511882e-13
>
> lower bound q can be small and won't run into problems with being 0,
> until 1e-300?
>
> The right answer should be dist.b for q=numerically 1, lower support
> point is dist.a but I don't see when we would need it.
>
>>
>>>> Similarly, I don't know whether the default xa and xb are good. I
>>>> changed them for a few distributions, but only where I saw obvious
>>>> improvements.
>>
>> I also have no clue what would be good values in general. The choices
>> seems reasonable from a practical point of view...
>>
>>>>> Note: I removed the scale in your example, because internal _ppf works
>>>>> on the standard distribution, loc=0, scale=1. loc and scale are added
>>>>> generically in .ppf
>>
>> Thanks. I included also **kwds so that I can pass scale = 10 or
>> something like this. Once all works as it should, I'll try to convert
>> the code such that it fits nicely in distributions.py.
>
> with self instead of dist, it should already have the signature about
> right, no **kwds I assume
>
>>
>> The simultaneous updating of left and right in the previous algo is
>> wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both
>> left and right would `move to the left'. This is clearly wrong. The
>> included code should be better.
>
> would move to the *right* ?
>
> I thought the original was a nice trick, we can shift both left and
> right since we know it has to be in that direction, the cut of range
> cannot contain the answer.
>
> Or do I miss the point?
>
>>
>> With regard to the values of xb and xa. Can a `ordinary' user change
>> these? If so, then the ppf finder should include some protection in my
>> opinion. If not, the user will get an error that brentq has not the
>> right limits, but this error might be somewhat unexpected. (What has
>> brentq to do with finding the ppf?) Of course, looking at the code
>> this is clear, but I expect most users will no do so.
>
> I don't think  `ordinary' users should touch xa, xb, but they could.
> Except for getting around the limitation in this ticket there is no
> reason to change xa, xb, so we could make them private _xa, _xb
> instead.
>
>
>>
>> The code contains two choices about how to handle xa and xb. Do you
>> have any preference?
>
> I don't really like choice 1, because it removes the use of the
> predefined xa, xb. On the other hand, with this extension, xa and xb
> wouldn't be really necessary anymore.
>
> another possibility would be to try except brentq with xa, xb first
> and get most cases, and switch to your version if needed. I'm not sure
> xa, xb are defined well enough that it's worth to go this route,
> though.
>
> Thanks for working on this,
>
> Josef
>
>>
>> Thanks for your feedback. Very helpful.
>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev



More information about the SciPy-Dev mailing list