[SciPy-Dev] scipy.stats: algorithm to for ticket 1493

Wed Apr 25 15:21:21 EDT 2012

>>>>> The difficult cases will be where cdf also doesn't exist and we need
>>>>> to get it through integrate.quad, but I don't remember which
>>>>> distribution is a good case.
>>
>> This case is harder indeed. (I assume you mean by 'not exist' that
>> there is no closed form expression for the cdf, like the normal
>> distribution). Computing the ppf would involve calling quad a lot of
>> times. This is wasteful especially since the computation of cdf(b)
>> includes the computation of cdf(a) for a < b, supposing that quad runs
>> from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) +
>> quad(f, a, b), assuming that cdf(a) has been computed already.
>> (perhaps I am not clear enough here. If so, let me know.)
>
> not exists = not defined as _cdf method  could also be scipy.special
> if there are no closed form expressions

I see, sure.

>>>> I just think that we are not able to reach the q=0, q=1 boundaries,
>>>> since for some distributions we will run into other numerical
>>>> problems. And I'm curious how far we can get with this.
>>
>> I completely missed to include a test on the obvious cases q >= 1. -
>> np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the
>> attached file.
>
>>>> findppf(stats.expon, 1e-30)
> -6.3593574850511882e-13

This result shows actually that xa and xb are necessary to include in
the specification of the distribution. The exponential distribution is
(usually) defined only on [0, \infty) not on the negative numbers. The
result above is negative though. This is of course a simple
consequence of calling brentq. From a user's perspective, though, I
would become very suspicious about this negative result.

> The right answer should be dist.b for q=numerically 1, lower support
> point is dist.a but I don't see when we would need it.

I agree, provided xa and xb are always properly defined. But then,
(just to be nitpicking), the definition of expon does not set xa and
xb explicitly. Hence xa = -10, and this is somewhat undesirable, given
the negative value above.

>>
>> The simultaneous updating of left and right in the previous algo is
>> wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both
>> left and right would `move to the left'. This is clearly wrong. The
>> included code should be better.
>
> would move to the *right* ?

Sure.

>
> I thought the original was a nice trick, we can shift both left and
> right since we know it has to be in that direction, the cut of range
> cannot contain the answer.
>
> Or do I miss the point?

No, you are right. When I wrote this at first, I also thought about
the point you bring up here. Then, I was somewhat dissatisfied with
calling the while loop twice (suppose the left bound requires
updating, then certainly the second while loop (to update the right
bound) is unnecessary, and calling cdf(right) is useless). While
trying to fix this, I forgot about my initial ideas...

>
>>
>> With regard to the values of xb and xa. Can a `ordinary' user change
>> these? If so, then the ppf finder should include some protection in my
>> opinion. If not, the user will get an error that brentq has not the
>> right limits, but this error might be somewhat unexpected. (What has
>> brentq to do with finding the ppf?) Of course, looking at the code
>> this is clear, but I expect most users will no do so.
>
> I don't think  `ordinary' users should touch xa, xb, but they could.
> Except for getting around the limitation in this ticket there is no
> reason to change xa, xb, so we could make them private _xa, _xb
> instead.

I think that would be better. Thus, the developer that subclasses
rv_continuous should set _xa and _xb properly.

>> The code contains two choices about how to handle xa and xb. Do you
>> have any preference?
>
> I don't really like choice 1, because it removes the use of the
> predefined xa, xb. On the other hand, with this extension, xa and xb
> wouldn't be really necessary anymore.

In view of your example with findppf(expon(1e-30)) I prefer to use _xa and _xb.

>
> another possibility would be to try except brentq with xa, xb first
> and get most cases, and switch to your version if needed. I'm not sure
> xa, xb are defined well enough that it's worth to go this route,
> though.

I think that this makes the most sense. The definition of the class
should include sensible values of xa and xb.

All in all, I would like to make the following proposal to resolve the
ticket in a generic way.

1) xa and xb should become private class members _xa and _xb
2) _xa and _xb should be given proper values in the class definition,
e.g. expon._xa = 0 and expon._xb = 30., since exp(-30) = 9.35e-14.
3) given a quantile q in the ppf function, include a test on _cdf(_xa)
<= q <= _cdf(_xb). If this fails, return an exception with some text
that tells that either _cdf(_xa) > q or _cdf(_xb) < q.

Given your comments I actually favor all this searching for left and
right not that much anymore. It is generic, but it places the
responsibility of good code at the wrong place.

Nicky