[SciPy-Dev] scipy.stats: algorithm to for ticket 1493

Sun May 13 16:34:15 EDT 2012

Hi Josef,

Some time ago we discussed an algorithm to find x such that cdf(x) = q
for given q, that is, a generic algorithm to solve the ppf. In the
mail below you propose to set xa and xb to good initial values, but
this is not a simple task. Besides this, xa and xb are solely used in
the ppf compution. This made me think about an algorithm that avoid
the use of xa and xb altogether. I came up with the algorithm below.
The included test cases show that it works really well. What is your
opinion?

In case motivation is required, I shift left and right such that
eventually cdf(left) < q < cdf(right). I update by a factor of 10. I
rather don't spend a lot of time on the while loop, but prefer to
leave the actual solving to brentq. This algo homes in very fast, so
it is better to rely on brentq to do the fast searching. Hence, I
prefer to take 10 rather than 2 or 3 at a growth factor. Just for fun
I tried 1000 as a factor. This works also well.

Do you perhaps have more challenging cases?  Once we have a set of
tests, I'll try to set up a performance test.

Once we are satisfied I'll make a pull request.

I attached the code. BTW. I suppose that as long as we are
experimenting with algorithms/code there is no reason to make a pull
request.

Nicky

On 26 April 2012 05:15,  <josef.pktd at gmail.com> wrote:
> On Wed, Apr 25, 2012 at 3:49 PM,  <josef.pktd at gmail.com> wrote:
>> On Wed, Apr 25, 2012 at 3:21 PM, nicky van foreest <vanforeest at gmail.com> wrote:
>>>>>>>> The difficult cases will be where cdf also doesn't exist and we need
>>>>>>>> to get it through integrate.quad, but I don't remember which
>>>>>>>> distribution is a good case.
>>>>>
>>>>> This case is harder indeed. (I assume you mean by 'not exist' that
>>>>> there is no closed form expression for the cdf, like the normal
>>>>> distribution). Computing the ppf would involve calling quad a lot of
>>>>> times. This is wasteful especially since the computation of cdf(b)
>>>>> includes the computation of cdf(a) for a < b, supposing that quad runs
>>>>> from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) +
>>>>> quad(f, a, b), assuming that cdf(a) has been computed already.
>>>>> (perhaps I am not clear enough here. If so, let me know.)
>>>>
>>>> not exists = not defined as _cdf method  could also be scipy.special
>>>> if there are no closed form expressions
>>>
>>> I see, sure.
>>>
>>>>>>> I just think that we are not able to reach the q=0, q=1 boundaries,
>>>>>>> since for some distributions we will run into other numerical
>>>>>>> problems. And I'm curious how far we can get with this.
>>>>>
>>>>> I completely missed to include a test on the obvious cases q >= 1. -
>>>>> np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the
>>>>> attached file.
>>>>
>>>>>>> findppf(stats.expon, 1e-30)
>>>> -6.3593574850511882e-13
>>>
>>> This result shows actually that xa and xb are necessary to include in
>>> the specification of the distribution. The exponential distribution is
>>> (usually) defined only on [0, \infty) not on the negative numbers. The
>>> result above is negative though. This is of course a simple
>>> consequence of calling brentq. From a user's perspective, though, I
>>> would become very suspicious about this negative result.
>>
>> good argument to clean up xa, xb
>>
>>>
>>>> The right answer should be dist.b for q=numerically 1, lower support
>>>> point is dist.a but I don't see when we would need it.
>>>
>>> I agree, provided xa and xb are always properly defined. But then,
>>> (just to be nitpicking), the definition of expon does not set xa and
>>> xb explicitly. Hence xa = -10, and this is somewhat undesirable, given
>>> the negative value above.
>>>
>>>>>
>>>>> The simultaneous updating of left and right in the previous algo is
>>>>> wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both
>>>>> left and right would `move to the left'. This is clearly wrong. The
>>>>> included code should be better.
>>>>
>>>> would move to the *right* ?
>>>
>>> Sure.
>>>
>>>>
>>>> I thought the original was a nice trick, we can shift both left and
>>>> right since we know it has to be in that direction, the cut of range
>>>> cannot contain the answer.
>>>>
>>>> Or do I miss the point?
>>>
>>> No, you are right. When I wrote this at first, I also thought about
>>> the point you bring up here. Then, I was somewhat dissatisfied with
>>> calling the while loop twice (suppose the left bound requires
>>> updating, then certainly the second while loop (to update the right
>>> bound) is unnecessary, and calling cdf(right) is useless). While
>>> trying to fix this, I forgot about my initial ideas...
>>>
>>>>
>>>>>
>>>>> With regard to the values of xb and xa. Can a `ordinary' user change
>>>>> these? If so, then the ppf finder should include some protection in my
>>>>> opinion. If not, the user will get an error that brentq has not the
>>>>> right limits, but this error might be somewhat unexpected. (What has
>>>>> brentq to do with finding the ppf?) Of course, looking at the code
>>>>> this is clear, but I expect most users will no do so.
>>>>
>>>> I don't think  `ordinary' users should touch xa, xb, but they could.
>>>> Except for getting around the limitation in this ticket there is no
>>>> reason to change xa, xb, so we could make them private _xa, _xb
>>>> instead.
>>>
>>> I think that would be better. Thus, the developer that subclasses
>>> rv_continuous should set _xa and _xb properly.
>>>
>>>>> The code contains two choices about how to handle xa and xb. Do you
>>>>> have any preference?
>>>>
>>>> I don't really like choice 1, because it removes the use of the
>>>> predefined xa, xb. On the other hand, with this extension, xa and xb
>>>> wouldn't be really necessary anymore.
>>>
>>> In view of your example with findppf(expon(1e-30)) I prefer to use _xa and _xb.
>>>
>>>>
>>>> another possibility would be to try except brentq with xa, xb first
>>>> and get most cases, and switch to your version if needed. I'm not sure
>>>> xa, xb are defined well enough that it's worth to go this route,
>>>> though.
>>>
>>> I think that this makes the most sense. The definition of the class
>>> should include sensible values of xa and xb.
>>>
>>> All in all, I would like to make the following proposal to resolve the
>>> ticket in a generic way.
>>>
>>> 1) xa and xb should become private class members _xa and _xb
>>> 2) _xa and _xb should be given proper values in the class definition,
>>> e.g. expon._xa = 0 and expon._xb = 30., since exp(-30) = 9.35e-14.
>>> 3) given a quantile q in the ppf function, include a test on _cdf(_xa)
>>> <= q <= _cdf(_xb). If this fails, return an exception with some text
>>> that tells that either _cdf(_xa) > q or _cdf(_xb) < q.
>>>
>>> Given your comments I actually favor all this searching for left and
>>> right not that much anymore. It is generic, but it places the
>>> responsibility of good code at the wrong place.
>>
>> 3) I prefer your expanding the search to raising an exception to the
>> user. Note also that your 3) is inconsistent with 1). If a user
>> visible exception is raised, then the user needs to change xa or xb,
>> so it shouldn't be private. That's the current situation (except for a
>> more cryptic message).
>>
>> 2) I'm all in favor, especially for one-side bound distributions,
>> where it should be easy to go through those. There might be a few
>> where the bound moves with the shape, but the only one I remember is
>> genextreme and that has an explicit _ppf
>>
>> So I would prefer 1), 2) and your new enhanced generic _ppf
>
> forgot to mention
>
> the main reason that I like your expanding search space is that the
> shape of the distribution can change a lot. Even if we set xa, xb to
> reasonable values for likely shape parameters they won't be good
> enough for others, as in the original ticket
>
>>>> stats.invgauss.stats(2)
> (array(2.0), array(8.0))
>>>> stats.invgauss.stats(7)
> (array(7.0), array(343.0))
>>>> stats.invgauss.stats(20)
> (array(20.0), array(8000.0))
>>>> stats.invgauss.stats(100)
> (array(100.0), array(1000000.0))
>>>> stats.invgauss.cdf(1000, 100)
> 0.98335562794321207
>
>>>> findppf(stats.invgauss, 0.99, 100)
> 1926.520850319389
>>>> findppf(stats.invgauss, 0.999, 100)
> 13928.012903371644
>>>> findppf(stats.invgauss, 0.999, 1)
> 8.3548649291400938
> ---------
>
> to get a rough idea:
> for xa, xb and a finite bound either left or right, all have generic
> xa=-10 or xb=10
>
>>>> dist_cont = [getattr(stats.distributions, dname)  for dname in dir(stats.distributions) if isinstance(getattr(stats.distributions, dname), stats.distributions.rv_continuous)]
>
>>>> left = [(d.name, d.a, d.xa) for d in dist_cont if not np.isneginf(d.a)]
>>>> pprint(left)
> [('alpha', 0.0, -10.0),
>  ('anglit', -0.78539816339744828, -10.0),
>  ('arcsine', 0.0, -10.0),
>  ('beta', 0.0, -10.0),
>  ('betaprime', 0.0, -10.0),
>  ('bradford', 0.0, -10.0),
>  ('burr', 0.0, -10.0),
>  ('chi', 0.0, -10.0),
>  ('chi2', 0.0, -10.0),
>  ('cosine', -3.1415926535897931, -10.0),
>  ('erlang', 0.0, -10.0),
>  ('expon', 0.0, 0),
>  ('exponpow', 0.0, -10.0),
>  ('exponweib', 0.0, -10.0),
>  ('f', 0.0, -10.0),
>  ('fatiguelife', 0.0, -10.0),
>  ('fisk', 0.0, -10.0),
>  ('foldcauchy', 0.0, -10.0),
>  ('foldnorm', 0.0, -10.0),
>  ('frechet_r', 0.0, -10.0),
>  ('gamma', 0.0, -10.0),
>  ('gausshyper', 0.0, -10.0),
>  ('genexpon', 0.0, -10.0),
>  ('gengamma', 0.0, -10.0),
>  ('genhalflogistic', 0.0, -10.0),
>  ('genpareto', 0.0, -10.0),
>  ('gilbrat', 0.0, -10.0),
>  ('gompertz', 0.0, -10.0),
>  ('halfcauchy', 0.0, -10.0),
>  ('halflogistic', 0.0, -10.0),
>  ('halfnorm', 0.0, -10.0),
>  ('invgamma', 0.0, -10.0),
>  ('invgauss', 0.0, -10.0),
>  ('invnorm', 0.0, -10.0),
>  ('invweibull', 0, -10.0),
>  ('johnsonb', 0.0, -10.0),
>  ('ksone', 0.0, -10.0),
>  ('kstwobign', 0.0, -10.0),
>  ('levy', 0.0, -10.0),
>  ('loglaplace', 0.0, -10.0),
>  ('lognorm', 0.0, -10.0),
>  ('lomax', 0.0, -10.0),
>  ('maxwell', 0.0, -10.0),
>  ('mielke', 0.0, -10.0),
>  ('nakagami', 0.0, -10.0),
>  ('ncf', 0.0, -10.0),
>  ('ncx2', 0.0, -10.0),
>  ('pareto', 1.0, -10.0),
>  ('powerlaw', 0.0, -10.0),
>  ('powerlognorm', 0.0, -10.0),
>  ('rayleigh', 0.0, -10.0),
>  ('rdist', -1.0, -10.0),
>  ('recipinvgauss', 0.0, -10.0),
>  ('rice', 0.0, -10.0),
>  ('semicircular', -1.0, -10.0),
>  ('triang', 0.0, -10.0),
>  ('truncexpon', 0.0, -10.0),
>  ('uniform', 0.0, -10.0),
>  ('wald', 0.0, -10.0),
>  ('weibull_min', 0.0, -10.0),
>  ('wrapcauchy', 0.0, -10.0)]
>
>>>> right = [(d.name, d.b, d.xb) for d in dist_cont if not np.isposinf(d.b)]
>>>> pprint(right)
> [('anglit', 0.78539816339744828, 10.0),
>  ('arcsine', 1.0, 10.0),
>  ('beta', 1.0, 10.0),
>  ('betaprime', 500.0, 10.0),
>  ('bradford', 1.0, 10.0),
>  ('cosine', 3.1415926535897931, 10.0),
>  ('frechet_l', 0.0, 10.0),
>  ('gausshyper', 1.0, 10.0),
>  ('johnsonb', 1.0, 10.0),
>  ('levy_l', 0.0, 10.0),
>  ('powerlaw', 1.0, 10.0),
>  ('rdist', 1.0, 10.0),
>  ('semicircular', 1.0, 10.0),
>  ('triang', 1.0, 10.0),
>  ('uniform', 1.0, 10.0),
>  ('weibull_max', 0.0, 10.0),
>  ('wrapcauchy', 6.2831853071795862, 10.0)]
>
> only pareto has both limits on the same side of zero
>
>>>> pprint ([(d.name, d.a, d.b) for d in dist_cont if d.a*d.b>0])
> [('pareto', 1.0, inf)]
>
>
> genextreme, and maybe one or two others, are missing because finite a,
> b are set in _argcheck
> vonmises is for circular and doesn't behave properly
>
> only two distributions define non-generic xa or xb
>
>>>> pprint ([(d.name, d.a, d.b, d.xa, d.xb) for d in dist_cont if not d.xa*d.xb==-100])
> [('foldcauchy', 0.0, inf, -10.0, 1000), ('recipinvgauss', 0.0, inf, -10.0, 50)]
>
> a pull request setting correct xa, xb would be very welcome
>
> Josef
>
>
>>
>> Josef
>>
>>>
>>> Nicky
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: findppf2.py
Type: application/octet-stream
Size: 1199 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20120513/c8a88f3b/attachment.obj>