[SciPy-Dev] scipy.stats: algorithm to for ticket 1493

Sun May 13 17:57:21 EDT 2012

On Sun, May 13, 2012 at 4:34 PM, nicky van foreest <vanforeest at gmail.com> wrote:
> Hi Josef,
>
> Some time ago we discussed an algorithm to find x such that cdf(x) = q
> for given q, that is, a generic algorithm to solve the ppf. In the
> mail below you propose to set xa and xb to good initial values, but
> this is not a simple task. Besides this, xa and xb are solely used in
> the ppf compution. This made me think about an algorithm that avoid
> the use of xa and xb altogether. I came up with the algorithm below.
> The included test cases show that it works really well. What is your
> opinion?
>
> In case motivation is required, I shift left and right such that
> eventually cdf(left) < q < cdf(right). I update by a factor of 10. I
> rather don't spend a lot of time on the while loop, but prefer to
> leave the actual solving to brentq. This algo homes in very fast, so
> it is better to rely on brentq to do the fast searching. Hence, I
> prefer to take 10 rather than 2 or 3 at a growth factor. Just for fun
> I tried 1000 as a factor. This works also well.
>
> Do you perhaps have more challenging cases?  Once we have a set of
> tests, I'll try to set up a performance test.

brentq seems to work very well

one case I was worried about, but works well:

>>> p,r = optimize.brentq(lambda x: np.minimum(1, np.maximum(x,0)) -1e-30, -1000, 1000, full_output=1)
>>> p
-1.2079226507921669e-12
>>> r.iterations
53
>>> r.function_calls
54
>>> p,r = optimize.brentq(lambda x: np.minimum(1, np.maximum(x,0)) -1e-30, -100, 100, full_output=1)
>>> p
-9.0949470177290643e-13
>>> r.iterations
50
>>> r.function_calls
51
>>> p,r = optimize.brentq(lambda x: np.minimum(1, np.maximum(x,0)) -1e-30, -1000, 10, full_output=1)
>>> r.iterations
71
>>> r.function_calls
72

aside: if q = 0 or q=1, then brentq will find something outside the
support, but this should be handled by the ppf generic code

class Fake(object):
    #actually uniform
    def cdf(wlf, x):
        return np.minimum(1, np.maximum(1000+x,0))

q = 1e-6
d = Fake()
sol =  findppf(d, q)
print sol, q, d.cdf(sol)

-999.999999 1e-06 9.99999997475e-07

Right now I cannot think of another case that would be difficult.

I don't see anything yet to criticize in your latest version :(

Josef

>
> Once we are satisfied I'll make a pull request.
>
> I attached the code. BTW. I suppose that as long as we are
> experimenting with algorithms/code there is no reason to make a pull
> request.
>
> Nicky
>
>
>
> On 26 April 2012 05:15,  <josef.pktd at gmail.com> wrote:
>> On Wed, Apr 25, 2012 at 3:49 PM,  <josef.pktd at gmail.com> wrote:
>>> On Wed, Apr 25, 2012 at 3:21 PM, nicky van foreest <vanforeest at gmail.com> wrote:
>>>>>>>>> The difficult cases will be where cdf also doesn't exist and we need
>>>>>>>>> to get it through integrate.quad, but I don't remember which
>>>>>>>>> distribution is a good case.
>>>>>>
>>>>>> This case is harder indeed. (I assume you mean by 'not exist' that
>>>>>> there is no closed form expression for the cdf, like the normal
>>>>>> distribution). Computing the ppf would involve calling quad a lot of
>>>>>> times. This is wasteful especially since the computation of cdf(b)
>>>>>> includes the computation of cdf(a) for a < b, supposing that quad runs
>>>>>> from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) +
>>>>>> quad(f, a, b), assuming that cdf(a) has been computed already.
>>>>>> (perhaps I am not clear enough here. If so, let me know.)
>>>>>
>>>>> not exists = not defined as _cdf method  could also be scipy.special
>>>>> if there are no closed form expressions
>>>>
>>>> I see, sure.
>>>>
>>>>>>>> I just think that we are not able to reach the q=0, q=1 boundaries,
>>>>>>>> since for some distributions we will run into other numerical
>>>>>>>> problems. And I'm curious how far we can get with this.
>>>>>>
>>>>>> I completely missed to include a test on the obvious cases q >= 1. -
>>>>>> np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the
>>>>>> attached file.
>>>>>
>>>>>>>> findppf(stats.expon, 1e-30)
>>>>> -6.3593574850511882e-13
>>>>
>>>> This result shows actually that xa and xb are necessary to include in
>>>> the specification of the distribution. The exponential distribution is
>>>> (usually) defined only on [0, \infty) not on the negative numbers. The
>>>> result above is negative though. This is of course a simple
>>>> consequence of calling brentq. From a user's perspective, though, I
>>>> would become very suspicious about this negative result.
>>>
>>> good argument to clean up xa, xb
>>>
>>>>
>>>>> The right answer should be dist.b for q=numerically 1, lower support
>>>>> point is dist.a but I don't see when we would need it.
>>>>
>>>> I agree, provided xa and xb are always properly defined. But then,
>>>> (just to be nitpicking), the definition of expon does not set xa and
>>>> xb explicitly. Hence xa = -10, and this is somewhat undesirable, given
>>>> the negative value above.
>>>>
>>>>>>
>>>>>> The simultaneous updating of left and right in the previous algo is
>>>>>> wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both
>>>>>> left and right would `move to the left'. This is clearly wrong. The
>>>>>> included code should be better.
>>>>>
>>>>> would move to the *right* ?
>>>>
>>>> Sure.
>>>>
>>>>>
>>>>> I thought the original was a nice trick, we can shift both left and
>>>>> right since we know it has to be in that direction, the cut of range
>>>>> cannot contain the answer.
>>>>>
>>>>> Or do I miss the point?
>>>>
>>>> No, you are right. When I wrote this at first, I also thought about
>>>> the point you bring up here. Then, I was somewhat dissatisfied with
>>>> calling the while loop twice (suppose the left bound requires
>>>> updating, then certainly the second while loop (to update the right
>>>> bound) is unnecessary, and calling cdf(right) is useless). While
>>>> trying to fix this, I forgot about my initial ideas...
>>>>
>>>>>
>>>>>>
>>>>>> With regard to the values of xb and xa. Can a `ordinary' user change
>>>>>> these? If so, then the ppf finder should include some protection in my
>>>>>> opinion. If not, the user will get an error that brentq has not the
>>>>>> right limits, but this error might be somewhat unexpected. (What has
>>>>>> brentq to do with finding the ppf?) Of course, looking at the code
>>>>>> this is clear, but I expect most users will no do so.
>>>>>
>>>>> I don't think  `ordinary' users should touch xa, xb, but they could.
>>>>> Except for getting around the limitation in this ticket there is no
>>>>> reason to change xa, xb, so we could make them private _xa, _xb
>>>>> instead.
>>>>
>>>> I think that would be better. Thus, the developer that subclasses
>>>> rv_continuous should set _xa and _xb properly.
>>>>
>>>>>> The code contains two choices about how to handle xa and xb. Do you
>>>>>> have any preference?
>>>>>
>>>>> I don't really like choice 1, because it removes the use of the
>>>>> predefined xa, xb. On the other hand, with this extension, xa and xb
>>>>> wouldn't be really necessary anymore.
>>>>
>>>> In view of your example with findppf(expon(1e-30)) I prefer to use _xa and _xb.
>>>>
>>>>>
>>>>> another possibility would be to try except brentq with xa, xb first
>>>>> and get most cases, and switch to your version if needed. I'm not sure
>>>>> xa, xb are defined well enough that it's worth to go this route,
>>>>> though.
>>>>
>>>> I think that this makes the most sense. The definition of the class
>>>> should include sensible values of xa and xb.
>>>>
>>>> All in all, I would like to make the following proposal to resolve the
>>>> ticket in a generic way.
>>>>
>>>> 1) xa and xb should become private class members _xa and _xb
>>>> 2) _xa and _xb should be given proper values in the class definition,
>>>> e.g. expon._xa = 0 and expon._xb = 30., since exp(-30) = 9.35e-14.
>>>> 3) given a quantile q in the ppf function, include a test on _cdf(_xa)
>>>> <= q <= _cdf(_xb). If this fails, return an exception with some text
>>>> that tells that either _cdf(_xa) > q or _cdf(_xb) < q.
>>>>
>>>> Given your comments I actually favor all this searching for left and
>>>> right not that much anymore. It is generic, but it places the
>>>> responsibility of good code at the wrong place.
>>>
>>> 3) I prefer your expanding the search to raising an exception to the
>>> user. Note also that your 3) is inconsistent with 1). If a user
>>> visible exception is raised, then the user needs to change xa or xb,
>>> so it shouldn't be private. That's the current situation (except for a
>>> more cryptic message).
>>>
>>> 2) I'm all in favor, especially for one-side bound distributions,
>>> where it should be easy to go through those. There might be a few
>>> where the bound moves with the shape, but the only one I remember is
>>> genextreme and that has an explicit _ppf
>>>
>>> So I would prefer 1), 2) and your new enhanced generic _ppf
>>
>> forgot to mention
>>
>> the main reason that I like your expanding search space is that the
>> shape of the distribution can change a lot. Even if we set xa, xb to
>> reasonable values for likely shape parameters they won't be good
>> enough for others, as in the original ticket
>>
>>>>> stats.invgauss.stats(2)
>> (array(2.0), array(8.0))
>>>>> stats.invgauss.stats(7)
>> (array(7.0), array(343.0))
>>>>> stats.invgauss.stats(20)
>> (array(20.0), array(8000.0))
>>>>> stats.invgauss.stats(100)
>> (array(100.0), array(1000000.0))
>>>>> stats.invgauss.cdf(1000, 100)
>> 0.98335562794321207
>>
>>>>> findppf(stats.invgauss, 0.99, 100)
>> 1926.520850319389
>>>>> findppf(stats.invgauss, 0.999, 100)
>> 13928.012903371644
>>>>> findppf(stats.invgauss, 0.999, 1)
>> 8.3548649291400938
>> ---------
>>
>> to get a rough idea:
>> for xa, xb and a finite bound either left or right, all have generic
>> xa=-10 or xb=10
>>
>>>>> dist_cont = [getattr(stats.distributions, dname)  for dname in dir(stats.distributions) if isinstance(getattr(stats.distributions, dname), stats.distributions.rv_continuous)]
>>
>>>>> left = [(d.name, d.a, d.xa) for d in dist_cont if not np.isneginf(d.a)]
>>>>> pprint(left)
>> [('alpha', 0.0, -10.0),
>>  ('anglit', -0.78539816339744828, -10.0),
>>  ('arcsine', 0.0, -10.0),
>>  ('beta', 0.0, -10.0),
>>  ('betaprime', 0.0, -10.0),
>>  ('bradford', 0.0, -10.0),
>>  ('burr', 0.0, -10.0),
>>  ('chi', 0.0, -10.0),
>>  ('chi2', 0.0, -10.0),
>>  ('cosine', -3.1415926535897931, -10.0),
>>  ('erlang', 0.0, -10.0),
>>  ('expon', 0.0, 0),
>>  ('exponpow', 0.0, -10.0),
>>  ('exponweib', 0.0, -10.0),
>>  ('f', 0.0, -10.0),
>>  ('fatiguelife', 0.0, -10.0),
>>  ('fisk', 0.0, -10.0),
>>  ('foldcauchy', 0.0, -10.0),
>>  ('foldnorm', 0.0, -10.0),
>>  ('frechet_r', 0.0, -10.0),
>>  ('gamma', 0.0, -10.0),
>>  ('gausshyper', 0.0, -10.0),
>>  ('genexpon', 0.0, -10.0),
>>  ('gengamma', 0.0, -10.0),
>>  ('genhalflogistic', 0.0, -10.0),
>>  ('genpareto', 0.0, -10.0),
>>  ('gilbrat', 0.0, -10.0),
>>  ('gompertz', 0.0, -10.0),
>>  ('halfcauchy', 0.0, -10.0),
>>  ('halflogistic', 0.0, -10.0),
>>  ('halfnorm', 0.0, -10.0),
>>  ('invgamma', 0.0, -10.0),
>>  ('invgauss', 0.0, -10.0),
>>  ('invnorm', 0.0, -10.0),
>>  ('invweibull', 0, -10.0),
>>  ('johnsonb', 0.0, -10.0),
>>  ('ksone', 0.0, -10.0),
>>  ('kstwobign', 0.0, -10.0),
>>  ('levy', 0.0, -10.0),
>>  ('loglaplace', 0.0, -10.0),
>>  ('lognorm', 0.0, -10.0),
>>  ('lomax', 0.0, -10.0),
>>  ('maxwell', 0.0, -10.0),
>>  ('mielke', 0.0, -10.0),
>>  ('nakagami', 0.0, -10.0),
>>  ('ncf', 0.0, -10.0),
>>  ('ncx2', 0.0, -10.0),
>>  ('pareto', 1.0, -10.0),
>>  ('powerlaw', 0.0, -10.0),
>>  ('powerlognorm', 0.0, -10.0),
>>  ('rayleigh', 0.0, -10.0),
>>  ('rdist', -1.0, -10.0),
>>  ('recipinvgauss', 0.0, -10.0),
>>  ('rice', 0.0, -10.0),
>>  ('semicircular', -1.0, -10.0),
>>  ('triang', 0.0, -10.0),
>>  ('truncexpon', 0.0, -10.0),
>>  ('uniform', 0.0, -10.0),
>>  ('wald', 0.0, -10.0),
>>  ('weibull_min', 0.0, -10.0),
>>  ('wrapcauchy', 0.0, -10.0)]
>>
>>>>> right = [(d.name, d.b, d.xb) for d in dist_cont if not np.isposinf(d.b)]
>>>>> pprint(right)
>> [('anglit', 0.78539816339744828, 10.0),
>>  ('arcsine', 1.0, 10.0),
>>  ('beta', 1.0, 10.0),
>>  ('betaprime', 500.0, 10.0),
>>  ('bradford', 1.0, 10.0),
>>  ('cosine', 3.1415926535897931, 10.0),
>>  ('frechet_l', 0.0, 10.0),
>>  ('gausshyper', 1.0, 10.0),
>>  ('johnsonb', 1.0, 10.0),
>>  ('levy_l', 0.0, 10.0),
>>  ('powerlaw', 1.0, 10.0),
>>  ('rdist', 1.0, 10.0),
>>  ('semicircular', 1.0, 10.0),
>>  ('triang', 1.0, 10.0),
>>  ('uniform', 1.0, 10.0),
>>  ('weibull_max', 0.0, 10.0),
>>  ('wrapcauchy', 6.2831853071795862, 10.0)]
>>
>> only pareto has both limits on the same side of zero
>>
>>>>> pprint ([(d.name, d.a, d.b) for d in dist_cont if d.a*d.b>0])
>> [('pareto', 1.0, inf)]
>>
>>
>> genextreme, and maybe one or two others, are missing because finite a,
>> b are set in _argcheck
>> vonmises is for circular and doesn't behave properly
>>
>> only two distributions define non-generic xa or xb
>>
>>>>> pprint ([(d.name, d.a, d.b, d.xa, d.xb) for d in dist_cont if not d.xa*d.xb==-100])
>> [('foldcauchy', 0.0, inf, -10.0, 1000), ('recipinvgauss', 0.0, inf, -10.0, 50)]
>>
>> a pull request setting correct xa, xb would be very welcome
>>
>> Josef
>>
>>
>>>
>>> Josef
>>>
>>>>
>>>> Nicky
>>>> _______________________________________________
>>>> SciPy-Dev mailing list
>>>> SciPy-Dev at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>