[SciPy-User] adding distributions from hydroclimpy to stats.distributions

Mon Aug 3 00:20:01 EDT 2009

On Sun, Aug 2, 2009 at 3:52 PM, Pierre GM<pgmdevlist at gmail.com> wrote:
>
> On Aug 2, 2009, at 9:41 AM, josef.pktd at gmail.com wrote:
>>
>> I looked briefly at the distributions in hydroclimpy
>> http://projects.scipy.org/scikits/browser/trunk/hydroclimpy/scikits/hydroclimpy/stats/extradistributions.py
>>
>> my first impression:
>>
>> kappa, glogistic, gennorm and wakeby
>> can be added almost without changes to stats distributions, since they
>> are already in the standard format
>> cosmetic changes: add longname and extradocs (from module docstring)
>
> Agreed. I still have an issue about defining a proper template for
> describing the distributions (eqns for pdf/cdf/ppf, example of usage,
> plots...), hence the nudge. What are our doc exegetes' recommendations ?

Currently only the class docstring gets distribution-specific
information. This is done in rv_generic.__init__, which replaces the
name of the distribution and the shape parameters and adds the
extradocs. The example in the class docstring is just a generic
example. More description of the distributions could simply be added
to the extradocs, which show up in the help at the bottom of the
individual distribution class/instance docstrings. For example, the
available formulas for the distribution pdf, cdf, sf, ... (info from
Travis' manual) could be included in the extradoc.

For methods there is currently no setup for individualized information
and examples. This could be changed if we really want to.
However, since the methods of all distributions follow the same pattern, I
don't really see the need. Better overall help and examples, e.g. in
the tutorial, would be useful.

Any good ideas for improvements?

>
>
>> pearson3
>> This one overwrites the main public methods, pdf, cdf, ...,
>> Can this be rewritten to define only the private, distribution
>> specific methods, _cdf, _pdf, or is there a special reason for the
>> public methods?
>
> Depending on the values of the parameters, Pearson III can reduce to a
> normal. Overwriting .pdf and .cdf was IMHO more efficient than trying
> to stick to the _pdf/_cdf methods. The same problem arises when a
> distribution reduces to another in some particular cases.
>

I will have to look at how you did it.
Per Brodtkorb implemented some changes that include the limiting cases
of distributions with respect to their parameters. In this case
the methods get a bit heavy in conditional assignments, but they still
follow the standard pattern.
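
A minimal sketch of what such a conditional assignment could look like
for Pearson III. This is not the hydroclimpy or Brodtkorb code. It
assumes a standardized parametrization (mean 0, variance 1) indexed by
the skew alone, and pearson3_pdf is a hypothetical standalone function
standing in for a private _pdf method. Under that assumption the
density is a shifted, scaled gamma density for nonzero skew and the
normal density in the limit skew -> 0:

```python
import numpy as np
from scipy import integrate, stats

def pearson3_pdf(x, skew, tol=1e-8):
    """Hypothetical standardized Pearson III pdf (mean 0, variance 1)."""
    x, skew = np.broadcast_arrays(np.asarray(x, float), np.asarray(skew, float))
    is_normal = np.abs(skew) < tol           # limiting case: normal distribution
    safe = np.where(is_normal, 1.0, skew)    # dummy value avoids division by zero
    alpha = 4.0 / safe**2                    # gamma shape
    beta = 2.0 / safe                        # signed rate
    zeta = -2.0 / safe                       # location of the bounded end
    gamma_part = np.abs(beta) * stats.gamma.pdf(beta * (x - zeta), alpha)
    return np.where(is_normal, stats.norm.pdf(x), gamma_part)

# Numerical check: the density should integrate to one (support starts at zeta = -4).
total, _ = integrate.quad(lambda t: float(pearson3_pdf(t, 0.5)), -4.0, np.inf)
```

The np.where calls are the conditional assignments. Both branches are
evaluated for all elements, so the dummy value for the normal case
only serves to keep the gamma branch finite.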

Your description of Pearson III could be a bit more explicit about the
relationship to the gamma distribution. From searching on the
internet, it isn't clear to me whether gamma is the same as Pearson
III, as hinted at by Wikipedia and as it also appears in
http://mathworld.wolfram.com/PearsonSystem.html
or whether there are different versions.

>
>
>> ztnbinom and logseries look like duplicates of stats.nbinom and
>> stats.logser
>> ztnbinom uses a different way to calculate stats
>
> ztnbinom is the zero-truncated negative binomial distribution, a
> particular case of the negative binomial where support is restricted
> to integers greater than or equal to 1 (no zero class). Yes, the stats are
> slightly different because of the truncation.

Initially, I didn't look carefully enough to see that there is the
scaling term in the cdf and pdf. The extradoc string still has 0<=k
and rvs doesn't remove the zeros in the numpy.random numbers (?).
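
To make the two points concrete, here is a rough sketch of zero
truncation built on stats.nbinom. It is not the hydroclimpy
implementation. ztnbinom_pmf and ztnbinom_rvs are hypothetical helper
names, and the simple rejection loop is just one way to drop the zeros
from the numpy.random draws:

```python
import numpy as np
from scipy import stats

def ztnbinom_pmf(k, n, p):
    """pmf of nbinom(n, p) conditional on k >= 1 (hypothetical helper)."""
    k = np.asarray(k)
    scale = stats.nbinom.sf(0, n, p)          # P(X >= 1), the scaling term
    return np.where(k >= 1, stats.nbinom.pmf(k, n, p) / scale, 0.0)

def ztnbinom_rvs(n, p, size, rng=None):
    """Draw `size` variates by rejecting the zeros (size must be an int)."""
    rng = np.random.default_rng(rng)
    out = np.empty(size, dtype=np.int64)
    filled = 0
    while filled < size:
        draws = rng.negative_binomial(n, p, size=size - filled)
        draws = draws[draws >= 1]             # remove the zero class
        out[filled:filled + draws.size] = draws
        filled += draws.size
    return out

sample = ztnbinom_rvs(5, 0.5, size=1000, rng=0)
```

With this the support really is k >= 1, and the pmf sums to one over
the truncated support.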

> Similarly, we can define
> a zero-inflated Poisson.
>
> I considered developing a generic trunc_dist class from rv_discrete to
> handle arbitrary truncation, but realized that the scope was too large
> for me to handle, and I already have enough on my plate(s) for now.

For the continuous distributions, I wanted to do this in a similar way
as I did for creating distributions based on non-linear
transformations. But I played only a bit with it for the discrete case
when I checked the correctness of the expect function.
A basic version that is fully functional and uses the generic methods
should be pretty easy to write, but testing, optimizing, ...

>
>> logseries adds a fit function
>> Is there a difference that I'm missing after my only brief look?
>
> I had overlooked the logser distribution (silly me). Adding the fit
> method is required for my own applications (analyzing dry/wet spells
> distributions). I'm about to add fit methods to other discrete
> distributions as I need them.
>
>
>
>> I don't know anything about L moments and only briefly looked up the
>> definitions. Is there a generic method that works (reasonably well)
>> for all distributions?
>
> L-moments are defined for continuous distributions only. You can find
> a nice description of their definition and use here:
> http://www.research.ibm.com/people/h/hosking/lmoments.html
> In short, they tend to be more robust than the classical moments. The
> fact that the L-kurtosis and L-skewness are in the interval [-1;+1]
> simplifies the comparisons between different distributions when trying
> to define the most adequate one.
> L-moments of some specific distributions have an explicit formulation
> that can help estimating the parameters of these distributions (hence
> the whole lmoments.py module).
>

I looked at this and similar references, and it looks interesting. Do
I have to be a bit suspicious because there are almost no statistics
journals in the list of references on the IBM page? Did L-moments only
become popular in hydrology?
Just a quick check: I didn't find L-moments in the SAS help, but R has
a package for it, and the Fortran code by Hosking is BSD (or similar).
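
For reference, the population L-moments can be written as
lambda_r = integral over [0, 1] of ppf(u) * P*_{r-1}(u) du, where
P*_{r-1} is the shifted Legendre polynomial. A small sketch of that
definition, checked against the exponential distribution, for which
the first three L-moments with unit scale are 1, 1/2 and 1/6. The
helper name lmoment is hypothetical and this is not the code in
lmoments.py:

```python
from scipy import integrate, special, stats

def lmoment(ppf, r):
    """r-th population L-moment from a quantile function (hypothetical helper)."""
    P = special.sh_legendre(r - 1)            # shifted Legendre polynomial on [0, 1]
    value, _ = integrate.quad(lambda u: ppf(u) * P(u), 0, 1)
    return value

# Standard exponential: expected values are 1, 1/2 and 1/6.
l1, l2, l3 = (lmoment(stats.expon.ppf, r) for r in (1, 2, 3))
l_skewness = l3 / l2                          # L-skewness tau_3, lies in (-1, 1)
```

Since this only needs the ppf, the same function should work for any
continuous distribution whose quantile function is integrable on
[0, 1], although numerical integration can be slow when the ppf itself
is computed numerically.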

>
>> I assume the main work would be to make sure that adding a new method
>> would work with all distributions. I would gladly review a patch, but
>> I don't have the time to do the integration into stats.distributions
>> and the testing myself.
>
> OK, what about we keep them on the backburner for now ? Hopefully I'll
> have more time to deal with polishing the docs and adding more tests
> soon. My advertising these new distributions was primarily to let
> other users know that they're already implemented somewhere, and to
> illustrate the need for a doc template.
>

I had already looked a bit closer at your lmoments, and they are
already pretty well integrated with the stats.distributions code. It
should be possible to add some generic tests that work for the
distributions that are not included in your tests.

A Python question, since I'm never sure about the details of monkey
patching and it always takes a while to track down the correct
references:

In lmoments you attach functions to distribution classes directly.
From my (possibly wrong) understanding these functions should remain
functions and not turn into instance methods. However, it looks like
you use them as instance methods in the tests. Does self get replaced
by the instance in the call?
I thought that to attach a function as an instance method, either
new.instancemethod or types.MethodType has to be used.
What am I missing?

http://projects.scipy.org/scikits/browser/trunk/hydroclimpy/scikits/hydroclimpy/stats/lmoments.py

# moment from definition
def _lmomg(self, m, *args):
    "Compute the mth L-moment with a Legendre polynomial."
    P_r = special.sh_legendre(m-1)
    func = lambda x : self._ppf(x, *args) * P_r(x)
    return integrate.quad(func, 0, 1)[0]
dist.rv_continuous._lmomg = _lmomg

def _lmomg_frozen(self, nmom):
    return self.dist._lmomg(nmom, *self.args, **self.kwds)
dist.rv_frozen._lmomg = _lmomg_frozen

and

dist.expon_gen.lstats = _lstats_direct(_lmoments.lmrexp)
dist.gamma_gen.lstats = _lstats_direct(_lmoments.lmrgam)
dist.genextreme_gen.lstats = _lstats_direct(_lmoments.lmrgev)
extradist.glogistic_gen.lstats = _lstats_direct(_lmoments.lmrglo)
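
On the monkey-patching question, a small self-contained experiment
with a toy class (Dist, double_scale and triple_scale are made-up
names, unrelated to stats.distributions). A function assigned to a
class attribute is bound to the instance on attribute lookup, while a
function assigned to a single instance is stored as-is and needs
types.MethodType to be bound explicitly:

```python
import types

class Dist:
    def __init__(self, scale):
        self.scale = scale

def double_scale(self):
    return 2 * self.scale

def triple_scale(self):
    return 3 * self.scale

# Attached to the class: lookup through an instance yields a bound method.
Dist.double_scale = double_scale
d = Dist(3)
result_class_attr = d.double_scale()      # self is filled in automatically

# Attached to one instance: the plain function is stored, self is not filled in.
e = Dist(1)
e.triple_scale = triple_scale
try:
    e.triple_scale()
    unbound_call_failed = False
except TypeError:
    unbound_call_failed = True

# Explicit binding is needed only for the per-instance case.
e.triple_scale = types.MethodType(triple_scale, e)
result_instance_attr = e.triple_scale()
```

So attaching to the class, as lmoments does, would be enough for the
functions to behave as instance methods. The explicit binding matters
only when patching an individual instance.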

BTW:
I like it that you started writing some developer notes on
stats.distributions. There are many things that took me a long time to
figure out, and developer notes explaining the internals would have
come in handy. The main one is the state-preserving call to _argcheck when
the bounds depend on parameters, which creates some entertaining,
state-dependent bugs.
I wish distributions were classes instead of class instances.

Josef
