[Numpy-discussion] how to fit a given pdf

Wed Aug 11 21:51:04 EDT 2010

On Wed, Aug 11, 2010 at 8:00 PM, Geordie McBain <gdmcbain at freeshell.org> wrote:
> 2010/8/12 Renato Fabbri <renato.fabbri at gmail.com>:
>> Dear All,
>>
>> help appreciated, thanks in advance.
>>
>> how do you fit a pdf you have with a given pdf (say gamma).
>>
>> with the file attached, you can go like:
>>
>> a=open("AC-010_ED-1m37F100P0.txt","rb")
>> aa=a.read()
>> aaa=aa[1:-1].split(",")
>> data=[int(i) for i in aaa]
>>
>> if you do pylab.plot(data); pylab.show()
>>
>> The data is something like:
>> ___|\___
>>
>> It is my pdf (probability density function).
>>
>> how can i find the right parameters to make that fit with a gamma?
>>
>> if i was looking for a normal pdf, for example, i would just find mean
>> and std and ask for the pdf.
>>
>> i've been playing with scipy.stats.distributions.gamma but i have not
>> reached anything.
>>
>> we can extend the discussion further, but this is a good starting point.
>>
>> any idea?
>
> A general point on fitting empirical probability density functions is
> that it is often much easier to fit the cumulative distribution
> function instead.  For one thing, this means you don't have to decide
> on the intervals of the bins in the histogram.  For another, it's
> actually often the cdf that is more related to the final answer
> (though I don't know your application, of course).
>
> Here's a quote.
>
>  `So far the discussion of plots of distributions has emphasized
> frequency (or probability) vs. size plots, whereas for many
> applications cumulative plots are more important. Cumulative curves
> are produced by plotting the percentage of particles (or weight,
> volume, or surface) having particle diameters greater than (or less
> than) a given particle size against the particle size. … Such curves
> have the advantage over histograms for plotting data that the class
> interval is eliminated, and they can be used to represent data which
> are obtained in classified form having unequal class intervals'
> (Cadle, R. D. 1965. Particle Size. New York: Reinhold Publishing
> Corporation, pp. 38-39)
>
> Once you've got your empirical cdf, the problem reduces to one of
> nonlinear curve fitting, for whichever theoretical distribution you
> like.  For a tutorial on nonlinear curve fitting, see
> scipy.optimize.leastsq at
> http://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html.  You
> could of course use this approach for the pdf too, but I fancy the cdf
> result will be more robust.

A similar approach for pareto works easily because the function is
linear in log-log scale, but the properties of these graphical
estimators was criticized in the readings as not being as good as
other estimators.
But they also show that in the graphs, empirical cdf or survival
function are more stable/smooth than histograms, especially in the
tails. MLE uses the pdf or logpdf without binning the data.

>
> On the other hand, if you want something like your `mean and variance'
> approach to fitting normal distributions, you could still compare your
> mean and variance with the known values for the Gamma distribution
> (available e.g. on its Wikipedia page) and back-out the two parameters
> of the distribution from them.  I'm not too sure how well this will
> work, but it's pretty easy.

called method of moments in the literature, as in Jonathan's estimator

>
> Another idea occurs to me and is about as easy as this is to compute
> the two parameters of the Gamma distribution by collocation with the
> empirical cdf; i.e. pick two quantiles, e.g. 0.25 and 0.75, or
> whatever, and get two equations for the two unknown parameters by
> insisting on the Gamma cdf agreeing with the empirical for these
> quantiles.  This might be more robust than the mean & variance
> approach, but I haven't tried either.

It works pretty well, see
http://www.johndcook.com/blog/2010/01/31/parameters-from-percentiles/
and his other articles cited at the bottom.
(Using a generalized method of moment approach, this can be
generalized to using more than two quantiles, which works well in my
examples.)

On advantage of generic MLE is that I know how to get (asymptotic)
standard errors and confidence intervals, while for many of the other
estimators bootstrap or checking the distribution specific formulas is
necessary.

Josef

>
> Good luck!
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>