[SciPy-User] Wording question regarding to distributions

Thu Jan 14 22:14:21 EST 2010

On Thu, Jan 14, 2010 at 7:10 PM, Gökhan Sever <gokhansever at gmail.com> wrote:
>
>
> On Thu, Jan 14, 2010 at 4:30 PM, Gökhan Sever <gokhansever at gmail.com> wrote:
>>
>> Hello,
>>
>> What is the right way to express:
>>
>> Do we fit data to a distribution or distribution to data?
>>
>> Thanks.
>>
>>
>>
>> Gökhan
>
> Here is how the question arose in my mind.
>
> Previously, I had asked how to fit a log-normal distribution to my
> data in this thread:
> http://mail.scipy.org/pipermail/scipy-user/2009-November/023320.html
>
> Well, the work is unfinished there, and I have started to dig into the same
> subject again. For R, I have found a function that lets me estimate
> parameters from my binned data pairs (i.e. bin sizes and measurements) to
> construct a log-normal fit:
>
> http://www.exposurescience.org/heR.doc/library/heR.Misc/html/bin2lnorm.html
>
> The description given for the function is in conflict with itself:
>
> The title says: "Fit binned data to a log-normal distribution"
>
> However, the description says something different:
>
> "This function takes binned data and fits a lognormal model to it, using
> weighted least squares, and optionally plotting the fit and the data
> together"
>
> I couldn't find a way to estimate log-normal parameters in Python (maybe I
> will need the same for gamma distributions as well) from input in the form
> that bin2lnorm takes (i.e. l: bin limits, and h: the corresponding heights,
> which are measurements in my case). That is the reason I use that R function.
> Any alternative suggestions are welcome at this point.
>
> Similarly, while I was studying my Cloud and Precipitation Parameterizations
> book today (distributions are extremely important in bulk parameterization
> of clouds and cloud constituents/products, e.g. aerosols, cloud droplets,
> rain, hail, etc.), I saw a couple of figures (please see the book review at
> http://www.cambridge.org/catalogue/catalogue.asp?isbn=9780521883382&ss=exc
> and go to pg. 9, Figure 1.2) using statements like: "gamma curves fit to
> data."
>
> It's clearer now after reading your inputs.
>
> Thanks again.
> --
> Gökhan
>

It depends on what you mean by 'data'.  Like many things, though, the
terminology is rather flexible, often misused, or just incomplete.

Typically you have random variables
(http://en.wikipedia.org/wiki/Random_variables) from some distribution
such as multivariate normal. Note that a distribution is a rather
complex thing which has various properties
(http://en.wikipedia.org/wiki/Probability_distribution).

When you want to see whether the data are from some distribution that you
do not know, you are testing a hypothesis that your data, as a
whole, have certain characteristics of random variables from that
distribution.  The central limit theorem, when it holds, makes sums and
averages of sufficiently many observations approximately normal, so many
distributions end up looking very similar. However, you cannot say that
the data are random variables from that distribution, nor that all data
points are from the distribution.
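
As a rough illustration of the hypothesis-testing view, here is a minimal
sketch using scipy.stats.kstest. The sample, the seed and the parameter
values (mu = 0.5, sigma = 0.4) are made up for the example and do not come
from this thread. Both candidate distributions are fully specified in
advance; if the parameters were estimated from the same sample, the
p-values would not be valid as computed here.

```python
import numpy as np
from scipy import stats

# Synthetic sample (an assumption for illustration only).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.5, sigma=0.4, size=500)

# Candidate 1: log-normal with the generating parameters.
# scipy's lognorm uses shape s = sigma and scale = exp(mu).
candidate = stats.lognorm(s=0.4, scale=np.exp(0.5))
ks_stat, p_value = stats.kstest(sample, candidate.cdf)

# Candidate 2: a standard normal, which cannot describe
# strictly positive data.
wrong = stats.norm(loc=0.0, scale=1.0)
ks_wrong, p_wrong = stats.kstest(sample, wrong.cdf)

print("log-normal candidate: D = %.3f, p = %.3g" % (ks_stat, p_value))
print("normal candidate:     D = %.3f, p = %.3g" % (ks_wrong, p_wrong))
```

A large p-value only means the test found no evidence against the
candidate; it does not show that the data are from that distribution.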

So if your data are random variables, then neither phrasing is correct.
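
On the practical part of the quoted message (estimating log-normal
parameters from bin limits and heights), something along the following
lines might work in Python. This is only a sketch and not a port of
bin2lnorm: the helper name fit_lognormal_to_bins is hypothetical, the
weighting scheme is a guess, and it assumes strictly positive bin limits
and bins that cover essentially all of the probability mass. The test data
are synthetic.

```python
import numpy as np
from scipy import optimize, stats


def fit_lognormal_to_bins(edges, heights, weights=None):
    """Hypothetical helper: least-squares fit of a log-normal to binned data.

    edges   : strictly positive, increasing bin limits (length n + 1)
    heights : counts or measurements per bin (length n)
    weights : optional per-bin weights (length n); equal weights by default
    Returns (mu, sigma), the mean and standard deviation of log(x).
    """
    edges = np.asarray(edges, dtype=float)
    heights = np.asarray(heights, dtype=float)
    # Assumes the bins hold (nearly) all of the mass.
    fractions = heights / heights.sum()
    if weights is None:
        weights = np.ones_like(fractions)
    weights = np.asarray(weights, dtype=float)

    # Starting values: moments of log(x) taken at geometric bin midpoints.
    log_mids = 0.5 * (np.log(edges[:-1]) + np.log(edges[1:]))
    mu0 = np.sum(fractions * log_mids)
    sigma0 = np.sqrt(np.sum(fractions * (log_mids - mu0) ** 2))
    sigma0 = max(sigma0, 1e-3)  # keep the start inside the bounds

    def residuals(params):
        mu, sigma = params
        cdf = stats.lognorm.cdf(edges, s=sigma, scale=np.exp(mu))
        # Model probability per bin minus observed fraction, weighted.
        return np.sqrt(weights) * (np.diff(cdf) - fractions)

    result = optimize.least_squares(
        residuals,
        x0=[mu0, sigma0],
        bounds=([-np.inf, 1e-6], [np.inf, np.inf]),
    )
    return result.x[0], result.x[1]


# Synthetic check: exact bin probabilities from a known log-normal.
true_dist = stats.lognorm(s=0.5, scale=np.exp(1.0))
edges = np.logspace(-1, 2, 25)
heights = 1000.0 * np.diff(true_dist.cdf(edges))
mu_hat, sigma_hat = fit_lognormal_to_bins(edges, heights)
print(mu_hat, sigma_hat)
```

In the wording of the original question, this fits a log-normal model to
the binned data, not the other way around. The same approach should carry
over to a gamma distribution by swapping in stats.gamma.cdf and its
parameters.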

Bruce