[SciPy-User] [Numpy-discussion] Fitting a curve on a log-normal distributed data

Robert Kern robert.kern at gmail.com
Tue Nov 17 15:41:47 EST 2009


On Tue, Nov 17, 2009 at 14:04,  <josef.pktd at gmail.com> wrote:

> The way I see it, you have two variables, size and counts (or concentration).
> My initial interpretation was you want to model the relationship between
> these two variables.
> When the total number of particles is fixed, then the conditional size
> distribution is univariate, and could be modeled by a log-normal
> distribution. (This still leaves the total count unmodelled.)
>
> If you have the total particle count per bin, then it
> should be possible to write down the likelihood function that is
> discretized to the bins from the continuous distribution.
> Given a random particle, what's the probability of being in bin 1,
> bin 2 and so on. Then add the log-likelihood over all particles
> and maximize as a function of the log-normal parameters.
> (There might be a numerical trick using fraction instead of
> conditional count, but I'm not sure what the analogous discrete
> distribution would be. )

I usually use the multinomial as the likelihood for such
"histogram-fitting" exercises. The two problem points here are that we
have real-valued concentrations, not integer-valued counts, and that
we don't have a measurement for the censored region. For the former, I
would suggest simply multiplying the concentrations by a power of
10 (equivalently, changing the units to particles/<10^n larger
volume>) such that the resolution of the measurements is 1
particle/<volume>. Then just apply the multinomial. It should be a
close enough approximation.
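
As a rough illustration of the histogram-fitting idea, here is a minimal
sketch of a multinomial fit of a log-normal to binned counts. Everything
in it is hypothetical: the function names, the synthetic data, and the bin
edges. It assumes the concentrations have already been rescaled and rounded
to integer counts as described above. It also sidesteps the unmeasured
region by renormalizing the bin probabilities over the observed bins, which
is an assumption of this example and not a correction worked out here.

```python
import numpy as np
from scipy import optimize, stats


def bin_probabilities(params, edges):
    """Log-normal probability of each bin, conditional on the observed range."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # optimize log(sigma) so sigma stays positive
    # If X is log-normal, log(X) is normal with mean mu and std sigma.
    cdf = stats.norm.cdf((np.log(edges) - mu) / sigma)
    p = np.diff(cdf)
    # Assumption of this sketch: renormalize over the observed bins only.
    return p / p.sum()


def neg_log_likelihood(params, edges, counts):
    """Negative multinomial log-likelihood, dropping the constant term."""
    p = bin_probabilities(params, edges)
    return -np.sum(counts * np.log(np.clip(p, 1e-300, None)))


def fit_lognormal_to_bins(edges, counts):
    """Return (mu, sigma) maximizing the multinomial likelihood."""
    # Crude starting values from the geometric bin centers.
    log_centers = np.log(np.sqrt(edges[:-1] * edges[1:]))
    w = counts / counts.sum()
    mu0 = np.sum(w * log_centers)
    sigma0 = np.sqrt(np.sum(w * (log_centers - mu0) ** 2))
    res = optimize.minimize(
        neg_log_likelihood,
        [mu0, np.log(sigma0)],
        args=(edges, counts),
        method="Nelder-Mead",
    )
    return res.x[0], np.exp(res.x[1])


# Synthetic data: true mu = 0.0, true sigma = 0.5, nothing measured below 0.5.
rng = np.random.default_rng(0)
edges = np.logspace(np.log10(0.5), 1.0, 21)
sizes = rng.lognormal(mean=0.0, sigma=0.5, size=20000)
counts, _ = np.histogram(sizes, bins=edges)

mu_hat, sigma_hat = fit_lognormal_to_bins(edges, counts)
print(mu_hat, sigma_hat)
```

With real data, `counts` would be something like
`np.round(concentrations * 10**n)` for whatever `n` brings the resolution
to 1 particle per volume.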

I'm not entirely sure what to do about the censored probability mass.
I think there might be a simple correction factor that you can apply
to the multinomial likelihood, but I haven't worked it out.

> Once the parameters of the log-normal distribution are
> estimated, the distribution would be defined over all of
> the real line (where the out of sample pdf is determined
> by assumption not data).

Since we are extrapolating to the censored region, it would probably
be a good idea to estimate the uncertainty of the estimate. I would
probably suggest using PyMC to do a Bayesian model. A parametric
bootstrap might also serve.
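
For the parametric bootstrap, a minimal sketch might look like the
following. The names, the synthetic data, the starting guess, and the
choice of 200 replicates are all hypothetical, and the binned fit again
conditions on the observed range, which is an assumption of the example.
The idea is to fit once, simulate new bin counts from the fitted model,
refit each simulated data set, and look at the spread of the extrapolated
quantity (here, the fraction of the distribution below the lowest bin edge).

```python
import numpy as np
from scipy import optimize, stats


def bin_probs(mu, sigma, edges):
    """Log-normal bin probabilities, renormalized over the observed bins."""
    p = np.diff(stats.norm.cdf((np.log(edges) - mu) / sigma))
    return p / p.sum()


def fit(edges, counts, start):
    """Maximize the multinomial likelihood; start is a (mu, sigma) guess."""
    def nll(theta):
        p = bin_probs(theta[0], np.exp(theta[1]), edges)
        return -np.sum(counts * np.log(np.clip(p, 1e-300, None)))

    res = optimize.minimize(nll, [start[0], np.log(start[1])],
                            method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])


def censored_fraction(mu, sigma, lower_edge):
    """Probability mass of the fitted log-normal below the lowest edge."""
    return stats.norm.cdf((np.log(lower_edge) - mu) / sigma)


# Synthetic "observed" histogram: true mu = 0.0, sigma = 0.5, cut at 0.5.
rng = np.random.default_rng(1)
edges = np.logspace(np.log10(0.5), 1.0, 21)
counts, _ = np.histogram(rng.lognormal(0.0, 0.5, 5000), bins=edges)

# Step 1: fit the observed data from a rough starting guess.
mu_hat, sigma_hat = fit(edges, counts, start=(0.5, 2.0))
point = censored_fraction(mu_hat, sigma_hat, edges[0])

# Step 2: simulate from the fitted model and refit each replicate.
n_boot = 200
p_hat = bin_probs(mu_hat, sigma_hat, edges)
boot = np.empty(n_boot)
for b in range(n_boot):
    fake_counts = rng.multinomial(counts.sum(), p_hat)
    mu_b, sigma_b = fit(edges, fake_counts, start=(mu_hat, sigma_hat))
    boot[b] = censored_fraction(mu_b, sigma_b, edges[0])

# Step 3: percentile interval for the extrapolated quantity.
lo, hi = np.percentile(boot, [2.5, 97.5])
print(point, lo, hi)
```

The interval only reflects sampling variability under the assumed
log-normal form; it says nothing about whether that form actually holds in
the unmeasured region.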

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


