[SciPy-User] [Numpy-discussion] Fitting a curve on a log-normal distributed data

Tue Nov 17 16:01:56 EST 2009

On Tue, Nov 17, 2009 at 3:41 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Tue, Nov 17, 2009 at 14:04,  <josef.pktd at gmail.com> wrote:
>
>> The way I see it, you have two variables, size and counts (or concentration).
>> My initial interpretation was you want to model the relationship between
>> these two variables.
>> When the total number of particles is fixed, then the conditional size
>> distribution is univariate, and could be modeled by a log-normal
>> distribution. (This still leaves the total count unmodelled.)
>>
>> If you have the total particle count per bin, then it
>> should be possible to write down the likelihood function that is
>> discretized to the bins from the continuous distribution.
>> Given a random particle, what's the probability of being in bin 1,
>> bin 2 and so on. Then add the log-likelihood over all particles
>> and maximize as a function of the log-normal parameters.
>> (There might be a numerical trick using fractions instead of
>> conditional counts, but I'm not sure what the analogous discrete
>> distribution would be.)
>
> I usually use the multinomial as the likelihood for such
> "histogram-fitting" exercises. The two problem points here are that we
> have real-valued concentrations, not integer-valued counts, and that
> we don't have a measurement for the censored region. For the former, I
> would suggest simply multiplying the concentrations by a factor of
> 10^n (equivalently, changing the units to particles/<10^n larger
> volume>) such that the resolution of the measurements is 1
> particle/<volume>. Then just apply the multinomial. It should be a
> close enough approximation.
>
> I'm not entirely sure what to do about the censored probability mass.
> I think there might be a simple correction factor that you can apply
> to the multinomial likelihood, but I haven't worked it out.

I think that for the continuous distribution it would just mean dividing
the density by the probability of the uncensored region (which is also a
function of the distribution parameters). This would then just be a
truncated log-normal. The multinomial might work the same way, as long as
the bin probabilities are defined by discretizing the continuous
distribution.
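
A minimal sketch of that idea, under my own assumptions: the bin
probabilities are differences of the log-normal CDF at the bin edges,
renormalized by the mass inside the observed range (the truncation), and
the multinomial log-likelihood is the count-weighted sum of their logs.
The names binned_lognorm_nll and fit_binned_lognorm are hypothetical, the
edges are assumed positive and increasing, and the starting values are
crude guesses from the bin midpoints. Not tested on the real data.

```python
import numpy as np
from scipy import stats, optimize


def binned_lognorm_nll(params, edges, counts):
    """Negative multinomial log-likelihood of binned, truncated log-normal data.

    params = (mu, log_sigma); sigma is optimized on the log scale so it stays
    positive. The constant multinomial coefficient is dropped.
    """
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # scipy's lognorm uses s=sigma and scale=exp(mu) for ln(X) ~ N(mu, sigma^2)
    cdf = stats.lognorm.cdf(edges, s=sigma, scale=np.exp(mu))
    p_obs = max(cdf[-1] - cdf[0], 1e-300)   # mass of the uncensored region
    p = np.clip(np.diff(cdf) / p_obs, 1e-300, None)  # truncated bin probabilities
    return -np.sum(counts * np.log(p))


def fit_binned_lognorm(edges, counts):
    """Return (mu_hat, sigma_hat) maximizing the binned likelihood."""
    edges = np.asarray(edges, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Rough starting values: weighted moments of the log of the bin midpoints
    log_mids = 0.5 * (np.log(edges[:-1]) + np.log(edges[1:]))
    w = counts / counts.sum()
    mu0 = np.sum(w * log_mids)
    sd0 = np.sqrt(np.sum(w * (log_mids - mu0) ** 2))
    x0 = [mu0, np.log(max(sd0, 1e-3))]
    res = optimize.minimize(binned_lognorm_nll, x0, args=(edges, counts),
                            method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])
```

The counts do not have to be integers here, so rescaled concentrations can
be passed in directly; only their relative sizes affect the point estimate,
while the overall scale affects how sharply peaked the likelihood is.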

Would you apply the multinomial directly? I don't see in that case
how you would recover the parameters of the continuous distribution.

Josef

>
>> Once the parameters of the log-normal distribution are
>> estimated, the distribution would be defined over all of
>> the positive real line (where the out-of-sample pdf is determined
>> by assumption, not data).
>
> Since we are extrapolating to the censored region, it would probably
> be a good idea to estimate the uncertainty of the estimate. I would
> probably suggest using PyMC to do a Bayesian model. A parametric
> bootstrap might also serve.

I would use bootstrap, since I still haven't figured out how to use MCMC.
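
Roughly what I have in mind for the parametric bootstrap, as a sketch
only: simulate bin counts from the fitted truncated log-normal, refit each
simulated histogram, and look at the spread of the refitted parameters.
The helper names and the `fit` callable (anything that maps edges and
counts to (mu, sigma)) are hypothetical, and n_total stands for the
effective total count after rescaling, which is itself an assumption.

```python
import numpy as np
from scipy import stats


def truncated_bin_probs(mu, sigma, edges):
    """Bin probabilities of a log-normal, conditional on falling inside the edges."""
    cdf = stats.lognorm.cdf(edges, s=sigma, scale=np.exp(mu))
    return np.diff(cdf) / (cdf[-1] - cdf[0])


def parametric_bootstrap(fit, mu_hat, sigma_hat, edges, n_total,
                         n_boot=200, seed=0):
    """Refit simulated histograms drawn from the fitted model.

    fit(edges, counts) must return a (mu, sigma) pair.
    Returns an array of shape (n_boot, 2) with the bootstrap estimates.
    """
    rng = np.random.default_rng(seed)
    p = truncated_bin_probs(mu_hat, sigma_hat, edges)
    p = p / p.sum()  # guard against rounding before sampling
    draws = np.empty((n_boot, 2))
    for b in range(n_boot):
        counts = rng.multinomial(n_total, p)  # one simulated histogram
        draws[b] = fit(edges, counts)
    return draws
```

Percentiles of each column of the result (for example with np.percentile)
would then give rough intervals for mu and sigma, and the same draws can be
pushed through the log-normal CDF to get an interval for the censored mass.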

Josef

>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>  -- Umberto Eco