[SciPy-Dev] New feature: fitting distributions to censored data.

Robert Kern robert.kern at gmail.com
Fri Mar 26 09:22:35 EDT 2021


On Fri, Mar 26, 2021 at 5:11 AM Ralf Gommers <ralf.gommers at gmail.com> wrote:

>
>
> On Thu, Mar 25, 2021 at 4:24 PM Warren Weckesser <
> warren.weckesser at gmail.com> wrote:
>
>> Hey folks,
>>
>> A new enhancement for fitting probability distributions to censored
>> data by maximum likelihood estimation is in progress in
>>
>>     https://github.com/scipy/scipy/pull/13699
>
>
> Thanks Warren. Overall this looks quite good to me.
>
>
>>
>> This is a feature in the statistics roadmap, and is part of the work
>> for the CZI EOSS cycle 1 grant.
>>
>> Censored data is data where the true value of the measurement is
>> unknown, but there is a known bound.  The common terminology is:
>>
>>     * left-censored:
>>         The true value is greater than some known bound.
>>
>>     * right-censored:
>>         The true value is less than some known bound.
>>
>
It's the other way around, isn't it?


>     * interval-censored:
>>         The true value is between known bounds.
>>
>> (By allowing the bounds to be +inf and -inf, we can think of all these
>> cases as interval-censored.)
>>
>> In the PR, a new class, `CensoredData`, is defined to represent
>> censored data.  The `fit` method of the univariate continuous
>> distributions is updated to accept an instance of `CensoredData`. The
>> `CensoredData` class constructor accepts two arrays, `lower` and
>> `upper`, to hold the known bounds of the data. (If `lower[i] ==
>> upper[i]`, it means that data value is not censored.)
>>
>> Because it is quite common for data to be encountered where the only
>> censored values in the data set are either all right-censored or all
>> left-censored, two convenience methods,
>> `CensoredData.right_censored(x, censored)` and
>> `CensoredData.left_censored(x, censored)`, are provide to create a
>> `CensoredData` instance from an array of values and a corresponding
>> boolean array that indicates if the value is censored.
>>
>> Here's a quick example, with data from
>>
>> https://www.itl.nist.gov/div898/handboodifferentlyk/apr/section4/apr413.htm
>> <https://www.itl.nist.gov/div898/handbook/apr/section4/apr413.htm>.  The
>> data table shows 10 regular values and 10 right-censored values.  The
>> results reported there for fitting the two-parameter Weibull
>> distribution (location fixed at 0) to that data are
>>
>>     shape = 1.7208
>>     scale = 606.5280
>>
>> Here's the calculation using the proposed API in SciPy:
>>
>> In [55]: from scipy.stats import weibull_min, CensoredData
>>
>> Create the `CensoredData` instance:
>>
>> In [56]: x = np.array([54, 187, 216, 240, 244, 335, 361, 373, 375,
>> 386] + [500]*10)
>>
>> In [57]: data = CensoredData.right_censored(x, x == 500)
>>
>
> This `x == 500` looks a little odd API-wise. Why not `right_censored(x,
> 500)`. Or, more importantly, why not something like:
>
>     data = CensoredData(x).rightcensor(500)
>
> There are of course multiple ways of doing this, but the use of
> classmethods and only taking arrays seems unusual. Also in the constructor,
> why do `lower` and `upper` have to be boolean arrays rather than scalars?
>

The value in `x` itself is the bound. Each data point can be censored with
a different bound. The censoring bound is not a shared property of the
whole dataset. It just happened to be the case for this example (which may
indicate that a more general motivating example should be used, at least
while the API is under design). For example, if you are studying the
duration of some condition in a longitudinal study. Individuals entered the
study at different times, and now you have to close the books and write the
paper, but some stubborn individuals still have the condition. The
durations for those individuals would be right-censored, but with different
durations because of their different entry points.

-- 
Robert Kern
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.python.org/pipermail/scipy-dev/attachments/20210326/75165ccb/attachment.html>


More information about the SciPy-Dev mailing list