[SciPy-Dev] New feature: fitting distributions to censored data.

Thu Mar 25 11:24:31 EDT 2021

Hey folks,

A new enhancement for fitting probability distributions to censored
data by maximum likelihood estimation is in progress in

    https://github.com/scipy/scipy/pull/13699

This is a feature in the statistics roadmap, and is part of the work
for the CZI EOSS cycle 1 grant.

Censored data is data where the true value of the measurement is
unknown, but there is a known bound.  The common terminology is:

    * left-censored:
        The true value is greater than some known bound.

    * right-censored:
        The true value is less than some known bound.

    * interval-censored:
        The true value is between known bounds.

(By allowing the bounds to be +inf and -inf, we can think of all these
cases as interval-censored.)

In the PR, a new class, `CensoredData`, is defined to represent
censored data.  The `fit` method of the univariate continuous
distributions is updated to accept an instance of `CensoredData`. The
`CensoredData` class constructor accepts two arrays, `lower` and
`upper`, to hold the known bounds of the data. (If `lower[i] ==
upper[i]`, it means that data value is not censored.)

Because it is quite common for data to be encountered where the only
censored values in the data set are either all right-censored or all
left-censored, two convenience methods,
`CensoredData.right_censored(x, censored)` and
`CensoredData.left_censored(x, censored)`, are provide to create a
`CensoredData` instance from an array of values and a corresponding
boolean array that indicates if the value is censored.

Here's a quick example, with data from
https://www.itl.nist.gov/div898/handbook/apr/section4/apr413.htm.  The
data table shows 10 regular values and 10 right-censored values.  The
results reported there for fitting the two-parameter Weibull
distribution (location fixed at 0) to that data are

    shape = 1.7208
    scale = 606.5280

Here's the calculation using the proposed API in SciPy:

In [55]: from scipy.stats import weibull_min, CensoredData

Create the `CensoredData` instance:

In [56]: x = np.array([54, 187, 216, 240, 244, 335, 361, 373, 375,
386] + [500]*10)

In [57]: data = CensoredData.right_censored(x, x == 500)

In [58]: print(data)

CensoredData(20 values: 10 not censored, 10 right-censored)

Fit `weibull_min` (with the location fixed at 0) to the censored data:

In [59]: shape, loc, scale = weibull_min.fit(data, floc=0)

In [60]: shape

Out[60]: 1.720797180719942

In [61]: scale

Out[61]: 606.527565269458

Matt Haberland has already suggested quite a few improvements to the
PR. Additional comments would be appreciated.

Warren