[SciPy-Dev] New feature: fitting distributions to censored data.
Warren Weckesser
warren.weckesser at gmail.com
Thu Mar 25 11:24:31 EDT 2021
Hey folks,
A new enhancement for fitting probability distributions to censored
data by maximum likelihood estimation is in progress in
https://github.com/scipy/scipy/pull/13699
This is a feature in the statistics roadmap, and is part of the work
for the CZI EOSS cycle 1 grant.
Censored data is data where the true value of the measurement is
unknown, but there is a known bound. The common terminology is:
* left-censored:
The true value is greater than some known bound.
* right-censored:
The true value is less than some known bound.
* interval-censored:
The true value is between known bounds.
(By allowing the bounds to be +inf and -inf, we can think of all these
cases as interval-censored.)
In the PR, a new class, `CensoredData`, is defined to represent
censored data. The `fit` method of the univariate continuous
distributions is updated to accept an instance of `CensoredData`. The
`CensoredData` class constructor accepts two arrays, `lower` and
`upper`, to hold the known bounds of the data. (If `lower[i] ==
upper[i]`, it means that data value is not censored.)
Because it is quite common for data to be encountered where the only
censored values in the data set are either all right-censored or all
left-censored, two convenience methods,
`CensoredData.right_censored(x, censored)` and
`CensoredData.left_censored(x, censored)`, are provide to create a
`CensoredData` instance from an array of values and a corresponding
boolean array that indicates if the value is censored.
Here's a quick example, with data from
https://www.itl.nist.gov/div898/handbook/apr/section4/apr413.htm. The
data table shows 10 regular values and 10 right-censored values. The
results reported there for fitting the two-parameter Weibull
distribution (location fixed at 0) to that data are
shape = 1.7208
scale = 606.5280
Here's the calculation using the proposed API in SciPy:
In [55]: from scipy.stats import weibull_min, CensoredData
Create the `CensoredData` instance:
In [56]: x = np.array([54, 187, 216, 240, 244, 335, 361, 373, 375,
386] + [500]*10)
In [57]: data = CensoredData.right_censored(x, x == 500)
In [58]: print(data)
CensoredData(20 values: 10 not censored, 10 right-censored)
Fit `weibull_min` (with the location fixed at 0) to the censored data:
In [59]: shape, loc, scale = weibull_min.fit(data, floc=0)
In [60]: shape
Out[60]: 1.720797180719942
In [61]: scale
Out[61]: 606.527565269458
Matt Haberland has already suggested quite a few improvements to the
PR. Additional comments would be appreciated.
Warren
More information about the SciPy-Dev
mailing list