Suppose we generate data from a mixture of exponentials with no censoring (this could be extended to censoring, but for our application we don't observe censoring). Furthermore, the data is not observed directly, but our instrument has binned the data into integer buckets. For example, the actual value is 4.3, but our instrument's resolution is too poor to measure the decimal point. So 4.3 goes into the "4.x" bucket.
This model is easy to handle in lifelines. Instead of worrying about the binning, we can treat the system as an interval-censoring problem. That is, if an observation landed in the $i$th bin, then we know the true data point occured somewhere between the $i$th and $i+1$th bin. This is precisely interval censoring.
We can use lifelines custom model semantics to create a mixture model as well. The true model is:
$$S(t) = p \exp\left(-\frac{t}{\lambda_1}\right) + (1-p)\exp\left(-\frac{t}{\lambda_2}\right)$$Therefore the cumulative hazard is:
$$H(t) = -\log(S(t)) = -\log\left(p \exp\left(-\frac{t}{\lambda_1}\right) + (1-p)\exp\left(-\frac{t}{\lambda_2}\right)\right) $$from numpy.random import exponential
from lifelines.fitters import ParametricUnivariateFitter
import autograd.numpy as np
np.random.seed(10)
T = [exponential(20) for _ in range(10000)] + [exponential(40) for _ in range(500)]
counts_obs = np.bincount(T)
T_obs = np.arange(np.amax(T))
plt.figure(figsize=(13,5))
plt.hist(counts_obs, bins=T_obs, density=True)
plt.title("Histogram of observed durations")
plt.xlim(0, 100)
(0, 100)
# To help the model avoid the "non-identibility" problem, we can set the *upper* bound of the first lambda to be
# the average of the observed data, and the *lower* bound of the second lambda to the same value. Why?
# We'd like to partition the postive reals into two halves, each containing one of the lambdas. The sample
# mean of the data is v = p * lambda1 + (1-p) * lambda2, which has the property lambda1 < v < lambda2, therefore
# it will partition the space correctly.
mean_obs = np.average(T_obs, weights=counts_obs)
print(mean_obs)
20.015428571428572
class MixtureExponential(ParametricUnivariateFitter):
_fitted_parameter_names = ['lambda1', 'lambda2', 'p']
_bounds = [(0, mean_obs), (mean_obs, None), (0, 1)]
def _cumulative_hazard(self, params, t):
l1, l2, p = params
return -np.log(p * np.exp(-t / l1) + (1-p) * np.exp(-t / l2))
model = MixtureExponential()
model.fit_interval_censoring(
lower_bound=T_obs,
upper_bound=T_obs + 1,
weights=counts_obs,
initial_point=np.array([mean_obs / 2, mean_obs * 2, 0.5])
)
model.print_summary()
model | lifelines.MixtureExponential |
---|---|
number of observations | 10500 |
number of events observed | 0 |
log-likelihood | -42208.35 |
hypothesis | lambda1 != 10.0077, lambda2 != 40.0309, p != 0.5 |
coef | se(coef) | coef lower 95% | coef upper 95% | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|
lambda1 | 19.00 | 1.01 | 17.01 | 20.98 | 8.88 | <0.005 | 60.36 |
lambda2 | 36.46 | 10.32 | 16.24 | 56.68 | -0.35 | 0.73 | 0.46 |
p | 0.91 | 0.10 | 0.71 | 1.11 | 4.06 | <0.005 | 14.30 |
model.plot_cumulative_hazard(figsize=(8,5))
<matplotlib.axes._subplots.AxesSubplot at 0x11d8f9390>