This was done with zero literature review on the subject, just out of interest.
First, some simple terms: a product's *primary shelf life* is how long it lasts before it is purchased or opened, while still sealed and on the shelf; its *secondary shelf life* is how long it lasts after the package is opened.
Below is an excellent image from "Secondary Shelf Life: an Underestimated Issue".
There are many wrong ways to model primary shelf life. A first approximation is to treat purchase events as independent censoring, but this has a few important flaws, chief among them that purchases are not independent of the product's condition, so treating them as non-informative censoring biases the estimates.
Likewise, modelling the secondary shelf life has its own difficulties.
Based on the above, you may think that adding duration-in-store as a covariate in a survival model would be sufficient, but the problem goes deeper:
Ontario tomatoes appear to have a much lower secondary shelf life than California tomatoes.
The problem is survivorship bias: the worst California tomatoes spoil during transit, and these tomatoes will never be measured for their secondary shelf life.
Where am I going with all this? The idea is to model primary and secondary shelf life together, if the data is available. This model needs to account for:

- **Censoring** due to an administrative stopping time, loss to follow-up, and consumption of the product: we can't watch a can of chickpeas exist for decades, so we must stop our study somewhere. Furthermore, if we rely on consumers to report back to us, some fraction will not, and we have to treat their data as censored. Finally, consumers want to eat the product, and hence we may never see the spoilage date.
- **Variable purchase time**: consumers can purchase a product at any time during its primary shelf life, and our model should be able to handle this.
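To make these mechanisms concrete, here is a sketch of how a single observation might be generated. The distributions and the follow-up horizon below are illustrative assumptions, not the actual data-generating process behind the dataset used later.

```python
import numpy as np

rng = np.random.default_rng(42)

# One subject, simulated under illustrative (assumed) distributions.
MAX_FOLLOW_UP = 14.0                          # administrative stopping time

tau = rng.exponential(2.0)                    # purchase/opening time
primary_spoilage = rng.exponential(18.0)      # spoilage while still in store
if primary_spoilage < tau:
    # spoiled before anyone purchased it
    spoilage = primary_spoilage
else:
    # spoiled some time after purchase
    spoilage = tau + 13.0 * rng.weibull(2.5)

T_obs = min(spoilage, MAX_FOLLOW_UP)          # observed duration
E = spoilage <= MAX_FOLLOW_UP                 # spoilage event observed?
obs_tau = tau < T_obs                         # purchase event observed?
print(T_obs, E, obs_tau)
```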
Let $\tau_i$ denote the time a consumer purchases / opens food product $i$. In survival analysis, it's often useful to model the hazard instead of the survival function. With that in mind, our shelf life hazard model looks like:
$$ h_i(t) = \begin{cases} h_1(t) & \text{if $t \le \tau_i$} \\ h_2(t-\tau_i) & \text{if $t > \tau_i$} \end{cases} $$

This is the most general form. If we think that the same form of degradation that occurs in the store also happens in a consumer's storage, we can model this as:
$$ h_i(t) = \begin{cases} h_1(t) & \text{if $t \le \tau_i$} \\ h_1(t) + h_2(t-\tau_i) & \text{if $t > \tau_i$} \end{cases} $$

For example, this model could represent the risk of spoilage caused by existing microbial growth, $h_1(t)$, plus the additional risk caused by consumer-introduced contamination, $h_2(t)$.
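To make the difference between the two forms concrete, here is a small numerical sketch (the parametric choices below are illustrative): integrating the switched hazard gives $H_1(\tau) + H_2(t-\tau)$ past $\tau$, while integrating the additive hazard gives $H_1(t) + H_2(t-\tau)$.

```python
import numpy as np
from scipy.integrate import quad

# assumed forms: h1 a constant hazard, h2 a Weibull hazard
alpha, lam, rho = 0.05, 13.0, 2.5
h1 = lambda s: alpha
h2 = lambda s: (rho / lam) * (s / lam) ** (rho - 1)

tau, t = 5.0, 9.0

# general model: the hazard switches from h1 to h2 at tau
H_general, _ = quad(lambda s: h1(s) if s <= tau else h2(s - tau),
                    0, t, points=[tau])

# additive model: h2 is layered on top of h1 after tau
H_additive, _ = quad(lambda s: h1(s) + (h2(s - tau) if s > tau else 0.0),
                     0, t, points=[tau])

# these match the closed forms H1(tau) + H2(t - tau) and H1(t) + H2(t - tau)
print(H_general, H_additive)
```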
Estimation of these models is complicated by the censoring and by cases where the purchase/opening event is not observed. Nevertheless, they can be estimated with some careful coding.
import pandas as pd

df = pd.read_csv("shelflife.csv")
df.head(5)
| | T_obs | tau | E | obs_tau |
|---|---|---|---|---|
| 0 | 13.390 | 1.176 | True | True |
| 1 | 0.875 | 0.505 | False | True |
| 2 | 13.003 | 14.003 | False | False |
| 3 | 13.047 | 0.720 | True | True |
| 4 | 0.075 | 1.075 | False | False |
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 6))
ax = fig.subplots()

for i, row in df.loc[:25].iterrows():
    if row['obs_tau']:
        # purchase/opening observed: split the line at tau
        primary = ax.hlines(i, 0, row['tau'], color='r')
        secondary = ax.hlines(i, row['tau'], row['T_obs'], color='r', ls=":")
        purchase = ax.scatter(row['tau'], i, color="r", marker="s", s=12)
    else:
        primary = ax.hlines(i, 0, row['T_obs'], color='r')
    m = "" if not row['E'] else "o"
    spoilage = ax.scatter(row['T_obs'], i, color="k", marker=m, s=12)
    if i == 0:
        primary.set_label('primary shelf life')
        secondary.set_label('secondary shelf life')
        purchase.set_label('opened/purchased event')
        spoilage.set_label('spoilage event')

plt.xlabel("time")
plt.ylabel("subject ID")
plt.legend()
For computational reasons, it's better to express the model as a cumulative hazard:
$$ H_i(t) = \begin{cases} H_1(t) & \text{if $t \le \tau_i$} \\ H_1(\tau_i) + H_2(t-\tau_i) & \text{if $t > \tau_i$} \end{cases} $$

Let's give our model a parametric form:
$$H_1(t) = \alpha t, \qquad H_2(t) = \left(\frac{t}{\lambda}\right)^{\rho}$$

That is, our primary shelf life is modelled by a constant hazard, and the secondary shelf life is modelled by a Weibull. Below is all the code needed to estimate the parameters.
from autograd import elementwise_grad, value_and_grad
from autograd import numpy as np
from scipy.optimize import minimize


def H1(params, t):
    # constant hazard: cumulative hazard is alpha * t
    log_alpha = params[0]
    alpha = np.exp(log_alpha)
    return t * alpha


def H2(params, t):
    # Weibull cumulative hazard: (t / lambda) ** rho
    log_lambda, log_rho = params
    lambda_, rho = np.exp(log_lambda), np.exp(log_rho)
    return (np.clip(t, 1e-4, np.inf) / lambda_) ** rho


def cumulative_hazard(params, t, tau):
    # general model: the hazard switches from h1 to h2 at tau
    # return np.where(t < tau, H1(params[:1], t), H1(params[:1], tau) + H2(params[1:], t - tau))

    # does the product keep degrading in the same manner when owned by the consumer?
    # recall that hazards are additive
    return np.where(t < tau, H1(params[:1], t), H1(params[:1], t) + H2(params[1:], t - tau))


hazard = elementwise_grad(cumulative_hazard, argnum=1)
def log_hazard(params, T, tau):
    return np.log(np.clip(hazard(params, T, tau), 1e-15, np.inf))


def survival_function(params, T, tau):
    return np.exp(-cumulative_hazard(params, T, tau))


def negative_log_likelihood(params, T, E, tau):
    n = T.shape[0]
    ll = 0
    # only uncensored subjects (E is True) contribute the log-hazard term
    ll += log_hazard(params, T[E], tau[E]).sum()
    ll += -cumulative_hazard(params, T, tau).sum()
    return -ll / n
results = minimize(
    value_and_grad(negative_log_likelihood),
    x0=np.array([0.0, 0.0, 0.0]),
    args=(df['T_obs'].values, df['E'].values.astype(bool), df['tau'].values),
    jac=True,
)
print(results.x)
[-2.8866729 2.58703293 0.92460643]
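The parameters were fit on the log scale so that the optimization is unconstrained. As a quick sanity check on the fitted values, exponentiate the first entry to recover the constant primary hazard $\alpha$; its reciprocal is the mean time to spoilage on the shelf, had the product never been purchased.

```python
import numpy as np

# the optimizer's first parameter is log(alpha); exponentiate to get the
# constant primary-shelf-life hazard, whose reciprocal is the mean
# time-to-spoilage if the product were never purchased
alpha = np.exp(-2.8866729)
print(alpha, 1 / alpha)  # roughly 0.056 and 17.9
```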
x = np.linspace(0, 30, 300)

for tau in [5.0, 10.0, 15.0]:
    y = [survival_function(results.x, _, tau) for _ in x]
    plt.plot(x, y, label=r"$\tau=%d$" % tau)

plt.title("Estimated survival function of \nproduct to be purchased at different times")
plt.xlabel("time")
plt.ylabel("survival")
plt.legend()
x = np.linspace(0, 30, 300)
y = [np.exp(-H1(results.x[:1], _)) for _ in x]
plt.plot(x, y)
plt.title("Estimated primary shelf life survival function")
plt.xlabel("time")
plt.ylabel("survival")
x = np.linspace(0, 30, 300)
y = [np.exp(-H2(results.x[1:], _)) for _ in x]
plt.plot(x, y)
plt.title("Estimated secondary shelf life survival function")
plt.xlabel("time")
plt.ylabel("survival")
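A quantity like the median secondary shelf life is easy to read off this fitted Weibull. As a sketch, using the fitted log-parameters printed earlier and the closed form $\lambda (\ln 2)^{1/\rho}$ for a Weibull median:

```python
import numpy as np

# natural-scale Weibull parameters from the fitted log-parameters above
lam, rho = np.exp([2.58703293, 0.92460643])

# the Weibull median has a closed form: lambda * ln(2) ** (1 / rho)
median_secondary = lam * np.log(2) ** (1 / rho)
print(median_secondary)  # roughly 11.5
```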
from lifelines import WeibullFitter, ExponentialFitter

# naive primary-shelf-life-only dataset: an event is observed
# iff E is True and the purchase was never observed (obs_tau is False)
df_primary = df.copy()
df_primary['E'] = (df_primary['E'] & ~df_primary['obs_tau'])

ef = ExponentialFitter()
ef.fit(df_primary['T_obs'], df_primary['E']).print_summary()
| model | lifelines.ExponentialFitter |
|---|---|
| number of observations | 1000 |
| number of events observed | 24 |
| log-likelihood | -170.57 |
| hypothesis | lambda_ != 1 |

| | coef | se(coef) | coef lower 95% | coef upper 95% | z | p | -log2(p) |
|---|---|---|---|---|---|---|---|
| lambda_ | 421.49 | 81.06 | 262.61 | 580.36 | 5.19 | <0.005 | 22.16 |
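Note how biased this naive estimate is: treating purchases as plain censoring hides almost all of the risk. lifelines' `ExponentialFitter` reports the scale $\lambda$, so the implied constant hazard is $1/\lambda$; compare it with the joint model's fitted $\alpha$:

```python
import numpy as np

# lifelines' ExponentialFitter reports the scale lambda_, so the implied
# constant hazard is 1 / lambda_. Compare with the joint model's alpha,
# recovered from the fitted log-parameter printed earlier.
naive_hazard = 1 / 421.49
joint_hazard = np.exp(-2.8866729)
print(naive_hazard, joint_hazard)  # the naive hazard is over 20x smaller
```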
df_secondary = df.copy()
df_secondary = df_secondary[df_secondary['obs_tau']]
df_secondary['T'] = df_secondary['T_obs'] - df_secondary['tau']
wf = WeibullFitter()
wf.fit(df_secondary['T'], df_secondary['E']).print_summary()
| model | lifelines.WeibullFitter |
|---|---|
| number of observations | 749 |
| number of events observed | 443 |
| log-likelihood | -1422.39 |
| hypothesis | lambda_ != 1, rho_ != 1 |

| | coef | se(coef) | coef lower 95% | coef upper 95% | z | p | -log2(p) |
|---|---|---|---|---|---|---|---|
| lambda_ | 12.70 | 0.26 | 12.18 | 13.21 | 44.55 | <0.005 | inf |
| rho_ | 2.31 | 0.08 | 2.14 | 2.47 | 15.87 | <0.005 | 186.10 |
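The naive secondary-shelf-life fit is also off, though less dramatically, for the same survivorship reason as the tomato example: it only sees products whose purchase was observed. For comparison, recover the joint model's natural-scale Weibull parameters from the fitted log-parameters printed earlier:

```python
import numpy as np

# the joint model's Weibull parameters on the natural scale, from the
# fitted log-parameters printed earlier
joint_lambda, joint_rho = np.exp([2.58703293, 0.92460643])
print(joint_lambda, joint_rho)  # roughly 13.3 and 2.5, vs the naive 12.70 and 2.31
```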