We describe the penPHcure R package, which implements the semiparametric proportional-hazards (PH) cure model of Sy and Taylor (2000) extended to time-varying covariates and the variable selection technique based on its SCAD-penalized likelihood proposed by Beretta and Heuchenne (2019b). In survival analysis, cure models are a useful tool when a fraction of the population is likely to be immune from the event of interest. They can separate the effects of certain factors on the probability of being susceptible and on the time until the occurrence of the event. Moreover, the penPHcure package allows the user to simulate data from a PH cure model, where the event-times are generated on a continuous scale from a piecewise exponential distribution conditional on time-varying covariates, with a method similar to Hendry (2014). We present the results of a simulation study to assess the finite sample performance of the methodology and illustrate the functionalities of the penPHcure package using criminal recidivism data.
In contrast to other statistical methods, survival analysis models are designed to model the time to an event of interest (e.g., death or occurrence of a disease in medical studies). A typical feature of time-to-event data is the presence of right censoring, an incomplete information problem that arises when a subject is lost to follow-up or does not experience the event before the end of the study. In these cases, it is unknown whether the subject will eventually experience the event and when it will occur, given that it can occur. The most common assumption of standard survival analysis models is that the whole population will sooner or later experience the event of interest. However, in practice, this may not be the case because a fraction of the population may be immune (i.e., not susceptible) to this event. Cure models, also known as split population duration models or limited-failure population models, were developed to handle this kind of situation. They allow us to investigate the effects of some covariates (e.g., type of treatment, stage of the tumor, sex, or age) on the probability to be susceptible to the event of interest (i.e., incidence), and on the survival time conditional on being susceptible (i.e., latency).
Originally, cure models were introduced in the medical literature by Boag (1949) and Berkson and Gage (1952), but they have been used in several other disciplines during the years. In reliability engineering, Meeker (1987) investigates the failure of solid-state electronic components (e.g., integrated circuits). In social science, Schmidt and Witte (1989) investigate the timing of return to prison for a sample of prison releases, and they use it to make predictions of whether or not individuals return to prison. In finance, Cole and Gunther (1995) analyze the determinants of commercial bank failures in the United States; in credit scoring, Tong et al. (2012) predict defaults on a portfolio of UK personal loans. In political science, Svolik (2008) studies the likelihood that a democracy consolidates and the timing of authoritarian reversals in democracies that are not consolidated. In marketing, Polo et al. (2011) investigate the drivers of customer retention in a liberalizing market, using data for a sample of 650 consumers in the Spanish mobile phone industry. In the literature, several variants of cure models have been proposed (see Amico and Van Keilegom (2018) for a comprehensive survey), which belong to two main families: mixture cure models and promotion time cure models.
In this article, we present the penPHcure package (Beretta and Heuchenne 2019a), which
implements the semiparametric proportional-hazards (PH) mixture cure
model of Sy and Taylor (2000) extended to time-varying covariates, where the
incidence and latency distributions are modeled by a logistic regression
and a Cox’s PH model (Cox 1972), respectively. The penPHcure package
contains two main functions: penpHcure
, to estimate the regression
coefficients, their confidence intervals using the basic/percentile
bootstrap method, and to perform variable selection using the
SCAD-penalized likelihood technique proposed by Beretta and Heuchenne (2019b);
and penpHcure.simulate
to simulate data from a PH cure model, where
the event-times are generated on a continuous scale from a piecewise
exponential distribution conditional on time-varying covariates, using a
method similar to the one described in Hendry (2014).
At the time of writing this article, we are unaware of other R packages
for estimation of semiparametric PH mixture cure models with
time-varying covariates and, above all, that enable the user to perform
variable selection. In the context of cure models for right-censored
data, available R packages include: the
flexsurvcure
package (Amdahl 2019) for estimation of parametric mixture and
non-mixture cure models with time-invariant covariates using
time-to-event distributions from the
flexsurv package
(Jackson 2016); the nltm
package (Garibotti et al. 2019) for estimation of the semiparametric PH cure model with
time-invariant covariates of Tsodikov et al. (2003), as well as other
nonlinear transformation models for analyzing survival data using the
method of Tsodikov (2003); and the
smcure package (Cai et al. 2012)
for estimation of the semiparametric PH cure model and the accelerated
failure time cure model with time-invariant covariates and the
spduration package
(Beger et al. 2018) that implements a parametric cure model with time-varying
covariates using Weibull and Log-Logistic latency distributions.
Compared to spduration
, the penPHcure package has some advantages:
the latency distribution is modeled by a more flexible semiparametric
Cox’s PH model; the response variable and the time to the event of
interest are continuous; and, above all, it allows the user to
simultaneously select variables and estimate their parameters using a
variable selection technique based on SCAD penalties.
The remainder of this article is structured as follows:
penPHcure
function when the argument
pen.type
is set equal to "none"
(default);
penPHcure
function when pen.type=="SCAD"
;penPHcure.simulate
function, which generates data from a PH
cure model with time-varying covariates.penPHcure
function.penPHcure
function to analyze a real data
set.Let \(Y\) be a Bernoulli random variable indicating whether an individual is susceptible (\(Y=1\)) or immune (\(Y=0\)) to the event of interest with probability \(p=P(Y=1)\). Let \(T\) be the time to event, defined only when \(Y=1\). Assuming that a fraction of the population is immune to the event of interest, the marginal survival function of \(T\) is defined as \[S(t)=(1-p)+pS(t|Y=1),\] where \(p\) is the incidence (i.e., probability of being susceptible) and \(S(t|Y=1)\) is the latency (i.e., survival function conditional on being susceptible).
The incidence is modeled by a logistic regression model: \[p(\textbf{x})=P(Y=1|\mathbf{x})=\frac{\exp(\mathbf{x}'\textbf{b})}{1+\exp(\mathbf{x}'\textbf{b})},\] where \(\textbf{x}\) is a vector of time-fixed covariates (including the intercept) and \(\mathbf{b}\) a vector of unknown coefficients. Whereas the latency is modeled by a Cox’s PH model: \[S(t|Y=1,\bar{\textbf{z}}(t))=\exp\left( - \int_0^t h_{0}(u)e^{\textbf{z}'(u)\boldsymbol{\beta} du}\right),\] where \(\textbf{z}(t)\) is a vector of time-varying covariates (we denote by \(\bar{\textbf{z}}(t)\) the full history of the covariates up to time \(t\)), \(\boldsymbol{\beta}\) is a vector of unknown coefficients, and \(h_{0}(t)\) is an arbitrary baseline conditional hazard function.
Let \(\textbf{O}=\left\{ (t_{i},\delta_{i},\bar{\textbf{z}}_{i}(t_i),\textbf{x}_{i}) ; i=1,...,n \right \}\) denote the observed data, where \(t_{i}\) is the event/censoring time and \(\delta_{i}\) is the censoring indicator, which takes value 1 if \(t_{i}\) is uncensored and 0 otherwise. Since we know that \(y_i=1\) when \(\delta_i=1\), but \(y_i\) is unobserved when \(\delta_i=0\), we can estimate the unknown parameters \(\boldsymbol{\theta}=(\textbf{b},\boldsymbol{\beta},h_{0})\) using the Expectation-Maximization (EM) algorithm (Dempster et al. 1977). The complete-data likelihood can be written as \[\label{eq:comp_data_likelihood} L_{C}\left(\textbf{b},\boldsymbol{\beta},h_{0}\right)=\underbrace{ \prod_{i=1}^{n}p(\textbf{x}_i)^{y_{i}}\left[1-p(\textbf{x}_i)\right]^{(1-y_{i})} }_{L_{1}(\textbf{b})} \times \underbrace{ \prod_{i=1}^{n}\left[h_{0}(t_{i})e^{\textbf{z}'_{i}(t_{i})\boldsymbol{\beta}}\right]^{\delta_{i}y_{i}}\left[e^{-\int_{0}^{t_i}h_{0}(u)e^{\textbf{z}'_{i}(u)\boldsymbol{\beta}}\ du}\right]^{y_{i}} }_{L_{2}(\boldsymbol{\beta},h_{0})}, \tag{1}\] i.e., the product between the incidence component \(L_{1}\) depending on a set of time-fixed covariates \(\textbf{x}_i\), and the latency component \(L_{2}\) depending on a set of time-varying covariates \(\textbf{z}'_{i}(t)\).
Given some starting values \(\boldsymbol{\theta}^{(0)}\), the \(m\)-th iteration of the EM algorithm consists of two steps:
Compute the expectation of the complete-data likelihood with respect to the conditional distribution of the \(y_i\)’s given the current parameter estimates \(\hat{\boldsymbol{\theta}}^{(m-1)}\) and the observed data \(\textbf{O}\). This expectation is obtained by replacing the \(y_i\)’s in (1) by their expectation \[\pi_{i}^{(m)}=E\left[Y_{i}\left|\hat{\boldsymbol{\theta}}^{(m-1)},\textbf{O}\right.\right]= \delta_i + (1-\delta_i) \frac{p(\textbf{x}_i)S(t|Y=1,\bar{\textbf{z}}_i(t))}{1-p(\textbf{x}_i)+p(\textbf{x}_i)S(t|Y=1,\bar{\textbf{z}}_i(t))}.\] Note that we removed the dependence of the theoretical functions on the estimated parameters to simplify the notation.
Maximize the expected complete-data likelihood with respect to \(\textbf{b}\), \(\boldsymbol{\beta}\), and the function \(h_0\). Given \(\boldsymbol{\pi}^{(m)}=\{\pi_{1}^{(m)},...,\pi_{n}^{(m)}\}\), the incidence component \(L_{1}\) in (1) is maximized using the Newton-Raphson method as in the classical logistic regression model. Whereas the latency component \(L_{2}\) in (1) is maximized using a profile likelihood approach. The latter involves two steps: (i) the baseline conditional hazard function is estimated nonparametrically by \[\label{eq:base_haz_est} \hat{h}_0(t) = \frac{1}{\left(t_{(j)}-t_{(j-1)}\right)\sum_{i\in R_{j}}\pi_{i}^{(m)}e^{\textbf{z}'_{i}(t_{(j)})\boldsymbol{\beta}}}, \hbox{ for } t\in(t_{(j-1)},t_{(j)}], \tag{2}\] where \(t_{(1)}\leq...\leq t_{(k)}\) are the \(k\) ordered event times and \(R_j\) is the risk set at \(t_{(j)}^-\) (i.e., the set of all individuals who did not experience the event of interest and have not been censored just prior to time \(t_{(j)}\)); and then (ii) the function \(h_0\) in \(L_2\) is replaced by its estimator given in (2) to obtain the following partial likelihood, which does not depend on the function \(h_0\) anymore, \[\label{eq:partial_like} \tilde{L}_{2}\left(\boldsymbol{\beta}\left|\boldsymbol{\pi}_{i}^{(m)}\right.\right)= \prod_{j=1}^{k}\frac{e^{\textbf{z}_{i}(t_{(j)})\boldsymbol{\beta}}}{\sum_{i\in R_{j}}\pi_{i}^{(m)}e^{\textbf{z}'_{i}(t_{(j)})\boldsymbol{\beta}}}. \tag{3}\] Finally, the latency component is estimated by maximizing (3) with respect to \(\boldsymbol{\beta}\). In case of tied event-times, (2) and (3) can be rewritten using the Breslow (1974) or Efron (1977) approximation as in the standard Cox’s PH model.
The EM algorithm terminates whenever \(\left\|\hat{\textbf{b}}^{(m)}-\hat{\textbf{b}}^{(m-1)}\right\|_2 < \epsilon\) and \(\left\|\hat{\boldsymbol{\beta}}^{(m)}-\hat{\boldsymbol{\beta}}^{(m-1)}\right\|_2 < \epsilon\), where \(\epsilon\) is a tolerance threshold (by default \(10^{-6}\)).
When the number of available covariates is large, fitting all possible subsets to find the most relevant covariates would be too time consuming. Beretta and Heuchenne (2019b) proposed a regularization method based on the maximization of a penalized version of the complete-data log-likelihood
\[\ell_C^P \left(\boldsymbol{\theta};\lambda_{1},\lambda_{2}\right) = \underbrace{\ell_1 \left(\textbf{b}\right) -n\sum_{j=2}^{q_1+1}p_{\lambda_{1}}\left(|b_{j}|\right)}_{\ell_1^P \left(\textbf{b};\lambda_{1}\right)} + \underbrace{\ell_2 \left(\boldsymbol{\beta},\textbf{h}_{0}\right) - n\sum_{l=1}^{q_2}p_{\lambda_{2}}\left(|\beta_{l}|\right)}_{\ell_2^P \left(\boldsymbol{\beta},\textbf{h}_{0};\lambda_{2}\right)},\] where \(\ell\) denotes a log-likelihood and \(p_\lambda(.)\) a SCAD penalty function, which role is to shrink the small coefficients toward zero. We assume that the \(q_1\) and \(q_2\) covariates in the incidence and latency component, respectively, have been standardized, such that the coefficients in \(\mathbf{b}\) and \(\boldsymbol{\beta}\) are on the same scale. The Smoothly Clipped Absolute Deviation (SCAD) penalty function (Fan and Li 2001) is given by \[\label{eq:SCAD_penalty} p_{\lambda}(|\beta_{j}|)=\begin{cases} \lambda|\beta_{j}|, & \textrm{if }|\beta_{j}|\leq\lambda\\ \frac{(a^{2}-1)\lambda^{2}-(|\beta_{j}|-a\lambda)^{2}}{2(a-1)}, & \textrm{if }\lambda<|\beta_{j}|\leq a\lambda\\ \frac{(a+1)\lambda^{2}}{2}, & \textrm{if }|\beta_{j}|>a\lambda \end{cases}, \tag{4}\] for some \(a>2\) and \(\lambda>0\). As explained by Fan and Li (2001; Fan and Li 2002) in the context of linear regression, generalized linear models, Cox’s PH model, and frailty models, the SCAD estimator has three desirable properties: unbiasedness (do not penalize to much large coefficients), sparsity, and continuity. Moreover, with a proper choice of the tuning parameters \((\lambda,a)\), it also possesses what is known as the “oracle property”, meaning that the SCAD estimator is asymptotically equivalent to the oracle estimator (i.e., an estimator with only the relevant variables with nonzero coefficients).
For fixed values of the tuning parameters
\((\lambda_1,\lambda_2,a_1,a_2)\), estimates of \(\boldsymbol{\theta}\) can
be obtained using an EM algorithm close to the one described in the
previous section. The only difference lies in the M-step. Given that the
penalized log-likelihoods \(\ell_1^P \left(\textbf{b};\lambda_{1}\right)\)
and \(\tilde{\ell}_2^P \left(\boldsymbol{\beta};\lambda_{2}\right)\),
where \(\tilde{\ell}_2^P \left(\boldsymbol{\beta};\lambda_{2}\right)\) is
the logarithm of ((3)), are non-concave and
non-differentiable at the origin, \(\textbf{b}\) and \(\boldsymbol{\beta}\)
are now estimated using the MM algorithm of Hunter and Li (2005) based on a
perturbed Local Quadratic Approximation (LQA) of the penalty function.
By default, when the argument pen.type
of the function penPHcure
is
set equal to "SCAD"
, the initial values for \(\mathbf{b}\) and
\(\boldsymbol{\beta}\) are vectors with all elements equal to zero.
Otherwise, the user can specify other values (e.g., the estimated
coefficients of the model with all covariates) using the argument SV
.
Regarding the choice of the tuning parameters, following the suggestion of Fan and Li (2001), we keep \(a_1=a_2=3.7\), and given a set of possible values for \((\lambda_1,\lambda_2)\), we select the ones that minimize the following Akaike (AIC) or Bayesian (BIC) Information Criteria: \[\mathrm{AIC}\left(\lambda_{1},\lambda_{2}\right)=-2\ell\left(\hat{\boldsymbol{\theta}}_{\lambda_{1},\lambda_{2}}\right) + 2 \nu; \ \hbox{ and}\]
\[\mathrm{BIC}\left(\lambda_{1},\lambda_{2}\right)=-2\ell\left(\hat{\boldsymbol{\theta}}_{\lambda_{1},\lambda_{2}}\right) + \ln(n) \nu,\] where \(\ell\left(\hat{\boldsymbol{\theta}}_{\lambda_{1},\lambda_{2}}\right)\) is the observed data log-likelihood evaluated at the penalized MLE \(\hat{\boldsymbol{\theta}}_{\lambda_{1},\lambda_{2}}\) and \(\nu\) is the number of nonzero coefficients, identified as the number of coefficients with an absolute value greater than a given threshold (by default \(10^{-6}\)).
Let \(\mathbf{S}=\{s_1,s_2,...,s_J\}\) be a partition of the time scale
forming \(J+1\) intervals \((0,s_1]\), \((s_1,s_2]\), \(...\), \((s_{J-1},s_J]\),
\((s_{J},\infty]\). Define a vector of time-varying covariates piecewise
constant in each interval: \(\mathbf{z}(t)=\mathbf{z}_j\), for
\(t\in(s_{j-1},s_j]\). Consider a transformation \(g\), such that \(g(0)=0\),
\(g(t)\) is strictly increasing for \(t>0\), and \(g^{-1}(t)\) is
differentiable. In the implementation of the penPHcure.simulate
function, we use \(g(t)=t^{1/\gamma}\), where the parameter \(\gamma\) can
be specified by the user via the argument gamma
, which by default is
equal to 1. According to Hendry (2014), if we generate a random variable
\(V\) as a piecewise exponential distribution with density function given
by
\[f_V(t) = \prod_{l=1}^{j-1} \exp\{-\lambda_l[g^{-1}(s_l)-g^{-1}(s_{l-1})]\} \lambda_j \exp\{-\lambda_j[t-g^{-1}(s_{j-1})]\}, \ \ \hbox{for } t\in(g^{-1}(s_{j-1}),g^{-1}(s_j)],\]
where \(\lambda_j=\exp(\mathbf{z}_j'\boldsymbol{\beta})\) is the constant
hazard in the interval \((g^{-1}(s_{j-1}),g^{-1}(s_{j})]\), then \(g(V)\)
follows a Cox’s PH model with time-varying covariates with a baseline
hazard function given by \(h_0(t)=\frac{d}{dt}[g^{-1}(t)]\). This method
is part of the algorithm implemented in the penPHcure.simulate
function to simulate data from a PH cure model with time-varying
covariates (see Table 1 for a detailed description).
Require: \(N\), sample size; \(\mathbf{S}\), partition of the time scale; \(g(t)\), variable transformation; \(\mathbf{b}\), incidence coefficients; \(\boldsymbol{\beta}\), latency coefficients. |
for \(i = 1,...,N\) do |
|
if \(y_i=0\) or \(w_i>c_i\) then |
\(t_i = c_i\); |
\(\delta_i=0\); |
else |
\(t_i = w_i\); |
\(\delta_i=1\); |
end if |
end for |
Return: \(\{(t_i,\delta_i,\mathbf{z}_i,\mathbf{x}_i); i=1,...,N\}\) |
In this section, we present the results of a simulation study conducted
to assess the finite sample performance of the PH cure model estimation
and its variable selection technique implemented in the penPHcure
function. The event-times follow a Cox’s PH model with baseline hazard
function \(h_0(t)=3t^2\) and 8 time-varying covariates. These covariates
are constant within \(J=30\) equally-spaced intervals \((0,s_1]\),
\((s_1,s_2]\), \(...\), \((s_{J-1},s_J]\), where \(s_1=0.2\) and \(s_J=6\). They
follow a multivariate normal distribution
\(\mathbf{z}_{i,j}\sim N(\mathbf{0},\boldsymbol{\varSigma})\), where
\(\varSigma_{p,q}=0.5^{|p-q|}\), for \(p,q=1,...,8\). The censoring times
follow an exponential distribution truncated above \(6\) and with
parameter \(\lambda_C\). The cure indicators are generated from a logistic
regression model with 8 time-fixed covariates that follow a multivariate
normal distribution
\(\mathbf{x}_{i}\sim N(\mathbf{0},\boldsymbol{\varSigma})\), where
\(\varSigma_{p,q}=0.5^{|p-q|}\), for \(p,q=1,...,8\). Finally, the
regression coefficients vectors are set equal to
\(\boldsymbol{\beta}_0=(-0.7,0,1,0,-0.5,0.75,0,0)'\) and
\(\mathbf{b}_0=(b_0,1.5,0,-0.75,0,-1.5,0,0.75,0)'\).
We consider 6 simulation settings, with different levels of censoring
and proportions of non-susceptible individuals (expressed as a fraction
of the sample size), depending on different values of \(b_0\) and
\(\lambda_C\) (see Table 2). For each of these
settings, we generated 500 replications using the penPHcure.simulate
function for different sample sizes \(N=\{250,500,1000\}\). Then, for all
simulated datasets, we use the penPHcure
function to (i) fit a
standard PH cure model with all covariates (FULL), (ii) fit a standard
PH cure model with the covariates associated to the non-zero
coefficients only (ORACLE), and (iii) to perform variable selection
using the regularization method with SCAD penalties and tuning
parameters chosen according to the BIC criterion. The possible values of
the tuning parameters \((\lambda_1,\lambda_2)\) are obtained with the
function exp(seq(-6, -1, length.out = 10))
, whereas \((a_1,a_2)\) are
kept equal to \(3.7\). Furthermore, we use the coxph
function in the
survival package
(Therneau 2015) to fit the classical Cox’s PH model with the covariates
associated to the non-zero coefficients only (COX).
Censoring | Cure | \(\lambda_C\) | \(b_0\) |
---|---|---|---|
Low (40%) | High (30%) | 0.02 | 1.45 |
Low (40%) | Medium (20%) | 0.3 | 2.35 |
Medium (60%) | High (45%) | 0.35 | 0.35 |
Medium (60%) | Medium (30%) | 0.75 | 1.45 |
High (80%) | High (60%) | 0.95 | -0.7 |
High (80%) | Medium (40%) | 1.55 | 0.7 |
The performance is measured in terms of Mean Estimation Error (MEE) and average number of correct and incorrect zeros identified by the variable selection technique (SCAD). In particular, the estimation error for the incidence component is computed as \(E\left[ \left( \hat{p}(\mathbf{x}) - p_0(\mathbf{x}) \right)^2 \right]\), where \(\hat{p}(\mathbf{x})\) and \(p_0(\mathbf{x})\) are the estimated and true probabilities of being susceptible. Whereas the estimation error for the latency component is computed as \(E\left[ \left( \hat{S}(T|Y=1) - S_0(T|Y=1) \right)^2 \right]\), where \(\hat{S}(T|Y=1)\) and \(S_0(T|Y=1)\) are the estimated and true survival functions conditional on being susceptible.
In 1,2, we provide the
MEEs for the incidence and latency components, respectively, while in
3, we provide the average number of correct and
incorrect zeros. From those figures, we can see that the PH cure model
estimation and its variable selection technique implemented in the
penPHcure
function perform reasonably well. For an increase of the
sample size or a decrease of the level of censoring, the MEE decreases,
and the number of correct (resp. incorrect) zeros converges to 4 (resp.
0). The MEEs of the ORACLE model are always the lowest ones, but we
notice that the ones of the SCAD method tend towards them as the sample
size increases. It is important to note that, for a fixed level of
censoring, we observe higher MEEs in the case of a lower fraction of
cured individuals. The worst results are obtained in situations of high
censoring and low cure rates, but it is enough to increase the sample
size to obtain better results. This is evidence of the fact that a cure
model should always be applied to data with a sufficient number of
non-susceptible individuals. Last but not the least important, we note
that the use of the classical Cox’s PH model (COX) leads to very high
errors. This was expected since the model is wrongly specified as it
ignores the existence of cured subjects.
Finally, in Table 3, we also present the coverage probabilities of the estimated 95% confidence intervals for the ORACLE model using the basic and percentile bootstrap methods with 500 resamples. In most cases, the basic bootstrap method outperforms the percentile bootstrap method, especially for the smallest sample sizes, with coverage probabilities closer to the 95% nominal level.
The R code used to obtain the results in
1,2,3
(resp. Table 3) are provided in Section 2
(resp. Section 3) of the supplementary material (beretta-heuchenne.R
).
Moreover, in the file beretta-heuchenne-suppl.pdf
, we provide a table
with all the results contained in
1,2,3.
cens | cure | N | Method | \(b_0\) | \(b_1\) | \(b_2\) | \(b_3\) | \(b_4\) | \(\beta_1\) | \(\beta_2\) | \(\beta_3\) | \(\beta_4\) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.40 | 0.30 | 250 | basic | 0.97 | 0.97 | 0.98 | 0.97 | 0.97 | 0.96 | 0.97 | 0.96 | 0.94 |
perc | 0.9 | 0.91 | 0.94 | 0.91 | 0.94 | 0.93 | 0.94 | 0.95 | 0.94 | |||
500 | basic | 0.97 | 0.95 | 0.95 | 0.94 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | ||
perc | 0.91 | 0.91 | 0.94 | 0.92 | 0.94 | 0.94 | 0.91 | 0.95 | 0.95 | |||
1000 | basic | 0.95 | 0.96 | 0.95 | 0.97 | 0.97 | 0.93 | 0.95 | 0.95 | 0.96 | ||
perc | 0.94 | 0.94 | 0.93 | 0.95 | 0.97 | 0.93 | 0.95 | 0.95 | 0.95 | |||
0.40 | 0.20 | 250 | basic | 0.95 | 0.96 | 0.99 | 0.96 | 0.97 | 0.96 | 0.97 | 0.97 | 0.96 |
perc | 0.89 | 0.9 | 0.94 | 0.91 | 0.93 | 0.94 | 0.96 | 0.96 | 0.96 | |||
500 | basic | 0.98 | 0.97 | 0.97 | 0.97 | 0.98 | 0.95 | 0.96 | 0.96 | 0.96 | ||
perc | 0.93 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.93 | 0.93 | |||
1000 | basic | 0.97 | 0.96 | 0.96 | 0.95 | 0.95 | 0.94 | 0.93 | 0.96 | 0.95 | ||
perc | 0.95 | 0.94 | 0.94 | 0.93 | 0.94 | 0.94 | 0.94 | 0.96 | 0.95 | |||
0.60 | 0.45 | 250 | basic | 0.98 | 0.95 | 0.96 | 0.96 | 0.97 | 0.98 | 0.98 | 0.97 | 0.97 |
perc | 0.95 | 0.91 | 0.93 | 0.93 | 0.94 | 0.93 | 0.94 | 0.94 | 0.94 | |||
500 | basic | 0.98 | 0.95 | 0.94 | 0.97 | 0.96 | 0.94 | 0.94 | 0.96 | 0.95 | ||
perc | 0.96 | 0.93 | 0.94 | 0.92 | 0.93 | 0.94 | 0.92 | 0.95 | 0.95 | |||
1000 | basic | 0.95 | 0.96 | 0.96 | 0.95 | 0.95 | 0.95 | 0.94 | 0.93 | 0.95 | ||
perc | 0.95 | 0.94 | 0.94 | 0.93 | 0.94 | 0.94 | 0.92 | 0.94 | 0.93 | |||
0.60 | 0.30 | 250 | basic | 0.96 | 0.94 | 0.98 | 0.94 | 0.99 | 0.97 | 0.96 | 0.96 | 0.96 |
perc | 0.89 | 0.9 | 0.91 | 0.89 | 0.93 | 0.94 | 0.94 | 0.95 | 0.95 | |||
500 | basic | 0.96 | 0.97 | 0.98 | 0.95 | 0.97 | 0.96 | 0.96 | 0.95 | 0.95 | ||
perc | 0.94 | 0.91 | 0.94 | 0.93 | 0.93 | 0.95 | 0.96 | 0.95 | 0.95 | |||
1000 | basic | 0.96 | 0.96 | 0.96 | 0.95 | 0.96 | 0.96 | 0.95 | 0.95 | 0.96 | ||
perc | 0.95 | 0.94 | 0.94 | 0.92 | 0.94 | 0.94 | 0.94 | 0.94 | 0.95 | |||
0.80 | 0.60 | 250 | basic | 0.99 | 0.97 | 0.98 | 0.97 | 0.97 | 0.97 | 0.96 | 0.97 | 0.97 |
perc | 0.94 | 0.91 | 0.92 | 0.91 | 0.92 | 0.94 | 0.92 | 0.96 | 0.94 | |||
500 | basic | 0.98 | 0.95 | 0.97 | 0.95 | 0.97 | 0.97 | 0.96 | 0.98 | 0.98 | ||
perc | 0.94 | 0.93 | 0.95 | 0.93 | 0.94 | 0.95 | 0.94 | 0.95 | 0.96 | |||
1000 | basic | 0.96 | 0.95 | 0.96 | 0.94 | 0.96 | 0.96 | 0.95 | 0.94 | 0.96 | ||
perc | 0.93 | 0.95 | 0.94 | 0.93 | 0.95 | 0.95 | 0.94 | 0.93 | 0.93 | |||
0.80 | 0.40 | 250 | basic | 0.98 | 0.95 | 0.99 | 0.98 | 1 | 0.97 | 0.96 | 0.99 | 0.99 |
perc | 0.92 | 0.88 | 0.91 | 0.91 | 0.93 | 0.95 | 0.93 | 0.97 | 0.95 | |||
500 | basic | 0.97 | 0.95 | 0.97 | 0.96 | 0.98 | 0.96 | 0.96 | 0.97 | 0.96 | ||
perc | 0.94 | 0.9 | 0.93 | 0.91 | 0.93 | 0.94 | 0.94 | 0.96 | 0.94 | |||
1000 | basic | 0.96 | 0.96 | 0.97 | 0.95 | 0.95 | 0.98 | 0.95 | 0.93 | 0.96 | ||
perc | 0.96 | 0.95 | 0.95 | 0.94 | 0.93 | 0.94 | 0.94 | 0.93 | 0.93 |
In this section, we illustrate the use of the penPHcure R package
using a Criminal Recidivism dataset, which contains a sample of 432
inmates released from Maryland state prisons and followed for one year
after release (Rossi et al. 1980). The aim of this study was to investigate
the relationship between the time to first arrest after release and some
covariates observed during the follow-up period. In particular, to study
the effect of providing financial aid at the moment of release. The
original source of the data is the Rossi
dataset in the
RcmdrPlugin.survival
package (Fox and Carvalho 2012), which has been converted into a counting
process format and included in the penPHcure package.
Let us load and illustrate the dataset:
> library(penPHcure)
> data("cpRossi", package = "penPHcure")
> head(cpRossi)
id tstart tstop arrest fin age race wexp mar paro prio educ emp1 1 0 20 yes no 27 black no no yes 3 3 no
2 2 0 9 no no 18 black no no yes 8 4 no
3 2 9 14 no no 18 black no no yes 8 4 yes
4 2 14 17 yes no 18 black no no yes 8 4 no
5 3 0 16 no no 19 other yes no yes 13 3 no
6 3 16 17 no no 19 other yes no yes 13 3 yes
> str(cpRossi)
'data.frame': 1405 obs. of 13 variables:
$ id : int 1 2 2 2 3 3 3 4 4 4 ...
$ tstart: int 0 0 9 14 0 16 17 0 4 21 ...
$ tstop : int 20 9 14 17 16 17 25 4 21 31 ...
$ arrest: Factor w/ 2 levels "no","yes": 2 1 1 2 1 1 2 1 1 1 ...
$ fin : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
$ age : int 27 18 18 18 19 19 19 23 23 23 ...
$ race : Factor w/ 2 levels "black","other": 1 1 1 1 2 2 2 1 1 1 ...
$ wexp : Factor w/ 2 levels "no","yes": 1 1 1 1 2 2 2 2 2 2 ...
$ mar : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 1 1 1 ...
$ paro : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ prio : int 3 8 8 8 13 13 13 1 1 1 ...
$ educ : Factor w/ 3 levels "3","4","5": 1 2 2 2 1 1 1 3 3 3 ...
$ emp : Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 2 1 ...
The object cpRossi
is a data.frame
in counting process format with
1405 observations for 432 individuals on 13 variables. The id
variable
provides the unique identification number for every individual in the
study. The variables tstart
and tstop
denote the time interval of
the observation (measured in weeks). The variable arrest
denotes
whether the individual has been arrested during the 1-year follow-up
period. The remaining explanatory variables are described hereafter.
fin
. Financial aid received after release: yes
or no
;age
. Age in years at the time of release;race
. Race of the individual: black
or other
;wexp
. Full-time work experience before incarceration: yes
or
no
;mar
. Married at the time of release: yes
or no
;paro
. Released on parole: yes
or no
;prio
. Number of convictions prior to incarceration;educ
. Level of education: <=9th degree (“3”), 10th or 11th degree
(“4”), or >=12th degree (“5”);emp
. Working full time during the observed time interval: yes
or
no
. This is the only variable which is varying over time (e.g.,
the individual with id = 2
did not work full time during the first
9 weeks after release, then he did for 5 weeks, and, finally, he has
been arrested after 3 weeks without working full time).Using the penPHcure
function, by default, we can fit the standard PH
cure model. First, we use a formula object with the response on the left
of the tilde operator and the explanatory variables to be included in
the latency component on the right. The response is a survival object
returned by the Surv(tstart,tstop,arrest)
function. Then, using the
argument cureform
, we specify the explanatory variables to be included
in the incidence component. By default, these covariates are set equal
to the last observation, but in this case, we set the argument
which.X = "mean"
to compute the time-weighted average over the full
history. Finally, setting the argument inference = TRUE
, we conduct
inference about the parameter estimates via bootstrapping (by default,
100 bootstrap resamples). The user can increase/decrease the number of
bootstrap resamples with the argument nboot
.
> set.seed(123) # for reproducibility
> fit <- penPHcure(Surv(tstart,tstop,arrest)~fin+age+race+wexp+mar+paro+prio+educ+emp,
+ cureform = ~fin+age+race+wexp+mar+paro+prio+educ+emp,
+ data = cpRossi,which.X = "mean",inference = TRUE)
-varying covariates...
Initializing PH cure model with time
: 432
Number of individuals: 0.736
Censoring proportion: TRUE
Tied failure times: 49
Number of unique failure timesin the survival component: 10
Number of covariates in the cure component: 10
Number of covariates
Checking starting values...
-varying covariates... Please wait...
Fitting standard PH cure model with time
Performing inference via bootstrapping... Please wait ...|==============================================================================| 100%
! DONE
This call to the penPHcure
function returned an object of class
"PHcure"
, and we can print a summary of the results using the
summary
method. By default, confidence intervals are computed using
the basic bootstrap method (the alternative is percentile bootstrap) and
a confidence level of 95%. In order to control these features, the user
can provide the arguments conf.int
and conf.int.level
, respectively.
> summary(fit)
------------------------------------------------------
+++ PH cure model with time-varying covariates +++
------------------------------------------------------
: 432
Sample size: 0.7361111
Censoring proportion: 49
Number of unique event times: TRUE
Tied failure times
-likelihood: -643.65
log
------------------------------------------------------
+++ Cure (incidence) coefficient estimates +++
+++ and 95% confidence intervals * +++
------------------------------------------------------
2.5% 97.5%
Estimate 1.136709 -34.041743 9.769052
(Intercept) -0.455199 -2.299870 12.188817
finyes -0.067715 -0.382429 0.413001
age -0.100950 -2.988104 35.336024
raceother 0.251663 -3.257339 2.193371
wexpyes 0.261947 -15.041406 35.102574
marno -0.041289 -3.156395 1.659637
paroyes 0.068443 -0.285553 0.237089
prio -0.570782 -2.614311 2.532353
educ4 -1.163257 -39.473732 34.360612
educ5 -0.860659 -3.299354 1.216299
empyes
------------------------------------------------------
+++ Survival (latency) coefficient estimates +++
+++ and 95% confidence intervals * +++
------------------------------------------------------
2.5% 97.5%
Estimate 0.062630 -1.427436 1.446067
finyes 0.046192 -0.067209 0.176043
age -0.759985 -2.770654 0.720247
raceother -0.552549 -1.672866 0.576657
wexpyes 0.123655 -2.327914 1.600195
marno 0.040388 -0.816177 1.110058
paroyes 0.048407 -0.107942 0.195763
prio 0.588156 -0.545885 1.881803
educ4 0.838098 -2.527512 5.118107
educ5 -1.431782 -1.980471 -0.781978
empyes
------------------------------------------------------
* Confidence intervals computed by the basic
100 replications.
bootstrap method, with ------------------------------------------------------
As you can see, only one covariate (emp
) in the latency component is
statistically significant (the 95% confidence interval does not include
zero). The negative sign of the estimated coefficient implies that the
individuals working full time after release have a lower risk of being
rearrested (among the individuals susceptible to be rearrested). The
lack of significance of the other covariates might be explained by the
small sample size, the high level of censoring (only 114 out of 432
individuals have been rearrested), or by potential confounding factors.
We now perform variable selection with the proposed SCAD-penalized
likelihood method to check whether other covariates may be relevant to
explain incidence and latency. First, we specify the possible values of
the tuning parameters (using the argument pen.tuneGrid
) and set the
starting values equal to the coefficient estimates from the unpenalized
model (using the argument SV
). Then, we still use the penPHcure
function, but we now include the argument pen.type = "SCAD"
.
> pen.tuneGrid <- list(CURE = list(lambda = seq(0.01,0.12,by=0.01),
+ a = 3.7),
+ SURV = list(lambda = seq(0.01,0.12,by=0.01),
+ a = 3.7))
> SV <- list(b=fit$b,beta=fit$beta)
> tuneSCAD <- penPHcure(Surv(tstart,tstop,arrest)~fin+age+race+wexp+mar+paro+prio+educ+emp,
+ cureform = ~fin+age+race+wexp+mar+paro+prio+educ+emp,
+ data = cpRossi,which.X = "mean",pen.type = "SCAD",
+ pen.tuneGrid = pen.tuneGrid,SV = SV)
-varying covariates...
Initializing PH cure model with time
: 432
Number of individuals: 0.736
Censoring proportion: TRUE
Tied failure times: 49
Number of unique failure timesin the survival component: 10
Number of covariates in the cure component: 10
Number of covariates
Checking starting values...
-penalized PH cure model with time-varying covariates... Please wait...
Tuning SCAD
iter aCURE aSURV lambdaCURE lambdaSURV AIC BIC df1 3.70 3.70 0.01 0.01 1319.1625 1384.2573 16
2 3.70 3.70 0.01 0.02 1319.1625 1384.2573 16
3 3.70 3.70 0.01 0.03 1316.0665 1360.8192 11
4 3.70 3.70 0.01 0.04 1318.0458 1358.7300 10
5 3.70 3.70 0.01 0.05 1318.0457 1358.7300 10
... (omitted rows) ... ... (omitted rows) ... ...
...
140 3.70 3.70 0.12 0.08 1325.5349 1333.6718 2
141 3.70 3.70 0.12 0.09 1325.5349 1333.6718 2
142 3.70 3.70 0.12 0.10 1325.5349 1333.6718 2
143 3.70 3.70 0.12 0.11 1325.5349 1333.6718 2
144 3.70 3.70 0.12 0.12 1325.5349 1333.6718 2
! DONE
This time, the call to the penPHcure
function returned an object of
class "penPHcure"
. We can print a summary of the results using the
summary
method, and, by default, the fitted model with the lowest BIC
criterion is returned.
> summary(tuneSCAD)
------------------------------------------------------
+++ PH cure model with time-varying covariates +++
+++ [ Variable selection ] +++
------------------------------------------------------
: 432
Sample size: 0.7361111
Censoring proportion: 49
Number of unique event times: TRUE
Tied failure times: SCAD
Penalty type: BIC
Selection criterion
------------------------------------------------------
+++ Tuning parameters +++
------------------------------------------------------
Cure (incidence) --- lambda: 0.09
: 3.7
a
Survival (latency) - lambda: 0.05
: 3.7
a
= 1329.481
BIC
------------------------------------------------------
+++ Cure (incidence) +++
+++ [ Coefficients of selected covariates ] +++
------------------------------------------------------
Estimate1.776907
(Intercept) -0.076498
age
------------------------------------------------------
+++ Survival (latency) +++
+++ [ Coefficients of selected covariates ] +++
------------------------------------------------------
Estimate0.101202
prio -1.537286 empyes
The Bayesian Information Criterion is minimized for \(\lambda_1=0.09\) and
\(\lambda_1=0.05\). In this case, the covariate age
is selected in the
incidence component. The negative sign of the estimated coefficient
implies that younger individuals are more susceptible to be rearrested.
The covariates prio
and emp
are selected in the latency component.
The positive sign of the estimated coefficient (prio
) implies that a
higher number of convictions prior to incarceration increases the risk
of being rearrested (among the individuals susceptible to be
rearrested).
Let us now have a look at the fitted model with the lowest AIC criterion:
> summary(tuneSCAD,crit.type = "AIC")
------------------------------------------------------
+++ PH cure model with time-varying covariates +++
+++ [ Variable selection ] +++
------------------------------------------------------
: 432
Sample size: 0.7361111
Censoring proportion: 49
Number of unique event times: TRUE
Tied failure times: SCAD
Penalty type: AIC
Selection criterion
------------------------------------------------------
+++ Tuning parameters +++
------------------------------------------------------
Cure (incidence) --- lambda: 0.06
: 3.7
a
Survival (latency) - lambda: 0.03
: 3.7
a
= 1310.79
AIC
------------------------------------------------------
+++ Cure (incidence) +++
+++ [ Coefficients of selected covariates ] +++
------------------------------------------------------
Estimate1.829260
(Intercept) -0.585638
finyes -0.067130
age -0.887636
educ5
------------------------------------------------------
+++ Survival (latency) +++
+++ [ Coefficients of selected covariates ] +++
------------------------------------------------------
Estimate-0.586626
raceother 0.103746
prio -1.552737 empyes
The Akaike Information Criterion is minimized for \(\lambda_1=0.06\) and
\(\lambda_1=0.03\). As expected the AIC criterion selected a less
penalized and more complex model. In the incidence component, also the
covariates fin
and educ
have been selected. The negative signs imply
that individuals who received financial aid or with a high level of
education (>=12th degree) are less susceptible to be rearrested. In the
latency component, also the covariate race
has been selected. The
negative coefficient implies that individuals of a race other than black
have a lower risk of being rearrested (among the individuals susceptible
to be rearrested).
In survival analysis studies, it may be the case that a fraction of the
population is likely to be not susceptible to the event of interest. In
this article, we presented the penPHcure R package, which implements
the semiparametric proportional-hazards (PH) cure model of
Sy and Taylor (2000) extended to time-varying covariates. This model can
measure the effects of some covariates on the probability of being
susceptible and on the time until the occurrence of the event. The
penPHcure package is composed of two main functions: penpHcure
, to
estimate the regression coefficients, their confidence intervals using
the basic/percentile bootstrap method and to perform variable selection
using the SCAD-penalized likelihood technique proposed by
Beretta and Heuchenne (2019b); and penpHcure.simulate
to simulate data from
a PH cure model with time-dependent covariates. We first explained the
methodology behind these functions and presented the results of a
simulation study to assess its finite-sample performance. Then, we
illustrated the use of the penPHcure
function through an example based
on the Criminal Recidivism dataset.
The latest release and a development version of the penPHcure package are respectively avilable on CRAN and at https://github.com/a-beretta/penPHcure.
penPHcure, flexsurvcure, flexsurv, nltm, smcure, spduration, survival, RcmdrPlugin.survival
ClinicalTrials, Distributions, Econometrics, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Beretta & Heuchenne, "penPHcure: Variable Selection in Proportional Hazards Cure Model with Time-Varying Covariates", The R Journal, 2021
BibTeX citation
@article{RJ-2021-061, author = {Beretta, Alessandro and Heuchenne, Cédric}, title = {penPHcure: Variable Selection in Proportional Hazards Cure Model with Time-Varying Covariates}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {1}, issn = {2073-4859}, pages = {53-66} }