In the presence of competing risks a prediction of the time-dynamic absolute risk of an event can be based on cause-specific Cox regression models for the event and the competing risks (Benichou and Gail 1990). We present computationally fast and memory optimized C++ functions with an R interface for predicting the covariate specific absolute risks, their confidence intervals, and their confidence bands based on right censored time to event data. We provide explicit formulas for our implementation of the estimator of the (stratified) baseline hazard function in the presence of tied event times. As a by-product we obtain fast access to the baseline hazards (compared to survival::basehaz()
) and predictions of survival probabilities, their confidence intervals and confidence bands. Confidence intervals and confidence bands are based on point-wise asymptotic expansions of the corresponding statistical functionals. The software presented here is implemented in the riskRegression package.
Predictions of hazards and risks based on a Cox regression analysis need
to be fast and memory efficient, especially in large data, in simulation
studies, and for cross-validation or bootstrap loops. The CRAN task view
Survival lists many R
packages implementing the Cox regression model and extensions thereof.
Among the most popular routines are the function coxph()
from the
survival package
(Therneau 2017) and the function cph()
from the
rms package (Harrell Jr 2017).
We present a fast and memory efficient algorithm to extract baseline
hazards and predicted risks with confidence intervals from an object
obtained with either of these functions.
In the presence of competing risks one needs to combine at least two Cox
regression models to predict the absolute risk of an event (cumulative
incidence) conditional on covariates (Benichou and Gail 1990). We
present the CSC()
-function of the R package
riskRegression
which fits the Cox regression models using either coxph()
or cph()
.
We also present a concomitant predict()
S3 method which computes the
absolute risks of the event of interest for given combinations of
covariate values and time points. Optionally, the predict()
method
computes asymptotic confidence intervals and confidence bands for the
predicted absolute risks. We review the formula behind the estimators
implemented and illustrate the R interface.
It is possible to obtain the predictions of absolute risks based on cause-specific Cox regression also with the survival package or with the mstate package (Putter et al. 2016). However, both require more work from the user. Finally, it should be noted that there are alternative regression methods for absolute risks in the presence of competing risks such as Fine-Gray regression (Fine and Gray 1999) or direct binomial regression (Scheike et al. 2008; Gerds et al. 2012).
For the sole purpose of illustration we use the Melanoma
data set
which is included in the riskRegression package. It contains data
from 205 malignant melanoma patients. Among the risk factors for cancer
specific death were patient age and sex and the histological variables
tumor thickness, invasion (levels 0,1,2), and epithelioid cells (no
present vs. present). Within the limitation of the follow-up periods, it
was observed that 57 patients had died from cancer (“status” equals 1)
and 14 had died from other causes (“status” equals 2). The remaining
patients were right censored (“status” equals 0).
library(riskRegression, verbose = FALSE, quietly = TRUE)
library(survival)
data(Melanoma)
str(Melanoma)
## 'data.frame': 205 obs. of 7 variables:
## $ time : int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : int 3 3 2 3 1 1 1 3 1 1 ...
## $ sex : int 1 1 1 0 1 1 1 0 1 0 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ year : int 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
## $ thickness: num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : int 1 0 0 0 1 1 1 1 1 1 ...
We denote by \(T\) the time between a baseline date and the date of an event and by \(D\in\{1,...,K\}\) the cause of the event. We assume that \(\{D=1\}\) is the event of interest. Let \(X=(X^{1}, \dots,X^{p})\) be a \(p\)-dimensional vector of baseline covariates with arbitrary distribution, and \(Z=(Z^1,\dots,Z^{q})\) be the strata variables, i.e. a set of categorical baseline covariates with finitely many possible values. Without loss of generality and to ease the notation we set \(q=1\). We use \(\{1,\dots,L\}\) for the categories of \(Z\).
We consider a setting in which the event time \(T\) is right censored at a random time \(C\). We assume that \(C\) is conditionally independent of \(T\) given \((X,Z)\) and fix a time \(\tau\) such that almost surely \(\mathrm P(C>\tau|X,Z)>0\). We denote \(\tilde T=\min(T,C)\), \(\tilde D = \Delta D\), and \(\Delta=1\{T\le C\}\).
Given covariates \((X,Z)\), let \(S_0(t|x,z)=P(T>t|X=x,Z=z)\) denote the event-free survival function and \(F_j(t|x,z)=P(T\le t, D=j|X=x,Z=z)\) the cumulative incidence function for event \(j\). The cause-specific hazard rates are defined as \(\lambda_{j,z}(t|x)=\mathrm d F_j(t|x,z)/S_0(t|x,z)\) (Andersen et al. 1993). We also denote the cumulative hazard rates by \(\Lambda_{j,z}(t|x)=\int_0^t\lambda_{j,z}(s|x)\mathrm d s\). The stratified Cox regression model (Cox 1972) for cause \(j\) is given by \[\label{sHazard} \lambda_{j,z}(t|x)= \lambda_{0j,z}(t)\exp(x\beta_j), \tag{1}\] where \(\beta_j={(\beta_j^1,\dots,\beta_j^p)}^\intercal\) is a \(p\)-dimensional vector of regression coefficients (the log-hazard ratios), and \(\{\lambda_{0j,z}(t): z=1,\dots,L\}\) a set of unspecified baseline hazard functions.
The cause-specific Cox regression models can be combined into a prediction of the absolute risk of an event of type \(1\) until time \(t\) conditional on the covariates \(x,z\). For the case where \(K=2\) the absolute risk formula of (Benichou and Gail 1990) is given by: \[\label{absRisk} F_1 (t|x,z)=\int_0^t S(s-|x,z)\lambda_{1,z}(s|x) \mathrm ds. \tag{2}\] where \(s-\) denotes the right sided limit, e.g. \(\Lambda_{1,z}(s-|x)=lim_{v \rightarrow s, v<s}\Lambda_{1,z}(v|x)\). The absolute risk accumulates over time the product between the event-free survival and the hazard of experiencing the event of interest, both conditional to the baseline covariates and to the strata variable. The event free survival can be estimated from the cause-specific hazards using the product integral estimator:
or the exponential approximation: \[\label{survivalEXP} \widehat{S}(t|x,z)=\exp \bigg[ - \widehat{\Lambda}_{1,z}(t|x) - \widehat{\Lambda}_{2,z}(t|x)\bigg]. \tag{3}\] which is asymptotically equivalent to the product-limit estimator if the distribution of the event times is continuous. Using the product integral estimator ensures that \(S(t|x,z)+F_1(t|x,z)+F_2(t|x,z)\) equals exactly \(1\). This is a desirable property since the sum of the transition probabilities over all possible transitions should sum to one.
Formula (2) generalizes to situations with more than 2 competing risks, i.e., \(K>2\). However, in applications with many competing risks there will sometimes be few events of specific causes, and it may be hard to fit a Cox regression model for each cause separately. One possibility when \(K>2\) is to combine all causes where \(\tilde D>1\) into a single competing risk for the cause of interest \(\tilde D=1\). While the riskRegression package allows the use of more than 2 competing risks, we will illustrate its use considering only 2 competing risks . The package implements formula (2) in two steps.
The first step is to fit the Cox regression models with the CSC()
function in order to estimate \(\lambda_{1,z}\) and \(\lambda_{2,z}\):
<- CSC(formula = Hist(time,status) ~ age + logthick + epicel + strata(sex),
cfit0 data = Melanoma)
coef(cfit0)
## $`Cause 1`
## age logthick epicelpresent
## 0.01548722 0.68178505 -0.73848649
##
## $`Cause 2`
## age logthick epicelpresent
## 0.07680909 0.04750975 0.31497177
In the call of CSC()
the argument formula
is used to define the
outcome variables with the help of the function prodlim::Hist()
. The
variable “time” in the data set contains the values of the observed
event time \(\tilde T\) and the variable “status” the cause of the event
\(\tilde D\). Objects generated with the function prodlim::Hist()
have a
print method:
<- with(Melanoma, prodlim::Hist(time,status))
h h
## Right-censored response of a competing.risks model
##
## No.Observations: 205
##
## Pattern:
##
## Cause event right.censored
## 1 57 0
## 2 14 0
## unknown 0 134
and a plot method:
plot(h, arrowLabelStyle = "count",
stateLabels = c("Radical\noperation", "Cancer\nrelated death", "Death\nother causes"))
A nice complement to the regression models is the marginal Aalen-Johansen estimate of the absolute risk of cancer related death (2):
library(prodlim)
plot(prodlim(Hist(time,status) ~1, data = Melanoma),
atrisk.at = seq(0,3652.5,365.25), xlim = c(0,3652.5),
axis1.at = seq(0,3652.5,365.25), axis1.lab = 0:10,
xlab = "Years", ylab = "Absolute risk", cause = 1)
The right hand side of the formula in the call of the CSC()
function:
Hist(time,status) ~ age + logthick + epicel + strata(sex)
defines the covariate(s) \(X\) which enter into the linear predictor
\(x\beta\) in formula (1), and the strata variable(s) \(Z\)
which define the baseline hazard functions \(\lambda_{0j,z}\). Strata
variables are specified by wrapping the variable names into the special
function strata()
, as one would do when using the coxph()
function.
If only one formula is provided, the CSC()
function will use the same
baseline covariates and strata variables for all cause-specific Cox
regression models. Instead one may feed a list of formulas into the
argument formula
, one for each cause:
<- CSC(formula = list(Hist(time,status) ~ age + logthick + epicel + strata(sex),
cfit1 Hist(time,status) ~ age + strata(sex)),
data = Melanoma)
coef(cfit1)
## $`Cause 1`
## age logthick epicelpresent
## 0.01548722 0.68178505 -0.73848649
##
## $`Cause 2`
## age
## 0.07919648
Note that the choice of the baseline covariates relative to each cause
made here is not based on clinical or statistical criteria; it was done
to illustrate the software possibilities. The causes are internally
ordered with respect to the levels of the variable “status”, if this
variable is a factor, and otherwise with respect to
sort(as.character(unique(status)))
. The order of the causes is saved
as cfit1[["causes"]]
. Accordingly, the first formula is used to fit a
Cox regression model to the first cause and the second formula is used
to fit a Cox regression model to the second cause and so on. Internally,
CSC()
constructs dummy variables, one for each cause, and then calls
the function defined by the argument fitter
on a suitably constructed
Surv()
formula. By default the cause-specific Cox models are fitted
with the function survival::coxph()
. Alternatively, one can set the
argument fitter
to the name of a different routine, e.g., cph
.
The object obtained with CSC()
has class "CauseSpecificCox"
. The
second step is to call the corresponding predict()
method. In addition
to the object obtained with CSC()
this requires three additional
arguments: newdata
, times
, cause
. The argument newdata
should be
a "data.frame"
which contains the covariates \(X\) and \(Z\) in the same
format as the data used to fit CSC()
. The argument cause
defines the
cause of interest \(D\) and the argument times
defines a vector of
prediction horizon(s) whose values are used as the upper integration
limit \(t\) in formula (2). The predict()
method computes
the absolute risks (formula (2)) for each row in newdata
and each value of times
:
<- data.frame(age = c(45,67), logthick = c(0.1,0.2),
newdata epicel = c("present","not present"),
sex = c("Female","Male"))
<- predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500)) pfit1
By default, the product integral estimator is used to estimate the
event-free survival function. Setting the argument productLimit
to
FALSE
when calling the predict function enables to use the exponential
approximation. The predict function returns a structured list of class
"predictCSC"
. The corresponding print()
method calls
as.data.table.predictCSC()
to display the predictions as follows:
print(pfit1)
## observation age logthick epicel sex times strata absRisk
## 1: 1 45 0.1 present Female 867 sex=Female 0.021
## 2: 2 67 0.2 not present Male 867 sex=Male 0.149
## 3: 1 45 0.1 present Female 3500 sex=Female 0.117
## 4: 2 67 0.2 not present Male 3500 sex=Male 0.428
For each row in newdata
(values are repeated for each prediction
horizon) and each prediction horizon (column times
) the column
“absRisk” contains the absolute risk of cancer specific mortality (cause
1). Standard errors and confidence intervals for the absolute risk can
be obtained setting the argument se
to TRUE
:
<- predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500),
pfit1se se = TRUE, keep.newdata = FALSE)
print(pfit1se)
## observation times strata absRisk absRisk.se absRisk.lower absRisk.upper
## 1: 1 867 sex=Female 0.021 0.122 0.00738 0.0478
## 2: 2 867 sex=Male 0.149 0.161 0.07356 0.2502
## 3: 1 3500 sex=Female 0.117 0.151 0.05552 0.2025
## 4: 2 3500 sex=Male 0.428 0.320 0.20416 0.6355
Here we have set the argument keep.newdata
to FALSE
to not export
the value of the covariates. The structure of the "predictCSC"
object
is as follows.
str(pfit1se)
## List of 11
## $ absRisk : num [1:2, 1:2] 0.021 0.149 0.117 0.428
## $ absRisk.se : num [1:2, 1:2] 0.122 0.161 0.151 0.32
## $ absRisk.lower : num [1:2, 1:2] 0.00738 0.07356 0.05552 0.20416
## $ absRisk.upper : num [1:2, 1:2] 0.0478 0.2502 0.2025 0.6355
## $ times : num [1:2] 867 3500
## $ strata : Factor w/ 2 levels "sex=Female","sex=Male": 1 2
## $ conf.level : num 0.95
## $ se : logi TRUE
## $ band : logi FALSE
## $ nsim.band : num 10000
## $ transformation.absRisk:function (x)
## - attr(*, "class")= chr "predictCSC"
The elements $absRisk
, $absRisk.se
, $absRisk.lower
and
$absRisk.upper
are matrices where each row corresponds to a row in
newdata
and each column to a value of the times
vector. All these
matrices are sorted according to the original orders of the arguments
newdata
and times
. To conveniently extract a subset of the results,
one should first call as.data.table.predictCSC()
to combine these
results into a "data.table"
object. Here is an example:
<- as.data.table(pfit1se)
ptable1 == 3500 & observation == 1,
ptable1[times .(times,absRisk,absRisk.lower,absRisk.upper)]
## times absRisk absRisk.lower absRisk.upper
## 1: 3500 0.1166383 0.05551508 0.2025222
In the same way confidence bands can be obtained by setting the argument
band
to TRUE
:
<- cfit1$eventTimes
vec.times <- predict(cfit1, newdata = newdata[1], cause = 1,
pfit1band times = vec.times, se = TRUE, band = TRUE)
1] newdata[
## age logthick epicel sex
## 1: 45 0.1 present Female
By default 10,000 simulations will be used to estimate the appropriate
quantile for the confidence bands (see explanations in section
Construction of the confidence bands). This default
behavior can be changed by setting the argument nsim.band
to another
value. The autoplot()
function can then be used to compare confidence
bands and the confidence intervals:
<- autoplot(pfit1band, band = TRUE, ci = TRUE)$plot
figure3 <- figure3 + xlab("Time (days)") + ylab("Absolute risk")
figure3 print(figure3)
Note that the resulting object is a "ggplot"
graphic. This can be
useful to personalize the graph, e.g. change the font size of the text.
In this section we describe the asymptotic formula behind the confidence intervals for the predicted absolute risks and the empirical counterpart which is implemented in riskRegression. We assume a sample \((\mathcal{X}_i)_{i \in \{1,\ldots,n\}}\) of \(n\) independent and identically distributed replications of \(\mathcal{X} = (\tilde T,\tilde D,X,Z)\). The estimator \(\hat F_1\) of \(F_1\) is obtained by substituting the Cox partial likelihood estimate \(\hat\beta_j\) for \(\beta_j\) and the baseline hazard estimate \(\hat\lambda_{0j,z}\) for \(\lambda_{0j,z}\) in equations (1) and (2).
The asymptotic confidence intervals for the covariate specific absolute
risks of event 1 before time \(t\) are based on the following von Mises
expansion (van der Vaart 1998):
\[\label{iid_absRisk}
\sqrt{n}(\hat{F}_1(t|x,z) - F_1(t|x,z)) = \frac{1}{\sqrt n}\sum_{i=1}^{n} \phi_{F_1}(\mathcal{X}_i;t,x,z) + o_{\mathrm P}(1). \tag{4}\]
where the exponential approximation is used for defining the event free
survival in \(F_1(t|x,z)\). Given (4), for fixed values
\(t,x,z\) the central limit theorem implies that \(\hat F_1(t|x,z)\) has an asymptotic normal distribution with asymptotic
variance \(V_{F_1}(t,x,z)=\mathrm E(\phi_{F_1}(\mathcal{X}_i;t,x,z)^2)\). Based on the estimate
\(\widehat{\phi}_{F_1}\) of the influence function \(\phi_{F_1}\) (both
defined in subsequent subsections) our variance estimate is given by
\[\label{var_absRisk}
\widehat V(t,x,z) ={\frac{1}{n}}\sum_{i=1}^{n}
\widehat{\phi}_{F_1}(\mathcal{X}_i;t,x,z)^2. \tag{5}\]
We then construct Wald confidence intervals for \(F_1(t|x,z)\) in the
usual way:
\[\left[ F_1(t|x,z) + q_{\alpha/2}\sqrt{\widehat V(t,x,z))}; F_1(t|x,z) + q_{1-\alpha/2} \sqrt{\widehat V(t,x,z)} \right]\]
where \(q_{\alpha}\) is the \(\alpha\)-quantile of the normal distribution.
Since the absolute risk is bounded below by \(0\) and above by \(1\), the
confidence interval is automatically restricted to this interval.
Alternatively the confidence interval can be computed using a log-log
transformation: first the confidence interval is computed on the log-log
scale:
\[\left[ \log(-\log(F_1(t|x,z)) ) + q_{\alpha/2}\sqrt{\widehat V_{log-log}(t,x,z))}; \log(-\log(F_1(t|x,z)) ) - q_{1-\alpha/2} \sqrt{\widehat V_{log-log}(t,x,z)} \right]\]
where \(V_{log-log}\) is the variance of the influence function on the
log-log scale. Then the confidence interval is back transformed using
the link function: \(x \mapsto \exp(-\exp(x))\). This ensures that the
confidence interval is bounded below by \(0\) and above by \(1\). By
default, the confidence intervals are computed using the log-log
transformation. To compute them without using the log-log
transformation, the argument log.transform
needs to be set to FALSE
when calling predict.CauseSpecificCox()
:
<- predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500),
pfit2se se = TRUE, log.transform = FALSE, keep.newdata = FALSE)
print(pfit2se)
## observation times strata absRisk absRisk.se absRisk.lower absRisk.upper
## 1: 1 867 sex=Female 0.021 0.00992 0.00157 0.0404
## 2: 2 867 sex=Male 0.149 0.04586 0.05945 0.2392
## 3: 1 3500 sex=Female 0.117 0.03795 0.04226 0.1910
## 4: 2 3500 sex=Male 0.428 0.11620 0.20021 0.6557
Here the column “absRisk.se” contains \(V_{log-log}\) (log-log scale) while the columns “absRisk”, “absRisk.lower”, and “absRisk.upper” are on the original scale. The benefit of using a log-log transformation is studied in the section Coverage of the confidence interval and the confidence bands.
The influence function \(\phi_{F_1}\) can be expressed as a function of the influence functions \(\phi_{\lambda_1}\) and \(\phi_{\lambda_2}\) of the cause-specific hazard rates: \[\begin{gathered} \label{IF_absRisk} \phi_{F_1}(\mathcal{X},t,x,z) = \int_{0}^{t} \exp(-\Lambda_{1,z}(s|x)-\Lambda_{2,z}(s|x)) d\phi_{\Lambda_{1,z}}(\mathcal{X};s,x) \\ - \int_{0}^{t} \lambda_{1,z}(s|x) \exp(-\Lambda_{1,z}(s|x)-\Lambda_{2,z}(s|x)) \left( \phi_{\Lambda_{1,z}}(\mathcal{X};s,x) + \phi_{\Lambda_{2,z}}(\mathcal{X};s,x) \right) ds. \end{gathered} \tag{6}\] We use the shorthand notations \(v^{\otimes 0}=0, v^{\otimes 1}=v, v^{\otimes 2}=v{v}^\intercal\) and \[\int f(s,d,v,w)\mathrm d\mathrm P(s,d,v,w) = \int_0^\infty\sum_{d=0}^K\int_{R^p}\sum_{z=1}^L f(s,d,v,w)\mathrm d\mathrm P(s,d,v,w).\] We also suppress the dependence on \(\tau\) when we in the following adapt the usual notation for Cox model asymptotic theory to the cause-specific and stratified case: \[\begin{aligned} \mathcal{S}^{(r)}(t,\beta_j,z) &= \int 1\{t \le s\le \tau,w=z\}\exp (v \beta_j) v^{\otimes r} \mathrm d\mathrm P(s,d,v,w)\\ E(t,\beta_j,z) &= \frac{\mathcal{S}^{(1)}(t,\beta_j,z)}{\mathcal{S}^{(0)}(t,\beta_j,z)}\\ \mathcal{I}(\beta_j) &= \int 1\{d=j\} \left(\frac{\mathcal{S}^{(2)}(s,\beta_l,w)}{\mathcal{S}^{(0)}(s,\beta_j,w)} - E(s,\beta_j,w)^{\otimes 2}\right)\mathrm d\mathrm P(s,d,v,w). \end{aligned}\]
For fixed cause \(j\), time \(t\) and strata \(z\) the expansions \(\sqrt n (\hat \beta_j -\beta_j) = \frac{1}{\sqrt n} \sum_{i=1}^n \phi_{\beta_{j}}(\mathcal X_i) + o_{\mathrm P}(1)\) and \(\sqrt n (\hat \Lambda_{0j,z}(t) -\Lambda_{0j,z}(t)) = \frac{1}{\sqrt n} \sum_{i=1}^n \phi_{\Lambda_{0j,z}}(\mathcal X_i,t)+ o_{\mathrm P}(1)\) are then characterized by the influence functions \[\begin{aligned} \phi_{\beta_{j}}(\mathcal X) &= \mathcal{I}(\beta_j)^{-1} \biggl(1\{\tilde D=j,\tilde T<\tau\}\left(X - E(\tilde T,\beta_j,Z)\right) \notag\\ &\quad - \exp (X \beta_j) \int 1\{d=j,s\leq\tilde T\} \frac{X - E(s,\beta_j,Z)}{\mathcal{S}^{(0)}(s,\beta_j,Z)}\mathrm dP(s,d,v,w)\biggr) \label{IF_beta} \\ \end{aligned} \tag{7}\]
\[\begin{aligned} \phi_{\Lambda_{0j,z}}(\mathcal X;t) &= - \phi_{\beta_j}(\mathcal X) \int_0^t E(s,\beta_j,z) \lambda_{0j,z}(s) ds \notag \\ & \quad - 1\{Z=z\} \biggl(\exp (X \beta_j) \int_0^{min(t,\tilde T)} \frac{\lambda_{0j,z}(s) }{\mathcal{S}^{(0)}(s,\beta_j,z)} ds - \frac{1\{\tilde{T} \leq t,\tilde D=j\}}{\mathcal{S}^{(0)}(\tilde{T},\beta_j,Z)} \biggr). \label{IF_cumHazard} \end{aligned} \tag{8}\]
In absence of strata, formula (7) is equal to formula 2 of (Reid 1981) and formula (8) equals the one given in Gerds and Schumacher (2001) [top of page 576; note however that there is a sign mistake in their first term]. To connect these formulas with formula (6) it remains to note that under the Cox regression model the influence function of the cause-specific hazard rate can be written as: \[\label{IF_lambda} \phi_{\Lambda_{j,z}}(\mathcal{X};t,x) = \exp(x\beta_j) \left( \phi_{\Lambda_{0j,z}}(\mathcal{X};t) + \Lambda_{0j,z}(t)\; x \; {\phi_{\beta_j}(\mathcal{X})}^\intercal \right). \tag{9}\]
The following formulas are obtained with the plug-in principle substituting the Cox partial likelihood estimates \(\hat{\beta}_j\) for \(\beta_j\) and the baseline hazard estimates \(\hat{\lambda}_{0j,z}\) for \(\lambda_{0j,z}\) into formulas (6) - (9). We denote by \(N^z_j(t)=\sum_{i=1}^n 1\{\tilde T_i \le t, \tilde D_i=j, Z_i=z\}\) the strata and cause specific counting process, and by \(Y^{z}(t)=\sum_{i=1}^n 1\{T_i \geq t, Z_i=z\}\) the strata specific “at-risk” process (Andersen et al. 1993). The empirical estimate of the influence function of the partial likelihood estimate is given by \[\begin{aligned} \widehat{\phi}_{\beta_{j}}(\mathcal X_i) &= \widehat{\mathcal{I}}(\beta_j)^{-1} \biggl( \Delta_i \left(X_i - \hat{E}(\tilde T_i,\hat{\beta}_j,Z_i)\right) - \exp (X_i \hat{\beta}_j) \int_0^{T_i} \frac{X_i - \hat{E}(s,\hat{\beta}_j,Z_i)}{\widehat{\mathcal{S}^{(0)}}(s,\hat{\beta}_j,Z_i)}\mathrm dN_j(s)\biggr) \end{aligned}\] where \[\begin{aligned} \widehat{\mathcal{S}^{(r)}}(t,\hat{\beta}_j,z) &= \int_0^\tau \exp (X_i \hat{\beta}_j) X_i^{\otimes r}\mathrm dN^z_j(t). \end{aligned}\] The influence functions for the cumulative baseline hazard and its first differential are estimated by: \[\begin{aligned} \widehat{\phi}_{\Lambda_{0j,z}}(\mathcal X_i;t) &= - \widehat{\phi}_{\beta_j}(\mathcal X_i) \int_0^t E(s,\hat{\beta}_j,z) \hat{\lambda}_{0j,z}(s)\mathrm ds \notag \\ & \quad - 1\{Z_i=z\} \biggl( \exp (X_i \hat{\beta}_j) \int_0^{min(t, \tilde T_i)} \frac{\hat{\lambda}_{0j,z}(s) }{\widehat{\mathcal{S}^{(0)}}(s,\hat{\beta}_j,z)} ds - \frac{1\{\tilde{T}_i \leq t,\tilde D_i=j\}}{\widehat{\mathcal{S}^{(0)}}(\tilde{T}_i,\hat{\beta}_j,Z_i)} \biggr) \label{empiricalIF_Lambda0}\\ \end{aligned} \tag{10}\]
\[\begin{aligned} d\widehat{\phi}_{\Lambda_{0j,z}}(\mathcal X_i;t) &= - \widehat{\phi}_{\beta_j}(\mathcal X_i) E(t,\hat{\beta}_j,z) \hat{\lambda}_{0j,z}(t) \notag \\ & \quad - 1\{Z_i=z\} \biggl( \exp (X_i \hat{\beta}_j) 1\{t \leq \tilde T_i\} \frac{\hat{\lambda}_{0j,z}(t) }{\widehat{\mathcal{S}^{(0)}}(t,\hat{\beta}_j,z)} - \frac{1\{\tilde T_i= t,\tilde D_i=j\}}{\widehat{\mathcal{S}^{(0)}}(\tilde{T_i},\hat{\beta}_j,Z_i)} \biggr). \label{empiricalIF_lambda0} \end{aligned} \tag{11}\] These estimates lead to the following estimates of the influence functions of the covariate specific cumulative hazard and its derivative relative to the time: \[\begin{aligned} \label{empiricalIF_Lambda} \hat{\phi}_{\Lambda_{j,z}}(\mathcal{X}_i;t,x) &= \exp(x\hat{\beta}_j) \left( \widehat{\phi}_{\Lambda_{0j,z}}(\mathcal{X}_i;t) + \hat{\Lambda}_{0j,z}(t)\; x \; {\widehat{\phi}_{\beta_j}(\mathcal{X}_i)}^\intercal \right) \\ \end{aligned} \tag{12}\]
\[\begin{aligned} \label{empiricalIF_lambda} d\hat{\phi}_{\Lambda_{j,z}}(\mathcal{X}_i;t,x) &= \exp(x\hat{\beta}_j) \left( d\widehat{\phi}_{\Lambda_{0j,z}}(\mathcal{X}_i;t) + \hat{\lambda}_{0j,z}(t)\; x \; {\widehat{\phi}_{\beta_j}(\mathcal{X}_i)}^\intercal \right). \end{aligned} \tag{13}\] Finally, we obtain our estimate of the influence function of the absolute risk: \[\begin{aligned} \label{empiricalIF_risk} & \widehat{\phi}_{F_1}(\mathcal{X}_i,t,x,z) = \int_0^t \exp(-\hat{\Lambda}_{1,z}(s|x)-\hat{\Lambda}_{2,z}(s|x)) d\widehat{\phi}_{\Lambda_{1,z}}(\mathcal{X}_i;s,x) \\ & - \int_0^t \hat{\lambda}_{1,z}(s|x) \exp(-\hat{\Lambda}_{1,z}(s|x)-\hat{\Lambda}_{2,z}(s|x)) \biggl(\widehat{\phi}_{\Lambda_{1,z}}(\mathcal{X}_i;s,x) + \widehat{\phi}_{\Lambda_{2,z}}(\mathcal{X}_i;s,x) \biggr)\mathrm ds. \end{aligned} \tag{14}\]
A confidence band with confidence level \(1-\alpha\) for the absolute risk \(F_1(.|x,z)\) restricted to the time interval \(\mathcal{T} = [\tau_1;\tau_2]\) is a region \(R_{\mathcal{T}}(x,z) = \left[l_{x,z}(t);u_{x,z}(t)\right]_{t\in\mathcal{T}}\) satisfying: \[\mathrm P( F_1(t|x,z) \in \left[ l_{x,z}(t) ; u_{x,z}(t)\right] | \forall t \in \mathcal{T}) = 1-\alpha.\]
In figure 3, \(\tau_1=0\) and \(\tau_2=3458\), the time at which the last event occurred. Using the martingale central limit theorem, (Cheng et al. 1998) have shown that \(\hat{F}_1(t|x,z)-F_1(t|x,z)\) converges weakly to a zero-mean Gaussian process on \(\mathcal{T}\). The asymptotic variance of this process is \(V(t,x,z)\). However the \(1-\alpha\) quantile achieving simultaneous coverage is larger than the \(1-\alpha\) quantile of a standard normal distribution. Since the dependence between the increments of the process \(\hat{F}_1(t|x,z)-F_1(t|x,z)\) makes the derivation of an explicit expression for the quantile difficult, we used instead a resampling technique (Scheike and Zhang 2008). Consider over \(t \in \mathcal{T}\) the normalized process: \[\psi_{F_1}(\mathcal{X}_i;t,x,z) = \phi_{F_1}(\mathcal{X}_i;t,x,z)/\sqrt{V(t,x,z)}.\] Denote by \(c_{1-\alpha/2}\) the \(1-\alpha/2\) quantile of the sample: \[\sup_{t\in\mathcal{T}}\left| \psi_{F_1}(\mathcal{X}_i;t,x,z) \right|\] and using the symmetry of the Gaussian distribution, i.e. \(c_{\alpha/2} = - c_{1-\alpha/2}\), a \(1-\alpha\) confidence band over \(\mathcal{T}\) is constructed as follows: \[\begin{aligned} \left[ F_1(t|x,z) - c_{1-\alpha/2}\sqrt{\widehat V(t,x,z))}; F_1(t|x,z) + c_{1-\alpha/2} \sqrt{\widehat V(t,x,z)} \right]. \end{aligned}\] Like for the confidence intervals, the confidence bands will be restricted to the interval [0;1] when they are not computed using a log-log transformation.
The function predict.CauseSpecificCox()
calls two important functions,
predictCox()
that computes the hazard and cumulative hazard for a
fitted Cox model, and iidCox()
that computes the influence function
for the baseline hazard and regression coefficients. In this section we
first explain how predictCox()
deals with ties in the event times. We
then show that the function predictCox()
can also be used to obtain
confidence intervals and bands for covariate specific survival
probabilities in the situation without competing risks. Implementation
details about the function iidCox()
are postponed to appendix B. These
details may be useful for programmers who need to care about memory
usage.
We speak of ties when two or more observations have the same value of
the time variable \(\tilde T\). Ties occur for example when time is
recorded on a discrete scale such as months. The survival package
implements three different methods to deal with ties (“efron”,
“breslow”, and “exact”, see help(coxph)
) for the partial likelihood
estimator of the log hazard ratios \(\beta_j\) (Therneau and Grambsch 2000). We have
implemented the “efron” and the “breslow” method but not the “exact”
method. For a comparison of these methods and yet another method see
(Scheike and Sun 2007). We now state the formula for the baseline hazard
function under Breslow’s (Breslow 1974) and Efron’s method (Efron 1977) for the
handling of ties. The baseline hazard estimate in strata \(z\) given by
the Breslow method is:
\[d\hat\Lambda_{0j,z}^B(t)= \frac{\mathrm dN^z_{j}(t)}{\sum_{i \in Y^{z}(t)} \exp(\beta^t X_i)}.\]
With the Efron method for handling ties the formula is given by:
\[d\hat{\Lambda}_{0j,z}^E(t)=\sum^{\mathrm dN^z_{j}(t)}_{k=1}
\frac{1}{\sum_{i \in Y^{z}(t)} \exp(\beta^t X_i)-
\frac{k-1}{\mathrm dN^z_{j}(t)}\sum^{\mathrm dN^z_{j}(t)}_{i=1} \exp(\beta^t X_i)}.\]
Both estimators are implemented in the function predictCox()
which
provides estimates of the baseline hazard, the cumulative baseline
hazard and baseline survival. The predictCox()
function does not have
an argument to specify whether Breslow method or Efron method should be
used; instead it uses the same method that has been used to estimate the
regression coefficients. In the case of the coxph()
function, the
default method is Efron:
<- coxph(Surv(time,status != 0) ~ age + logthick + epicel + strata(sex),
f1 data = Melanoma, x = TRUE, y = TRUE)
$method f1
## [1] "efron"
Therefore the baseline hazard will be estimated using the Efron method
when calling predictCox()
:
<- predictCox(f1)
baseH1 as.data.table(baseH1[c("time","cumhazard","strata","survival")])
## time cumhazard strata survival
## 1: 99 0.00623279 Female 0.9937866
## 2: 232 0.01256406 Female 0.9875145
## 3: 279 0.01897728 Female 0.9812017
## 4: 295 0.02555481 Female 0.9747690
## 5: 355 0.03221850 Female 0.9682950
## ---
## 199: 3909 0.69673810 Male 0.4982078
## 200: 4119 0.69673810 Male 0.4982078
## 201: 4207 0.69673810 Male 0.4982078
## 202: 4310 0.69673810 Male 0.4982078
## 203: 4492 0.69673810 Male 0.4982078
The cumulative baseline hazard and baseline survival are displayed in the columns “cumHazard”, and “survival” of the output. This corresponds, respectively, to \(\Lambda_{0j,z}(t)\) and \(exp(-\Lambda_{0j,z}(t))\) where \(j\) is 1, \(z\) can be found in the column “strata” and \(t\) in “time”. The covariate specific cumulative hazard \(\Lambda_{j,z}(t|x)\) and survival \(exp(-\Lambda_{j,z}(t|x))\) can be estimated using the same function:
predictCox(f1, newdata = Melanoma[c(17,101,123),],
times = c(7,3,5)*365.25)
## observation times strata cumhazard survival
## 1: 1 2557 Male 0.884 0.413
## 2: 2 2557 Female 0.555 0.574
## 3: 3 2557 Female 0.949 0.387
## 4: 1 1096 Male 0.453 0.635
## 5: 2 1096 Female 0.202 0.817
## 6: 3 1096 Female 0.346 0.708
## 7: 1 1826 Male 0.670 0.512
## 8: 2 1826 Female 0.366 0.693
## 9: 3 1826 Female 0.626 0.535
In absence of competing risks, the influence function of a Cox model can
be used to estimate confidence intervals for the survival. This can be
done by setting the argument se
to TRUE
when calling the
predictCox()
function:
<- predictCox(f1, newdata = Melanoma[3:5,],
p1 times = c(Melanoma$time[5:7],1000),
se = TRUE, type="survival")
The predictCox()
function output an object of class "predictCox"
.
The print()
method can be used to display the confidence intervals for
the survival computed at different times:
print(p1)
## observation times strata survival survival.se survival.lower survival.upper
## 1: 1 185 Male 0.980 0.616 0.935 0.994
## 2: 2 185 Female 0.985 1.014 0.893 0.998
## 3: 3 185 Male 0.947 0.652 0.823 0.985
## 4: 1 204 Male 0.973 0.567 0.921 0.991
## 5: 2 204 Female 0.985 1.014 0.893 0.998
## 6: 3 204 Male 0.929 0.594 0.791 0.977
## 7: 1 210 Male 0.967 0.503 0.913 0.987
## 8: 2 210 Female 0.985 1.014 0.893 0.998
## 9: 3 210 Male 0.912 0.552 0.762 0.969
## 10: 1 1000 Male 0.864 0.321 0.760 0.925
## 11: 2 1000 Female 0.769 0.286 0.631 0.860
## 12: 3 1000 Male 0.673 0.391 0.426 0.832
Here the confidence intervals were computed using a log-log
transformation. The argument log.transform
can be set to FALSE
to
compute them without using the log-log transformation. Confidence bands
can also be obtained using predictCox()
by setting the argument band
to TRUE
. As for the predict.CauseSpecificCox()
function, 10,000
simulations will be used to compute the confidence bands; this can be
changed specifying the argument nsim.band
. The function
as.data.table.predictCox()
makes it easy to extract subsets from
"predictCox"
object:
<- as.data.table(p1)
p1 == 185,] p1[times
## observation times strata survival survival.se survival.lower survival.upper
## 1: 1 185 Male 0.9801395 0.6163925 0.9350597 0.9940246
## 2: 2 185 Female 0.9845660 1.0137483 0.8927600 0.9978695
## 3: 3 185 Male 0.9470887 0.6524116 0.8226132 0.9849795
To assess the validity of the estimation of the standard error, we
performed a simulation study. The sample size was varied between 50 and
10000. For each sample size, 5000 datasets were simulated using the
SimCompRisk()
function from the riskRegression package. The time
of first event typically ranged between \(0.001\) and \(20\) with a median
around \(4\). For each dataset, a Cox model specific to cause 1 and a 2
cause-specific Cox model were fitted considering 2 covariates (\(X1\) and
\(X2\)). The survival, absolute risk, their confidence intervals were
estimated (with or without log-log transformation) at time 1, 1.5, 2, 3,
4, 5, 6, and 7 conditional on \(X1=0\) and \(X2=1\). The confidence bands
over those 8 times were also computed.
The true absolute risk was defined as the median of the absolute risks over the 5000 datasets. The coverage of the confidence intervals was computed as the percentage of times that the true absolute risk was inside the confidence interval, at a given time. The coverage of the confidence bands was computed as the percentage of times that the true absolute risk was inside the confidence bands, simultaneously for all of the 8 times.
As expected, the coverage of the confidence intervals is improving with increasing sample size and reaches its nominal level of 95% at around sample size 1000 (figure 4, left panel). Using a log-log transformation leads to better small sample properties and similar large sample properties (figure 4, right panel). A larger sample size was necessary for the confidence bands to converge toward the nominal coverage level (n = 5000, figure 5 left panel). Again log-log transformation leads to better small sample properties. We only displayed here the coverage for the absolute risk but similar coverage was obtained for the survival function.
We compare the performance of predictCox()
regarding the estimate of
the baseline hazard function with that of the function
survival::basehaz()
. For this purpose we simulate data with 10
covariates including both continuous and discrete type using the
sampleData()
.
In our performance study we vary the sample size ranging from 500 to 1,000,000 observations and consider both stratified:
Surv(time,event) ~ strata(X1) + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
and non-stratified Cox regression models:
Surv(time,event) ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
The computation times were estimated using the
rbenchmark
package (Kusnierczyk 2012) and averaged across 100 simulated data sets.
The memory usage was estimated using the
profvis package
(Chang and Luraschi 2017) and averaged across 10 simulations. When the execution
time of the function was extremely fast (i.e. <0.005s), the memory
usage could not be reliably assessed and was set to NA
.
Memory usage and computation time are displayed on figure
6 and 7.
Both functions lead to a reasonable computation time (<1 min) even when
considering very large datasets (e.g. >10,000 observations).
Nevertheless the predictCox()
function outperforms the basehaz()
function by a factor varying between \(3\) and \(11\) in terms of
computation time. The gain in speed is especially expressed in large
datasets. Memory usage is also lower for predictCox()
and decreases by
a factor between \(1\) and \(1.6\) as compared to basehaz()
.
We now compare two implementations for computing the standard errors of
the absolute risk. The default implementation corresponds to setting the
argument store.iid
to "full"
when calling
predict.CauseSpecificCox()
or predictCox()
while the second is
obtained by setting store.iid
to "minimal"
. The two implementations
are described in more detail appendix B.
First we compare their computation time and memory consumption when
making only one prediction. As before, we simulated datasets for \(K=2\)
using the SimCompRisk()
function with increasing sample size. Then,
the two cause-specific Cox models were fitted to each of the simulated
datasets using riskRegression::CSC
. Then we measured the computation
time and memory usage necessary to estimate the absolute risk with its
standard error for the first observation at time 4, when using the
argument store.iid="minimal"
or store.iid="full"
in
predict.CauseSpecificCox()
.
The results are show on figure 8. While both
implementations lead to a similar computation time, the memory usage for
store.iid="minimal"
grows linearly while for store.iid="full"
it
grows approximately in \(n^{1.4}\). However, if instead of estimating the
absolute risk with its standard error for one prediction, we estimate it
for all the observations the implementation store.iid="minimal"
becomes slower compared to store.iid="full"
(e.g. \(28.5\) minutes
vs. \(6\) minutes for n=2000).
This paper introduces new features of the riskRegression package
for prediction of absolute risks from cause-specific Cox regression
models using computationally efficient functions. 1
summarizes the main functions described in this paper. Confidence
intervals and confidence bands for the absolute risks can be computed
using the predict()
function and displayed using the print()
method.
The predictCox()
function can be applied on "coxph"
and "cph"
objects to predict the survival with its confidence interval or
confidence bands. In both cases, the autoplot()
function can display
the predicted risk (or survival) over time. When dealing with small to
moderate sample sizes, we advise to compute confidence intervals or
confidence bands using a log-log transformation (argument
log.transform
).
CSC() |
Fit cause-specific Cox models |
predict() |
Predict covariate specific absolute risks for given time horizons |
iidCox() |
Compute the influence function of the baseline hazard estimator and of the partial likelihood estimator of the regression coefficients |
predictCox() |
Compute the (cumulative) baseline hazard function and predictions of hazards and survival probabilities in new data |
autoplot() |
Graphical display of the predicted risk across time |
The CSC()
function requires a routine to estimate the regression
coefficients of the Cox model. By default, CSC()
calls the coxph()
function from the survival package to do so. This has several
reasons: coxph()
has been thoroughly tested, is reasonably fast,
widely used and provides flexible modeling options. However in specific
contexts, other routines may be more appropriate, e.g. faster. Also,
developers that have implemented their own routine may be interested in
computing the baseline hazard or the influence function.
To be able to apply the predictCox()
and iidCox()
functions on new
classes, one needs to define the methods extracting the necessary
information from the new class. For instance, the
mets package (Holst and Scheike 2017)
contains an efficient routine for estimating Cox models. However
"phreg"
object that are quite different from "coxph"
objects:
library(mets, verbose = FALSE)
$entry <- 0
Melanoma<- phreg(Surv(entry, time, status != 0) ~ age + logthick + epicel +
f1.phreg strata(sex), data = Melanoma)
list(coxph=names(f1),
phreg=names(f1.phreg))
## $coxph
## [1] "coefficients" "var" "loglik" "score"
## [5] "iter" "linear.predictors" "residuals" "means"
## [9] "concordance" "method" "n" "nevent"
## [13] "terms" "assign" "wald.test" "x"
## [17] "strata" "y" "formula" "xlevels"
## [21] "contrasts" "call"
##
## $phreg
## [1] "coef" "ploglik" "gradient" "hessian" "U" "S0"
## [7] "nevent" "ord" "time" "jumps" "jumptimes" "strata"
## [13] "entry" "exit" "status" "p" "X" "id"
## [19] "opt" "call" "model.frame"
Therefore to be able to use the predictCox()
and iidCox()
functions
on "phreg"
one needs to define methods to extract:
(CoxCenter()
method). Instead of working on \(X\), many routines
estimating the Cox model parameters works on a centered version
\(\tilde X = X-\bar{X}\). The CoxCenter()
method returns \(\bar{X}\):
:::coxCenter.coxph(f1) riskRegression
## age logthick epicelpresent
## 52.4634146 0.6181706 0.4341463
(CoxDesign()
method). The first two columns describe the beginning
and the end of the interval of time when the individual was
followed. The third contains the event type (0 corresponding to
censoring and 1 to an observed event, e.g. death). The remaining
columns contain the design matrix corresponding to the coefficients
\(\beta\) of the Cox model and the strata variable (if any):
head(riskRegression:::coxDesign.coxph(f1))
## start stop status age logthick epicelpresent strata
## 1 0 10 1 76 1.9110229 1 2
## 2 0 30 1 56 -0.4307829 0 2
## 3 0 35 0 41 0.2926696 0 2
## 4 0 99 1 71 1.0647107 0 1
## 5 0 185 1 52 2.4915512 1 2
## 6 0 204 1 28 1.5769147 0 2
(CoxFormula()
method):
:::coxFormula.coxph(f1) riskRegression
Surv(time, status != 0) ~ age + logthick + epicel + strata(sex)
\(X\beta\) (CoxLP()
method). This function has three arguments:
object
, data
, and center
.
head(riskRegression:::coxLP.coxph(f1, data = NULL, center = FALSE))
## [1] 2.433793 1.136828 1.168604 2.332216 2.165533 1.559629
When setting data
to NULL
, the CoxLP()
method will return the
linear predictor computed on the dataset used to fit the Cox model. The
center
argument indicates whether the covariates should be centered
before computing the linear predictor.
(CoxN()
method):
:::coxN.coxph(f1) riskRegression
## [1] 205
(CoxSpecialStrata()
method):
:::coxSpecialStrata.coxph(f1) riskRegression
## [1] "strata"
(CoxStrata()
method). This variable must be univariate,
aggregating all the strata variables.
head(riskRegression:::coxStrata.coxph(f1, data = NULL, strata.vars = "strata(sex)"))
## [1] Male Male Male Female Male Male
## Levels: Female Male
Similarly to CoxLP()
, when setting data
to NULL
, the CoxStrata()
method will return the strata variable computed on the dataset used to
fit the Cox model.
(CoxVarCov()
method):
:::coxVarCov.coxph(f1) riskRegression
## age logthick epicelpresent
## age 6.553939e-05 -0.0002102408 -0.0004363469
## logthick -2.102408e-04 0.0206977373 0.0076882093
## epicelpresent -4.363469e-04 0.0076882093 0.0692047486
Most of the above methods correspond to a very small piece of code that reformats the information contained in the object, e.g.:
:::coxVarCov.coxph riskRegression
## function (object)
## {
## Sigma <- object$var
## if (!is.null(Sigma)) {
## coefName <- names(coef(object))
## colnames(Sigma) <- coefName
## rownames(Sigma) <- coefName
## }
## return(Sigma)
## }
## <environment: namespace:riskRegression>
We refer to the help page of each method for a more precise description
of each method arguments and expected output, as well as examples. Once
all of the methods have been defined for a new object (e.g. "phreg"
),
predictCox()
and iidCox()
can be applied on the new object:
all.equal(predictCox(f1), predictCox(f1.phreg))
## [1] TRUE
all.equal(iidCox(f1), iidCox(f1.phreg))
## [1] TRUE
The function iidCox()
computes the estimates of the influence
functions \(\phi_{\beta_j}\), \(\phi_{\lambda_{0j}}\), and
\(\phi_{\Lambda_{0j}}\) for a given set of covariates and time points:
<- iidCox(f1) f1.iid
The default implementation (store.iid="full"
) stores the influence
function for the baseline cumulative hazard as a list of matrices, one
for each strata. The size of each matrix is the number of observations
\(n\) times the number of unique event times \(n_T(z)\) in stratum \(z\).
Storing the influence function can be very memory demanding when
considering large datasets. This is why the influence function is only
temporary stored during the execution of the predict()
method.
When dealing with very large datasets, e.g. following \(n=20 000\)
patients during \(n_T=365\) days, storing the influence function can be
too memory demanding. Instead of computing and storing
\(\phi_{\lambda_{0j}}\), an alternative solution is to only store the
necessary quantities to compute \(\phi_{\lambda_{0j}}\):
\[\label{IFstoreMin}
\frac{1\{\tilde T_i= t,\tilde D_i=j\}}{\widehat{\mathcal{S}^{(0)}}(\tilde{T_i},\hat{\beta}_j,Z_i)}, \;
E(t,\hat{\beta}_j,z) \hat{\lambda}_{0j,z}(t), \;
\int_0^t E(s,\hat{\beta}_j,z) \hat{\lambda}_{0j,z}(s)\mathrm ds, \;
\frac{\hat{\lambda}_{0j,z}(t) }{\widehat{\mathcal{S}^{(0)}}(t,\hat{\beta}_j,z)}, \;
\int_0^t \frac{\hat{\lambda}_{0j,z}(s) }{\widehat{\mathcal{S}^{(0)}}(s,\hat{\beta}_j,z)} ds. \tag{15}\]
These quantities are lists containing \(L\) vectors of length \(n\) or
\(n_T(z)\). Since in most applications \(n\) and \(n_T(z)\) are large compared
to \(L\) this approach is much more memory efficient. Setting the argument
store.iid
to "minimal"
when calling iidCox
will return the
influence function using this alternative storage method:
<- iidCox(f1, store.iid = "minimal") f2.iid
We can compare the memory cost of the default implementation vs. the alternative one:
<- object.size(f1.iid$IFcumhazard)
size.f1.iid <- object.size(f2.iid$calcIFhazard)
size.f2.iid as.numeric(size.f2.iid/size.f1.iid)
## [1] 0.08627399
and see that the memory use for storing the influence function of the cumulative hazard has been divided by more than 10.
Computing the influence function of the absolute risk with the default implementation only requires to:
use \(\widehat{\phi}_{\Lambda_{0j,z}}(\mathcal{X}_i;t)\) and \(d\widehat{\phi}_{\Lambda_{0j,z}}(\mathcal{X}_i;t)\) with formula (12) and (13) to obtain \(\hat{\phi}_{\Lambda_{j,z}}(\mathcal{X}_i;t,x)\) and \(d\hat{\phi}_{\Lambda_{j,z}}(\mathcal{X}_i;t,x)\).
use formula (14) to obain \(\widehat{\phi}_{F_1}(\mathcal{X}_i,t,x,z)\).
When using the alternative implementation, \(\widehat{\phi}_{\Lambda_{0j,z}}(\mathcal{X}_i;t)\) and \(d\widehat{\phi}_{\Lambda_{0j,z}}(\mathcal{X}_i;t)\) have not been computed. Therefore an additional step is needed:
The two implementations will lead to the same influence function and therefore the same confidence intervals or confidence bands.
The alternative implementation is performed iterating over the set of covariates used to make the predictions, avoiding to store the influence function of the baseline hazard for all event times and strata. It will also not compute the influence function at unnecessary times and strata. Thus it should always be more memory efficient and, when asking for a single prediction, it should also be have a lower computation time. However, compared to the default implementation where \(\widehat{\phi}_{\Lambda_{0j,z}}(\mathcal{X}_i;t)\) and \(d\widehat{\phi}_{\Lambda_{0j,z}}(\mathcal{X}_i;t)\) are only computed once, the alternative implementation recomputes these quantities for each prediction.
Therefore when the prediction is to be made for many different sets of covariates (i.e. new patients) this may lead to a substantial increase in computation time. See the subsection Runtime and memory usage for more details.
To use the implementation for computing the standard errors, confidence
intervals, or confidence bands one must set the argument store.iid
to
"minimal"
:
<- predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500),
pfit3se se = TRUE, store.iid = "minimal", keep.newdata = FALSE)
print(pfit3se)
## observation times strata absRisk absRisk.se absRisk.lower absRisk.upper
## 1: 1 867 sex=Female 0.021 0.122 0.00738 0.0478
## 2: 2 867 sex=Male 0.149 0.161 0.07356 0.2502
## 3: 1 3500 sex=Female 0.117 0.151 0.05552 0.2025
## 4: 2 3500 sex=Male 0.428 0.320 0.20416 0.6355
range(pfit3se$absRisk.se-pfit1se$absRisk.se)
range(pfit3se$absRisk.upper-pfit1se$absRisk.upper)
range(pfit3se$absRisk.lower-pfit1se$absRisk.lower)
## [1] -2.775558e-17 5.551115e-17
## [1] -5.551115e-17 0.000000e+00
## [1] -5.551115e-17 0.000000e+00
This option also applies when computing standard errors, confidence intervals, or confidence bands for the survival probabilities:
<- predictCox(f1, newdata = Melanoma[3:5,],
p2 times = c(Melanoma$time[5:7],1000),
se = TRUE, store.iid = "minimal", type="survival")
<- as.data.table(p2)
p2 == 185, survival.lower]-p1[times == 185, survival.lower]
p2[times == 185, survival.upper]-p1[times == 185, survival.upper] p2[times
## [1] 0.000000e+00 1.110223e-16 -1.110223e-16
## [1] 0 0 0
survival, rms, riskRegression, mstate, rbenchmark, profvis, mets
CausalInference, ClinicalTrials, Econometrics, Epidemiology, HighPerformanceComputing, ReproducibleResearch, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ozenne, et al., "riskRegression: Predicting the Risk of an Event using Cox Regression Models", The R Journal, 2017
BibTeX citation
@article{RJ-2017-062, author = {Ozenne, Brice and Sørensen, Anne Lyngholm and Scheike, Thomas and Torp-Pedersen, Christian and Gerds, Thomas Alexander}, title = {riskRegression: Predicting the Risk of an Event using Cox Regression Models}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {2}, issn = {2073-4859}, pages = {440-460} }